2024-12-31

Title: Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Authors: Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2412.19806
Pdf URL: https://arxiv.org/pdf/2412.19806
Copy Paste: [[2412.19806]] Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing(https://arxiv.org/abs/2412.19806)
Keywords: large language model
Abstract: Recent developments of vision large language models (LLMs) have seen remarkable progress, yet still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks. In this paper, we present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Building on top of an LLM backbone, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its frontend modules, while employing state-of-the-art visual specialists as its backend, via which VITRON supports a spectrum of vision end tasks, spanning visual comprehension to visual generation, from low level to high level. To ensure an effective and precise message passing from LLM to backend modules for function invocation, we propose a novel hybrid method by simultaneously integrating discrete textual instructions and continuous signal embeddings. Further, we design various pixel-level spatiotemporal vision-language alignment learning for VITRON to reach the best fine-grained visual capability. Finally, a cross-task synergy module is advised to learn to maximize the task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist.

Title: GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection

Authors: Xutao Liao, Shaohui Li, Yuhui Xu, Zhi Li, Yu Liu, You He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19820
Pdf URL: https://arxiv.org/pdf/2412.19820
Copy Paste: [[2412.19820]] GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection(https://arxiv.org/abs/2412.19820)
Keywords: large language model
Abstract: Recent low-rank training methods, such as GaLore, have significantly reduced the memory required to optimize large language models (LLMs). However, these methods often suffer from time-consuming low-rank projection estimations. In particular, the singular value decomposition (SVD) in GaLore can consume more than 80% of the total training time. To address this issue, we propose GaLore$+$, which uses cross-head low-rank projection to reduce the substantial time consumption in estimating low-rank projections for multi-head attention. In addition, we employ randomized subspace iteration to achieve fast SVD. To further enhance performance, we propose sparsely coded residuals to reduce the errors caused by low-rank approximation on the first- and second-order moments of the optimizers and weight updates. We evaluate GaLore$+$ on arithmetic reasoning and natural language generation datasets. Our experiments demonstrate that GaLore$+$ delivers superior performance while achieving approximately $4\times$ fine-tuning speed compared to vanilla GaLore.
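
Note: the randomized subspace iteration mentioned in the abstract is a standard way to approximate the leading singular subspace of a matrix without a full SVD. The sketch below is a generic NumPy illustration of that idea, not the authors' implementation; the function names, the iteration count, and the toy gradient matrix are assumptions made for the example.

```python
import numpy as np


def randomized_subspace_iteration(a, rank, n_iter=4, seed=0):
    """Return an orthonormal basis approximating the top-`rank` left singular subspace of `a`."""
    rng = np.random.default_rng(seed)
    # Start from a random sketch of the column space of `a`.
    omega = rng.standard_normal((a.shape[1], rank))
    q, _ = np.linalg.qr(a @ omega)
    # Alternate multiplications by a.T and a, re-orthonormalizing each time
    # so the basis rotates toward the dominant singular directions.
    for _ in range(n_iter):
        z, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ z)
    return q


def project_low_rank(grad, q):
    """Project a gradient matrix onto the subspace spanned by the columns of `q`."""
    return q.T @ grad


# Toy "gradient" that is exactly rank 4 (hypothetical data).
rng = np.random.default_rng(1)
grad = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
basis = randomized_subspace_iteration(grad, rank=4)
compact = project_low_rank(grad, basis)   # shape (4, 48)
restored = basis @ compact                # back to shape (64, 48)
```

Because the toy matrix has rank 4, projecting onto a rank-4 basis and mapping back should recover it up to floating-point error; for real gradients the projection is only an approximation.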

Title: Back To The Future: A Hybrid Transformer-XGBoost Model for Action-oriented Future-proofing Nowcasting

Authors: Ziheng Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19832
Pdf URL: https://arxiv.org/pdf/2412.19832
Copy Paste: [[2412.19832]] Back To The Future: A Hybrid Transformer-XGBoost Model for Action-oriented Future-proofing Nowcasting(https://arxiv.org/abs/2412.19832)
Keywords: interpretability, transformer
Abstract: Inspired by the iconic movie Back to the Future, this paper explores an innovative adaptive nowcasting approach that reimagines the relationship between present actions and future outcomes. In the movie, characters travel through time to manipulate past events, aiming to create a better future. Analogously, our framework employs predictive insights about the future to inform and adjust present conditions. This dual-stage model integrates the forecasting power of Transformers (future visionary) with the interpretability and efficiency of XGBoost (decision maker), enabling a seamless loop of future prediction and present adaptation. Through experimentation with meteorological datasets, we demonstrate the framework's advantage in achieving more accurate forecasting while guiding actionable interventions for real-time applications.

Title: Multi-atlas Ensemble Graph Neural Network Model For Major Depressive Disorder Detection Using Functional MRI Data

Authors: Nojod M. Alotaibi, Areej M. Alhothali, Manar S. Ali
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19833
Pdf URL: https://arxiv.org/pdf/2412.19833
Copy Paste: [[2412.19833]] Multi-atlas Ensemble Graph Neural Network Model For Major Depressive Disorder Detection Using Functional MRI Data(https://arxiv.org/abs/2412.19833)
Keywords: segmentation
Abstract: Major depressive disorder (MDD) is one of the most common mental disorders, with significant impacts on many daily activities and quality of life. It stands as one of the most common mental disorders globally and ranks as the second leading cause of disability. The current diagnostic approach for MDD primarily relies on clinical observations and patient-reported symptoms, overlooking the diverse underlying causes and pathophysiological factors contributing to depression. Therefore, scientific researchers and clinicians must gain a deeper understanding of the pathophysiological mechanisms involved in MDD. There is growing evidence in neuroscience that depression is a brain network disorder, and the use of neuroimaging, such as magnetic resonance imaging (MRI), plays a significant role in identifying and treating MDD. Resting-state functional MRI (rs-fMRI) is among the most popular neuroimaging techniques used to study MDD. Deep learning techniques have been widely applied to neuroimaging data to help with early mental health disorder detection. Recent years have seen a rise in interest in graph neural networks (GNNs), which are deep neural architectures specifically designed to handle graph-structured data like rs-fMRI. This research aimed to develop an ensemble-based GNN model capable of detecting discriminative features from rs-fMRI images for the purpose of diagnosing MDD. Specifically, we constructed an ensemble model by combining features from multiple brain region segmentation atlases to capture brain complexity and detect distinct features more accurately than single atlas-based models. Further, the effectiveness of our model is demonstrated by assessing its performance on a large multi-site MDD dataset. The best performing model among all folds achieved an accuracy of 75.80%, a sensitivity of 88.89%, a specificity of 61.84%, a precision of 71.29%, and an F1-score of 79.12%.
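
Note: the reported F1-score can be cross-checked against the reported precision and sensitivity, since F1 is the harmonic mean of precision and recall (sensitivity). The short check below uses only the figures quoted in the abstract; the function name is an illustrative choice.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, expressed as fractions.
precision = 0.7129
sensitivity = 0.8889  # sensitivity is another name for recall

f1 = f1_from_precision_recall(precision, sensitivity)
print(f"F1 = {100 * f1:.2f}%")
```

The result rounds to the 79.12% stated in the abstract, so the three reported figures are mutually consistent.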

Title: RoboSignature: Robust Signature and Watermarking on Network Attacks

Authors: Aryaman Shaan, Garvit Banga, Raghav Mantri
Subjects: cs.CR, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19834
Pdf URL: https://arxiv.org/pdf/2412.19834
Copy Paste: [[2412.19834]] RoboSignature: Robust Signature and Watermarking on Network Attacks(https://arxiv.org/abs/2412.19834)
Keywords: attack, robust, watermark, diffusion, generative, large language model
Abstract: Generative models have enabled easy creation and generation of images of all kinds given a single prompt. However, this has also raised ethical concerns about what is an actual piece of content created by humans or cameras compared to model-generated content like images or videos. Watermarking data generated by modern generative models is a popular method to provide information on the source of the content. The goal is for all generated images to conceal an invisible watermark, allowing for future detection or identification. The Stable Signature finetunes the decoder of Latent Diffusion Models such that a unique watermark is rooted in any image produced by the decoder. In this paper, we present a novel adversarial fine-tuning attack that disrupts the model's ability to embed the intended watermark, exposing a significant vulnerability in existing watermarking methods. To address this, we further propose a tamper-resistant fine-tuning algorithm inspired by methods developed for large language models, tailored to the specific requirements of watermarking in LDMs. Our findings emphasize the importance of anticipating and defending against potential vulnerabilities in generative systems.

Title: Data Poisoning Attacks to Local Differential Privacy Protocols for Graphs

Authors: Xi He, Kai Huang, Qingqing Ye, Haibo Hu
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2412.19837
Pdf URL: https://arxiv.org/pdf/2412.19837
Copy Paste: [[2412.19837]] Data Poisoning Attacks to Local Differential Privacy Protocols for Graphs(https://arxiv.org/abs/2412.19837)
Keywords: security, privacy, defense, attack
Abstract: Graph analysis has become increasingly popular with the prevalence of big data and machine learning. Traditional graph data analysis methods often assume the existence of a trusted third party to collect and store the graph data, which does not align with real-world situations. To address this, some research has proposed utilizing Local Differential Privacy (LDP) to collect graph data or graph metrics (e.g., clustering coefficient). This line of research focuses on collecting two atomic graph metrics (the adjacency bit vectors and node degrees) from each node locally under LDP to synthesize an entire graph or generate graph metrics. However, they have not considered the security issues of LDP for graphs. In this paper, we bridge the gap by demonstrating that an attacker can inject fake users into LDP protocols for graphs and design data poisoning attacks to degrade the quality of graph metrics. In particular, we present three data poisoning attacks to LDP protocols for graphs. As a proof of concept, we focus on data poisoning attacks on two classical graph metrics: degree centrality and clustering coefficient. We further design two countermeasures for these data poisoning attacks. Experimental study on real-world datasets demonstrates that our attacks can largely degrade the quality of collected graph metrics, and the proposed countermeasures cannot effectively offset the effect, which calls for the development of new defenses.

Title: Multi-View Fusion Neural Network for Traffic Demand Prediction

Authors: Dongran Zhang, Jun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19839
Pdf URL: https://arxiv.org/pdf/2412.19839
Copy Paste: [[2412.19839]] Multi-View Fusion Neural Network for Traffic Demand Prediction(https://arxiv.org/abs/2412.19839)
Keywords: extraction
Abstract: The extraction of spatial-temporal features is a crucial research topic in transportation studies, and current studies typically use a unified temporal modeling mechanism and fixed spatial graph for this purpose. However, the fixed spatial graph restricts the extraction of spatial features for similar but not directly connected nodes, while the unified temporal modeling mechanism overlooks the heterogeneity of temporal variation of different nodes. To address these challenges, a multi-view fusion neural network (MVFN) approach is proposed. In this approach, spatial local features are extracted through the use of a graph convolutional network (GCN), and spatial global features are extracted using a cosine re-weighting linear attention mechanism (CLA). The GCN and CLA are combined to create a graph-cosine module (GCM) for the extraction of overall spatial features. Additionally, the multi-channel separable temporal convolutional network (MSTCN) makes use of a multi-channel temporal convolutional network (MTCN) at each layer to extract unified temporal features, and a separable temporal convolutional network (STCN) to extract independent temporal features. Finally, the spatial-temporal feature data is input into the prediction layer to obtain the final result. The model has been validated on two traffic demand datasets and achieved the best prediction accuracy.

Title: ERPA: Efficient RPA Model Integrating OCR and LLMs for Intelligent Document Processing

Authors: Osama Abdellaif, Abdelrahman Nader, Ali Hamdi
Subjects: cs.CV, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2412.19840
Pdf URL: https://arxiv.org/pdf/2412.19840
Copy Paste: [[2412.19840]] ERPA: Efficient RPA Model Integrating OCR and LLMs for Intelligent Document Processing(https://arxiv.org/abs/2412.19840)
Keywords: extraction, large language model
Abstract: This paper presents ERPA, an innovative Robotic Process Automation (RPA) model designed to enhance ID data extraction and optimize Optical Character Recognition (OCR) tasks within immigration workflows. Traditional RPA solutions often face performance limitations when processing large volumes of documents, leading to inefficiencies. ERPA addresses these challenges by incorporating Large Language Models (LLMs) to improve the accuracy and clarity of extracted text, effectively handling ambiguous characters and complex structures. Benchmark comparisons with leading platforms like UiPath and Automation Anywhere demonstrate that ERPA significantly reduces processing times by up to 94 percent, completing ID data extraction in just 9.94 seconds. These findings highlight ERPA's potential to revolutionize document automation, offering a faster and more reliable alternative to current RPA solutions.

Title: Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network

Authors: Dongran Zhang, Jiangnan Yan, Kemal Polat, Adi Alhudhaif, Jun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19842
Pdf URL: https://arxiv.org/pdf/2412.19842
Copy Paste: [[2412.19842]] Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network(https://arxiv.org/abs/2412.19842)
Keywords: extraction
Abstract: Traffic flow prediction plays a crucial role in the management and operation of urban transportation systems. While extensive research has been conducted on predictions for individual transportation modes, there is relatively limited research on joint prediction across different transportation modes. Furthermore, existing multimodal traffic joint modeling methods often lack flexibility in spatial-temporal feature extraction. To address these issues, we propose a method called Graph Sparse Attention Mechanism with Bidirectional Temporal Convolutional Network (GSABT) for multimodal traffic spatial-temporal joint prediction. First, we use a multimodal graph multiplied by self-attention weights to capture spatial local features, and then employ the Top-U sparse attention mechanism to obtain spatial global features. Second, we utilize a bidirectional temporal convolutional network to enhance the temporal feature correlation between the output and input data, and extract inter-modal and intra-modal temporal features through the share-unique module. Finally, we have designed a multimodal joint prediction framework that can be flexibly extended to both spatial and temporal dimensions. Extensive experiments conducted on three real datasets indicate that the proposed model consistently achieves state-of-the-art predictive performance.

Title: A Review of Latent Representation Models in Neuroimaging

Authors: C. Vázquez-García, F. J. Martínez-Murcia, F. Segovia Román, Juan M. Górriz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19844
Pdf URL: https://arxiv.org/pdf/2412.19844
Copy Paste: [[2412.19844]] A Review of Latent Representation Models in Neuroimaging(https://arxiv.org/abs/2412.19844)
Keywords: diffusion, generative
Abstract: Neuroimaging data, particularly from techniques like MRI or PET, offer rich but complex information about brain structure and activity. To manage this complexity, latent representation models - such as Autoencoders, Generative Adversarial Networks (GANs), and Latent Diffusion Models (LDMs) - are increasingly applied. These models are designed to reduce high-dimensional neuroimaging data to lower-dimensional latent spaces, where key patterns and variations related to brain function can be identified. By modeling these latent spaces, researchers hope to gain insights into the biology and function of the brain, including how its structure changes with age or disease, or how it encodes sensory information, predicts and adapts to new inputs. This review discusses how these models are used for clinical applications, like disease diagnosis and progression monitoring, but also for exploring fundamental brain mechanisms such as active inference and predictive coding. These approaches provide a powerful tool for both understanding and simulating the brain's complex computational tasks, potentially advancing our knowledge of cognition, perception, and neural disorders.

Title: Symbolic Disentangled Representations for Images

Authors: Alexandr Korchemnyi, Alexey K. Kovalev, Aleksandr I. Panov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19847
Pdf URL: https://arxiv.org/pdf/2412.19847
Copy Paste: [[2412.19847]] Symbolic Disentangled Representations for Images(https://arxiv.org/abs/2412.19847)
Keywords: generative
Abstract: The idea of disentangled representations is to reduce the data to a set of generative factors that produce it. Typically, such representations are vectors in latent space, where each coordinate corresponds to one of the generative factors. The object can then be modified by changing the value of a particular coordinate, but it is necessary to determine which coordinate corresponds to the desired generative factor -- a difficult task if the vector representation has a high dimension. In this article, we propose ArSyD (Architecture for Symbolic Disentanglement), which represents each generative factor as a vector of the same dimension as the resulting representation. In ArSyD, the object representation is obtained as a superposition of the generative factor vector representations. We call such a representation a symbolic disentangled representation. We use the principles of Hyperdimensional Computing (also known as Vector Symbolic Architectures), where symbols are represented as hypervectors, allowing vector operations on them. Disentanglement is achieved by construction, no additional assumptions about the underlying distributions are made during training, and the model is only trained to reconstruct images in a weakly supervised manner. We study ArSyD on the dSprites and CLEVR datasets and provide a comprehensive analysis of the learned symbolic disentangled representations. We also propose new disentanglement metrics that allow comparison of methods using latent representations of different dimensions. ArSyD allows object properties to be edited in a controlled and interpretable way, and the dimensionality of the object property representation coincides with the dimensionality of the object representation itself.
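
Note: the idea of representing an object as a superposition of hypervectors can be illustrated with a toy Hyperdimensional Computing example. The sketch below is not ArSyD itself: it assumes random bipolar hypervectors, element-wise multiplication for binding a factor to its value, and summation for superposition, which are common Vector Symbolic Architecture choices. All names and factor values are hypothetical.

```python
import numpy as np

DIM = 4096  # hypervector dimensionality (assumed for the example)
rng = np.random.default_rng(0)


def random_hypervector():
    """Random bipolar hypervector with entries in {-1, +1}."""
    return rng.choice([-1.0, 1.0], size=DIM)


# Hypothetical generative factors and their possible values.
factors = {name: random_hypervector() for name in ("shape", "color")}
values = {name: random_hypervector() for name in ("square", "circle", "red", "blue")}


def encode(obj):
    """Superpose (sum) the factor-value bindings into one object hypervector."""
    return sum(factors[f] * values[v] for f, v in obj.items())


def decode(hv, factor):
    """Unbind one factor and return the most similar known value."""
    probe = hv * factors[factor]  # bipolar binding is its own inverse
    return max(values, key=lambda v: float(probe @ values[v]))


def edit(hv, factor, old_value, new_value):
    """Swap one factor's value while leaving the other factors untouched."""
    return hv - factors[factor] * values[old_value] + factors[factor] * values[new_value]


scene = encode({"shape": "circle", "color": "red"})
edited = edit(scene, "color", "red", "blue")
print(decode(scene, "shape"), decode(scene, "color"))
print(decode(edited, "shape"), decode(edited, "color"))
```

The object hypervector has the same dimension as each factor hypervector, and a single property can be changed without locating a specific coordinate, which mirrors the property the abstract highlights.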

Title: Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction

Authors: Dapeng Zhao, Yue Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19848
Pdf URL: https://arxiv.org/pdf/2412.19848
Copy Paste: [[2412.19848]] Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction(https://arxiv.org/abs/2412.19848)
Keywords: robust, generative
Abstract: Single-view 3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the input is an unobstructed face, which makes them unsuitable for in-the-wild conditions. We present a method for 3D face reconstruction that removes eyeglasses from a single image. Existing facial reconstruction methods fail to remove eyeglasses automatically for generating a photo-realistic 3D face "in-the-wild". The innovation of our method lies in a process for identifying the eyeglasses area robustly and removing it intelligently. In this work, we estimate the 2D face structure of the reasonable position of the eyeglasses area, which is used for the construction of 3D texture. An excellent anti-eyeglasses face reconstruction method should ensure the authenticity of the output, including the topological structure between the eyes, nose, and mouth. We achieve this via a deep learning architecture that performs direct regression of a 3DMM representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related face parsing task can be incorporated into the proposed framework and help improve reconstruction quality. We conduct extensive experiments on existing 3D face reconstruction tasks as concrete examples to demonstrate the method's superior regulation ability in cases where existing methods often break down.

Title: Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation

Authors: Nadav Z. Cohen, Oron Nir, Ariel Shamir
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19853
Pdf URL: https://arxiv.org/pdf/2412.19853
Copy Paste: [[2412.19853]] Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation(https://arxiv.org/abs/2412.19853)
Keywords: diffusion
Abstract: Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.

Title: Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales

Authors: Shuokai Pan, Gerti Tuzi, Sudarshan Sreeram, Dibakar Gope
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19867
Pdf URL: https://arxiv.org/pdf/2412.19867
Copy Paste: [[2412.19867]] Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales(https://arxiv.org/abs/2412.19867)
Keywords: diffusion, data-free
Abstract: Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training data, the generalization performance of quantized diffusion models is safely guaranteed. For the text-to-image generation task, the 8-bit fully-quantized diffusion model with Winograd provides near-lossless quality (FID and CLIP scores) in comparison to the full-precision model. For image classification, our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and 2.56% in top-1 ImageNet accuracy on ResNet-18 and ResNet-34, respectively, with Winograd F(6, 3).
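
Note: the motivation for finer-grained group-wise quantization can be seen on a toy tensor whose value range varies strongly between groups. The sketch below assumes plain symmetric 8-bit quantization with one scale per tensor versus one scale per group; it does not model the Winograd transforms or the learnable scales described in the abstract, and the function names and data are hypothetical.

```python
import numpy as np


def quantize_dequantize(x, n_bits=8):
    """Symmetric uniform quantization with a single scale, followed by dequantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def groupwise_quantize_dequantize(x, group_size, n_bits=8):
    """Same scheme, but each contiguous group of `group_size` values gets its own scale."""
    groups = x.reshape(-1, group_size)
    return np.concatenate([quantize_dequantize(g, n_bits) for g in groups])


# Toy 1-D tensor: three groups whose magnitudes differ by orders of magnitude.
rng = np.random.default_rng(0)
x = np.concatenate([
    0.01 * rng.standard_normal(64),
    1.0 * rng.standard_normal(64),
    100.0 * rng.standard_normal(64),
])

per_tensor_error = np.mean(np.abs(x - quantize_dequantize(x)))
group_wise_error = np.mean(np.abs(x - groupwise_quantize_dequantize(x, group_size=64)))
print(per_tensor_error, group_wise_error)
```

With one scale for the whole tensor, the small-magnitude groups are rounded away almost entirely; per-group scales preserve them, which is why the group-wise error is lower here.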

Title: Neighbor Does Matter: Density-Aware Contrastive Learning for Medical Semi-supervised Segmentation

Authors: Feilong Tang, Zhongxing Xu, Ming Hu, Wenxue Li, Peng Xia, Yiheng Zhong, Hanjun Wu, Jionglong Su, Zongyuan Ge
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19871
Pdf URL: https://arxiv.org/pdf/2412.19871
Copy Paste: [[2412.19871]] Neighbor Does Matter: Density-Aware Contrastive Learning for Medical Semi-supervised Segmentation(https://arxiv.org/abs/2412.19871)
Keywords: segmentation
Abstract: In medical image analysis, multi-organ semi-supervised segmentation faces challenges such as insufficient labels and low contrast in soft tissues. To address these issues, existing studies typically employ semi-supervised segmentation techniques using pseudo-labeling and consistency regularization. However, these methods mainly rely on individual data samples for training, ignoring the rich neighborhood information present in the feature space. In this work, we argue that supervisory information can be directly extracted from the geometry of the feature space. Inspired by the density-based clustering hypothesis, we propose using feature density to locate sparse regions within feature clusters. Our goal is to increase intra-class compactness by addressing sparsity issues. To achieve this, we propose a Density-Aware Contrastive Learning (DACL) strategy, pushing anchored features in sparse regions towards cluster centers approximated by high-density positive samples, resulting in more compact clusters. Specifically, our method constructs density-aware neighbor graphs using labeled and unlabeled data samples to estimate feature density and locate sparse regions. We also combine label-guided co-training with density-guided geometric regularization to form complementary supervision for unlabeled data. Experiments on the Multi-Organ Segmentation Challenge dataset demonstrate that our proposed method outperforms state-of-the-art methods, highlighting its efficacy in medical image segmentation tasks.

Title: Minimax-Optimal Multi-Agent Robust Reinforcement Learning

Authors: Yuchen Jiao, Gen Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.19873
Pdf URL: https://arxiv.org/pdf/2412.19873
Copy Paste: [[2412.19873]] Minimax-Optimal Multi-Agent Robust Reinforcement Learning(https://arxiv.org/abs/2412.19873)
Keywords: robust, generative
Abstract: Multi-agent robust reinforcement learning, also known as multi-player robust Markov games (RMGs), is a crucial framework for modeling competitive interactions under environmental uncertainties, with wide applications in multi-agent systems. However, existing results on sample complexity in RMGs suffer from at least one of three obstacles: restrictive range of uncertainty level or accuracy, the curse of multiple agents, and the barrier of long horizons, all of which cause existing results to significantly exceed the information-theoretic lower bound. To close this gap, we extend the Q-FTRL algorithm \citep{li2022minimax} to RMGs in the finite-horizon setting, assuming access to a generative model. We prove that the proposed algorithm achieves an $\varepsilon$-robust coarse correlated equilibrium (CCE) with a sample complexity (up to log factors) of $\widetilde{O}\left(H^3S\sum_{i=1}^mA_i\min\left\{H,1/R\right\}/\varepsilon^2\right)$, where $S$ denotes the number of states, $A_i$ is the number of actions of the $i$-th agent, $H$ is the finite horizon length, and $R$ is the uncertainty level. We also show that this sample complexity is minimax optimal by combining it with an information-theoretic lower bound. Additionally, in the special case of two-player zero-sum RMGs, the algorithm achieves an $\varepsilon$-robust Nash equilibrium (NE) with the same sample complexity.
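
Note: to get a feel for how the stated bound scales, the expression inside the $\widetilde{O}(\cdot)$ can be evaluated directly. The sketch below drops constants and log factors, so the numbers indicate scaling only, not actual sample counts; the function name and the example parameter values are assumptions.

```python
import math


def rmg_sample_bound(horizon, n_states, actions_per_agent, uncertainty, eps):
    """Evaluate H^3 * S * sum_i A_i * min(H, 1/R) / eps^2 (constants and logs omitted)."""
    return (
        horizon ** 3
        * n_states
        * sum(actions_per_agent)
        * min(horizon, 1.0 / uncertainty)
        / eps ** 2
    )


# Hypothetical game: H = 10, S = 5, two agents with 2 and 3 actions, eps = 0.1.
small_r = rmg_sample_bound(10, 5, [2, 3], uncertainty=0.01, eps=0.1)  # min(H, 1/R) = H
large_r = rmg_sample_bound(10, 5, [2, 3], uncertainty=0.5, eps=0.1)   # min(H, 1/R) = 1/R
print(small_r, large_r, small_r / large_r)
```

The action counts enter through their sum rather than their product, and a larger uncertainty level $R$ reduces the bound once $1/R$ falls below $H$.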

Title: YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO

Authors: Taoran Yue, Xiaojin Lu, Jiaxi Cai, Yuanping Chen, Shibing Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19878
Pdf URL: https://arxiv.org/pdf/2412.19878
Copy Paste: [[2412.19878]] YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO(https://arxiv.org/abs/2412.19878)
Keywords: robust
Abstract: With the advancement of aerospace technology and the increasing demands of military applications, the development of low false-alarm and high-precision infrared small target detection algorithms has emerged as a key focus of research globally. However, the traditional model-driven method is not robust enough when dealing with features such as noise, target size, and contrast. The existing deep-learning methods have limited ability to extract and fuse key features, and it is difficult to achieve high-precision detection in complex backgrounds and when target features are not obvious. To solve these problems, this paper proposes a deep-learning infrared small target detection method that combines image super-resolution technology with multi-scale observation. First, the input infrared images are preprocessed with super-resolution and multiple data enhancements are performed. Second, based on the YOLOv5 model, we propose a new deep-learning network named YOLO-MST. This network includes replacing the SPPF module with the self-designed MSFA module in the backbone, optimizing the neck, and finally adding a multi-scale dynamic detection head to the prediction head. By dynamically fusing features from different scales, the detection head can better adapt to complex scenes. The mAP@0.5 detection rates of this method on two public datasets, SIRST and IRIS, reached 96.4% and 99.5% respectively, more effectively solving the problems of missed detection, false alarms, and low precision.

Title: Leveraging Scene Geometry and Depth Information for Robust Image Deraining

Authors: Ningning Xu, Jidong J. Yang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.19913
Pdf URL: https://arxiv.org/pdf/2412.19913
Copy Paste: [[2412.19913]] Leveraging Scene Geometry and Depth Information for Robust Image Deraining(https://arxiv.org/abs/2412.19913)
Keywords: robust
Abstract: Image deraining holds great potential for enhancing the vision of autonomous vehicles in rainy conditions, contributing to safer driving. Previous works have primarily focused on employing a single network architecture to generate derained images. However, they often fail to fully exploit the rich prior knowledge embedded in the scenes. Particularly, most methods overlook the depth information that can provide valuable context about scene geometry and guide more robust deraining. In this work, we introduce a novel learning framework that integrates multiple networks: an AutoEncoder for deraining, an auxiliary network to incorporate depth information, and two supervision networks to enforce feature consistency between rainy and clear scenes. This multi-network design enables our model to effectively capture the underlying scene structure, producing clearer and more accurately derained images, leading to improved object detection for autonomous vehicles. Extensive experiments on three widely-used datasets demonstrated the effectiveness of our proposed method.

Title: Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts

Authors: Enze Xie, Jiaho Lyu, Daiqing Wu, Huawen Shen, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19917
Pdf URL: https://arxiv.org/pdf/2412.19917
Copy Paste: [[2412.19917]] Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts(https://arxiv.org/abs/2412.19917)
Keywords: segmentation
Abstract: The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance. Word-level bounding boxes are too coarse as prompts for characters, while character-level bounding boxes suffer from over-segmentation and under-segmentation issues. In this paper, we propose an automatic annotation pipeline named Char-SAM that turns SAM into a low-cost segmentation annotator with a character-level visual prompt. Specifically, leveraging some existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module. Next, we employ glyph information corresponding to text character categories as a new prompt in the Character Glyph Refinement (CGR) module to guide SAM in producing more accurate segmentation masks, addressing issues of over-segmentation and under-segmentation. These modules fully utilize the bbox-to-mask capability of SAM to generate high-quality text segmentation annotations automatically. Extensive experiments on TextSeg validate the effectiveness of Char-SAM. Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.

Title: Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models

Authors: Mateusz Michalkiewicz, Sheena Bai, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19920
Pdf URL: https://arxiv.org/pdf/2412.19920
Copy Paste: [[2412.19920]] Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models(https://arxiv.org/abs/2412.19920)
Keywords: robust
Abstract: In this paper, we analyze the viewpoint stability of foundational models - specifically, their sensitivity to changes in viewpoint - and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundational models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints where specific camera orientations obscure an object's true 3D structure. Our methodology enables recognizing and classifying out-of-distribution (OOD), accidental, and stable viewpoints using feature representations alone, without accessing the actual images. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of OOD viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks - classification, VQA, and 3D reconstruction - we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.

Title: HADES: Hardware Accelerated Decoding for Efficient Speculation in Large Language Models

Authors: Ze Yang, Yihong Jin, Xinhe Xu
Subjects: cs.CL, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2412.19925
Pdf URL: https://arxiv.org/pdf/2412.19925
Copy Paste: [[2412.19925]] HADES: Hardware Accelerated Decoding for Efficient Speculation in Large Language Models(https://arxiv.org/abs/2412.19925)
Keywords: large language model
Abstract: Large Language Models (LLMs) have revolutionized natural language processing by understanding and generating human-like text. However, the increasing demand for more sophisticated LLMs presents significant computational challenges due to their scale and complexity. This paper introduces Hardware Accelerated Decoding (HADES), a novel approach to enhance the performance and energy efficiency of LLMs. We address the design of an LLM accelerator with hardware-level speculative decoding support, a concept not previously explored in existing literature. Our work demonstrates how speculative decoding can significantly improve the efficiency of LLM operations, paving the way for more advanced and practical applications of these models.

Title: Assessing Text Classification Methods for Cyberbullying Detection on Social Media Platforms

Authors: Adamu Gaston Philipo, Doreen Sebastian Sarwatt, Jianguo Ding, Mahmoud Daneshmand, Huansheng Ning
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2412.19928
Pdf URL: https://arxiv.org/pdf/2412.19928
Copy Paste: [[2412.19928]] Assessing Text Classification Methods for Cyberbullying Detection on Social Media Platforms(https://arxiv.org/abs/2412.19928)
Keywords: generative, large language model
Abstract: Cyberbullying significantly contributes to mental health issues in communities by negatively impacting the psychology of victims. It is a prevalent problem on social media platforms, necessitating effective, real-time detection and monitoring systems to identify harmful messages. However, current cyberbullying detection systems face challenges related to performance, dataset quality, time efficiency, and computational costs. This research aims to conduct a comparative study by adapting and evaluating existing text classification techniques within the cyberbullying detection domain. The study specifically evaluates the effectiveness and performance of these techniques in identifying cyberbullying instances on social media platforms. It focuses on leveraging and assessing large language models, including BERT, RoBERTa, XLNet, DistilBERT, and GPT-2.0, for their suitability in this domain. The results show that BERT strikes a balance between performance, time efficiency, and computational resources: Accuracy of 95%, Precision of 95%, Recall of 95%, F1 Score of 95%, Error Rate of 5%, Inference Time of 0.053 seconds, RAM Usage of 35.28 MB, CPU/GPU Usage of 0.4%, and Energy Consumption of 0.000263 kWh. The findings demonstrate that generative AI models, while powerful, do not consistently outperform fine-tuned models on the tested benchmarks. However, state-of-the-art performance can still be achieved through strategic adaptation and fine-tuning of existing models for specific datasets and tasks.
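
The classification figures quoted above (accuracy, precision, recall, F1 score, error rate) all follow from the four counts of a binary confusion matrix. The sketch below shows the standard definitions only; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. The counts used in the example below are made up.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,  # error rate is the complement of accuracy
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formulas.
    print(classification_metrics(tp=95, fp=5, fn=5, tn=95))
```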

Title: Outfox: a Packet Format for a Layered Mixnet

Authors: Alfredo Rial, Ania M. Piotrowska
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.19937
Pdf URL: https://arxiv.org/pdf/2412.19937
Copy Paste: [[2412.19937]] Outfox: a Packet Format for a Layered Mixnet(https://arxiv.org/abs/2412.19937)
Keywords: security
Abstract: We propose Outfox, a packet format based on layered encryption that is suitable for mixnets in which all paths have the same length and where all mix nodes are associated with a single layer. Outfox is a variant of the packet format Sphinx that removes unnecessary padding and optimizes the computation cost of packet processing by halving the number of public key operations performed by mix nodes. Outfox uses a KEM scheme as a building block and is quantum-safe when instantiated with a quantum-safe KEM scheme. To analyze the security of Outfox, we describe an ideal functionality for a layered replyable mixnet that requires reply-request indistinguishability, and a construction based on Outfox that realizes our ideal functionality.

Title: Standard-Deviation-Inspired Regularization for Improving Adversarial Robustness

Authors: Olukorede Fakorede, Modeste Atsague, Jin Tian
Subjects: cs.LG, cs.AI, cs.CR, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.19947
Pdf URL: https://arxiv.org/pdf/2412.19947
Copy Paste: [[2412.19947]] Standard-Deviation-Inspired Regularization for Improving Adversarial Robustness(https://arxiv.org/abs/2412.19947)
Keywords: attack, robust
Abstract: Adversarial Training (AT) has been demonstrated to improve the robustness of deep neural networks (DNNs) against adversarial attacks. AT is a min-max optimization procedure wherein adversarial examples are generated to train a more robust DNN. The inner maximization step of AT increases the losses of inputs with respect to their actual classes. The outer minimization involves minimizing the losses on the adversarial examples obtained from the inner maximization. This work proposes a standard-deviation-inspired (SDI) regularization term to improve adversarial robustness and generalization. We argue that the inner maximization in AT is similar to minimizing a modified standard deviation of the model's output probabilities. Moreover, we suggest that maximizing this modified standard deviation can complement the outer minimization of the AT framework. To support our argument, we experimentally show that the SDI measure can be used to craft adversarial examples. Additionally, we demonstrate that combining the SDI regularization term with existing AT variants enhances the robustness of DNNs against stronger attacks, such as CW and Auto-attack, and improves generalization.
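
For orientation, the min-max procedure described above is usually written in the following generic form. This is the standard adversarial training objective, not the paper's SDI term, and the symbols (model $f_\theta$, loss $\mathcal{L}$, data distribution $\mathcal{D}$, perturbation $\delta$ with $\ell_p$ budget $\epsilon$) are notation assumed here rather than taken from the abstract.

```latex
\min_{\theta} \;
\mathbb{E}_{(x, y) \sim \mathcal{D}}
\Big[
  \max_{\|\delta\|_{p} \le \epsilon}
  \mathcal{L}\big(f_{\theta}(x + \delta),\, y\big)
\Big]
```

The inner maximization searches for the perturbation that most increases the loss on the true class $y$; the outer minimization then updates $\theta$ on those perturbed inputs.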

Title: ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers

Authors: Chao Fan, Qipei Mei, Xiaonan Wang, Xinming Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19954
Pdf URL: https://arxiv.org/pdf/2412.19954
Copy Paste: [[2412.19954]] ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers(https://arxiv.org/abs/2412.19954)
Keywords: generative
Abstract: In the construction sector, workers often endure prolonged periods of high-intensity physical work and prolonged use of tools, resulting in injuries and illnesses primarily linked to postural ergonomic risks, a longstanding predominant health concern. To mitigate these risks, researchers have applied various technological methods to identify the ergonomic risks that construction workers face. However, traditional ergonomic risk assessment (ERA) techniques do not offer interactive feedback. The rapidly developing vision-language models (VLMs), capable of generating textual descriptions or answering questions about ergonomic risks based on image inputs, have not yet received widespread attention. This research introduces an interactive visual query system tailored to assess the postural ergonomic risks of construction workers. The system's capabilities include visual question answering (VQA), which responds to visual queries regarding workers' exposure to postural ergonomic risks, and image captioning (IC), which generates textual descriptions of these risks from images. Additionally, this study proposes a dataset designed for training and testing such methodologies. Systematic testing indicates that the VQA functionality delivers an accuracy of 96.5%. Moreover, evaluations using nine metrics for IC and assessments from human experts indicate that the proposed approach surpasses the performance of a method using the same architecture trained solely on generic datasets. This study sets a new direction for future developments in interactive ERA using generative artificial intelligence (AI) technologies.

Title: DepthMamba with Adaptive Fusion

Authors: Zelin Meng, Zhichen Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19964
Pdf URL: https://arxiv.org/pdf/2412.19964
Copy Paste: [[2412.19964]] DepthMamba with Adaptive Fusion(https://arxiv.org/abs/2412.19964)
Keywords: robust, extraction
Abstract: Multi-view depth estimation has achieved impressive performance over various benchmarks. However, almost all current multi-view systems rely on given ideal camera poses, which are unavailable in many real-world scenarios, such as autonomous driving. In this work, we propose a new robustness benchmark to evaluate the depth estimation system under various noisy pose settings. Surprisingly, we find that current multi-view depth estimation methods and single-view and multi-view fusion methods fail when given noisy pose settings. To tackle this challenge, we propose a two-branch network architecture which fuses the depth estimation results of the single-view and multi-view branches. Specifically, we introduce Mamba to serve as the feature extraction backbone and propose an attention-based fusion method which adaptively selects the most robust estimation results between the two branches. Thus, the proposed method can perform well on some challenging scenes including dynamic objects, texture-less regions, etc. Ablation studies prove the effectiveness of the backbone and fusion method, while evaluation experiments on challenging benchmarks (KITTI and DDAD) show that the proposed method achieves a competitive performance compared to the state-of-the-art methods.

Title: Bridging Context Gaps: Enhancing Comprehension in Long-Form Social Conversations Through Contextualized Excerpts

Authors: Shrestha Mohanty, Sarah Xuan, Jacob Jobraeel, Anurag Kumar, Deb Roy, Jad Kabbara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19966
Pdf URL: https://arxiv.org/pdf/2412.19966
Copy Paste: [[2412.19966]] Bridging Context Gaps: Enhancing Comprehension in Long-Form Social Conversations Through Contextualized Excerpts(https://arxiv.org/abs/2412.19966)
Keywords: large language model
Abstract: We focus on enhancing comprehension in small-group recorded conversations, which serve as a medium to bring people together and provide a space for sharing personal stories and experiences on crucial social matters. One way to parse and convey information from these conversations is by sharing highlighted excerpts in subsequent conversations. This can help promote a collective understanding of relevant issues, by highlighting perspectives and experiences to other groups of people who might otherwise be unfamiliar with and thus unable to relate to these experiences. The primary challenge that arises then is that excerpts taken from one conversation and shared in another setting might be missing crucial context or key elements that were previously introduced in the original conversation. This problem is exacerbated when conversations become lengthier and richer in themes and shared experiences. To address this, we explore how Large Language Models (LLMs) can enrich these excerpts by providing socially relevant context. We present approaches for effective contextualization to improve comprehension, readability, and empathy. We show significant improvements in understanding, as assessed through subjective and objective evaluations. While LLMs can offer valuable context, they struggle with capturing key social aspects. We release the Human-annotated Salient Excerpts (HSE) dataset to support future work. Additionally, we show how context-enriched excerpts can provide more focused and comprehensive conversation summaries.

Title: MobileNetV2: A lightweight classification model for home-based sleep apnea screening

Authors: Hui Pan, Yanxuan Yu, Jilun Ye, Xu Zhang
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2412.19967
Pdf URL: https://arxiv.org/pdf/2412.19967
Copy Paste: [[2412.19967]] MobileNetV2: A lightweight classification model for home-based sleep apnea screening(https://arxiv.org/abs/2412.19967)
Keywords: robust
Abstract: This study proposes a novel lightweight neural network model leveraging features extracted from electrocardiogram (ECG) and respiratory signals for early obstructive sleep apnea (OSA) screening. ECG signals are used to generate feature spectrograms to predict sleep stages, while respiratory signals are employed to detect sleep-related breathing abnormalities. By integrating these predictions, the method calculates the apnea-hypopnea index (AHI) with enhanced accuracy, facilitating precise OSA diagnosis. The method was validated on three publicly available sleep apnea databases: the Apnea-ECG database, the UCDDB dataset, and the MIT-BIH Polysomnographic database. Results showed an overall OSA detection accuracy of 0.978, highlighting the model's robustness. Respiratory event classification achieved an accuracy of 0.969 and an area under the receiver operating characteristic curve (ROC-AUC) of 0.98. For sleep stage classification on the UCDDB dataset, the ROC-AUC exceeded 0.85 across all stages, with recall for Sleep reaching 0.906 and specificity for REM and Wake states at 0.956 and 0.937, respectively. This study underscores the potential of integrating lightweight neural networks with multi-signal analysis for accurate, portable, and cost-effective OSA screening, paving the way for broader adoption in home-based and wearable health monitoring systems.
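
The apnea-hypopnea index mentioned above is conventionally the number of apnea and hypopnea events per hour of sleep, which is why the method needs both sleep-stage predictions (to estimate sleep time) and respiratory event detections (to count events). The sketch below illustrates that arithmetic only. The 30-second epoch length, the function names, and the severity cut-offs (commonly cited adult thresholds of 5, 15, and 30 events per hour) are assumptions for illustration and are not stated in the abstract.

```python
def apnea_hypopnea_index(event_count: int, sleep_epochs: int,
                         epoch_seconds: float = 30.0) -> float:
    """AHI = respiratory events per hour of sleep.

    event_count  : detected apnea + hypopnea events (hypothetical input)
    sleep_epochs : number of epochs classified as sleep (hypothetical input)
    epoch_seconds: epoch length; 30 s is an assumption, not from the paper
    """
    sleep_hours = sleep_epochs * epoch_seconds / 3600.0
    if sleep_hours <= 0:
        raise ValueError("total sleep time must be positive")
    return event_count / sleep_hours


def severity(ahi: float) -> str:
    """Map AHI to a severity label using commonly cited adult cut-offs."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    # 480 sleep epochs of 30 s each is 4 hours of sleep.
    ahi = apnea_hypopnea_index(event_count=40, sleep_epochs=480)
    print(ahi, severity(ahi))
```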

Title: MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

Authors: Haoyu Zheng, Wenqiao Zhang, Zheqi Lv, Yu Zhong, Yang Dai, Jianxiang An, Yongliang Shen, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19978
Pdf URL: https://arxiv.org/pdf/2412.19978
Copy Paste: [[2412.19978]] MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation(https://arxiv.org/abs/2412.19978)
Keywords: diffusion
Abstract: Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we introduce mask-guided attention modulation, enhancing correlations between spatially corresponding tokens and suppressing cross-attribute interference in both self-attention and cross-attention layers. To balance video frame generation quality and efficiency, we implement consistent feature propagation, which generates frame sequences by editing keyframes and propagating their features throughout the sequence. Extensive experiments demonstrate that MAKIMA outperforms existing baselines in open-domain multi-attribute video editing tasks, achieving superior results in both editing accuracy and temporal consistency while maintaining computational efficiency.

Title: Explainable Semantic Federated Learning Enabled Industrial Edge Network for Fire Surveillance

Authors: Li Dong, Yubo Peng, Feibo Jiang, Kezhi Wang, Kun Yang
Subjects: cs.LG, cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2412.19979
Pdf URL: https://arxiv.org/pdf/2412.19979
Copy Paste: [[2412.19979]] Explainable Semantic Federated Learning Enabled Industrial Edge Network for Fire Surveillance(https://arxiv.org/abs/2412.19979)
Keywords: security, privacy, federate, explainability
Abstract: In fire surveillance, Industrial Internet of Things (IIoT) devices must frequently transmit large volumes of monitoring data, which consumes substantial spectrum resources. Hence, we propose an Industrial Edge Semantic Network (IESN) to allow IIoT devices to send warnings through Semantic communication (SC). This design must address three issues: (1) data privacy and security, (2) SC model adaptation for heterogeneous devices, and (3) explainability of semantics. First, we present an eXplainable Semantic Federated Learning (XSFL) to train the SC model, thus ensuring data privacy and security. Then, we present an Adaptive Client Training (ACT) strategy to provide a specific SC model for each device according to its Fisher information matrix, thus overcoming the heterogeneity. Next, an Explainable SC (ESC) mechanism is designed, which introduces a leakyReLU-based activation mapping to explain the relationship between the extracted semantics and monitoring data. Finally, simulation results demonstrate the effectiveness of XSFL.

Title: The Fifth International Verification of Neural Networks Competition (VNN-COMP 2024): Summary and Results

Authors: Christopher Brix, Stanley Bak, Taylor T. Johnson, Haoze Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19985
Pdf URL: https://arxiv.org/pdf/2412.19985
Copy Paste: [[2412.19985]] The Fifth International Verification of Neural Networks Competition (VNN-COMP 2024): Summary and Results(https://arxiv.org/abs/2412.19985)
Keywords: fair
Abstract: This report summarizes the 5th International Verification of Neural Networks Competition (VNN-COMP 2024), held as a part of the 7th International Symposium on AI Verification (SAIV), that was collocated with the 36th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2024 iteration, 8 teams participated on a diverse set of 12 regular and 8 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.

Title: Delayed Random Partial Gradient Averaging for Federated Learning

Authors: Xinyi Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19987
Pdf URL: https://arxiv.org/pdf/2412.19987
Copy Paste: [[2412.19987]] Delayed Random Partial Gradient Averaging for Federated Learning(https://arxiv.org/abs/2412.19987)
Keywords: privacy, federate
Abstract: Federated learning (FL) is a distributed machine learning paradigm that enables multiple clients to train a shared model collaboratively while preserving privacy. However, the scaling of real-world FL systems is often limited by two communication bottlenecks: (a) while the increasing computing power of edge devices enables the deployment of large-scale Deep Neural Networks (DNNs), the limited bandwidth constrains frequent transmissions of large DNNs; and (b) high latency cost greatly degrades the performance of FL. In light of these bottlenecks, we propose a Delayed Random Partial Gradient Averaging (DPGA) to enhance FL. Under DPGA, clients only share partial local model gradients with the server. The size of the shared part in a local model is determined by the update rate, which is coarsely initialized and subsequently refined over the temporal dimension. Moreover, DPGA largely reduces the system run time by enabling computation in parallel with communication. We conduct experiments on non-IID CIFAR-10/100 to demonstrate the efficacy of our method.
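
The idea of sharing only part of each local gradient, with the shared fraction set by an update rate, can be illustrated with a toy sketch. This is not the paper's DPGA algorithm: the uniform random coordinate selection, the per-coordinate averaging rule, and all names below are assumptions, and the delayed (communication overlapped with computation) aspect and the refinement of the update rate over time are not modeled.

```python
import random


def select_partial(grad: list[float], update_rate: float,
                   rng: random.Random) -> dict[int, float]:
    """Pick a random subset of gradient coordinates to upload.

    update_rate in (0, 1] is the fraction of coordinates shared
    (hypothetical interpretation of the abstract's "update rate").
    """
    k = max(1, round(update_rate * len(grad)))
    idx = sorted(rng.sample(range(len(grad)), k))
    return {i: grad[i] for i in idx}


def average_partial(updates: list[dict[int, float]], dim: int) -> list[float]:
    """Server side: average each coordinate over the clients that sent it.

    Coordinates nobody shared are left at 0.0 (an arbitrary choice here).
    """
    sums = [0.0] * dim
    counts = [0] * dim
    for update in updates:
        for i, g in update.items():
            sums[i] += g
            counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


if __name__ == "__main__":
    rng = random.Random(0)
    client_grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
    uploads = [select_partial(g, update_rate=0.5, rng=rng) for g in client_grads]
    print(average_partial(uploads, dim=4))
```

With `update_rate=0.5` each client uploads half of its coordinates, so the per-round upload size is roughly halved compared with sending the full gradient.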

Title: Caesar: A Low-deviation Compression Approach for Efficient Federated Learning

Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang, Jiantao Gong, Xudong Liu, Kun Hou
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.19989
Pdf URL: https://arxiv.org/pdf/2412.19989
Copy Paste: [[2412.19989]] Caesar: A Low-deviation Compression Approach for Efficient Federated Learning(https://arxiv.org/abs/2412.19989)
Keywords: federate
Abstract: Compression is an efficient way to relieve the tremendous communication overhead of federated learning (FL) systems. However, for the existing works, the information loss under compression will lead to unexpected model/gradient deviation for the FL training, significantly degrading the training performance, especially under the challenges of data heterogeneity and model obsolescence. To strike a delicate trade-off between model accuracy and traffic cost, we propose Caesar, a novel FL framework with a low-deviation compression approach. For the global model download, we design a greedy method to optimize the compression ratio for each device based on the staleness of the local model, ensuring a precise initial model for local training. Regarding the local gradient upload, we utilize the device's local data properties (i.e., sample volume and label distribution) to quantify its local gradient's importance, which then guides the determination of the gradient compression ratio. Besides, with the fine-grained batch size optimization, Caesar can significantly diminish the devices' idle waiting time under the synchronized barrier. We have implemented Caesar on two physical platforms with 40 smartphones and 80 NVIDIA Jetson devices. Extensive results show that Caesar can reduce the traffic costs by about 25.54% to 37.88% compared to the compression-based baselines with the same target accuracy, while incurring only a 0.68% degradation in final test accuracy relative to the full-precision communication.

Title: A Robust Federated Learning Framework for Undependable Devices at Scale

Authors: Shilong Wang, Jianchun Liu, Hongli Xu, Chunming Qiao, Huarong Deng, Qiuye Zheng, Jiantao Gong
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.19991
Pdf URL: https://arxiv.org/pdf/2412.19991
Copy Paste: [[2412.19991]] A Robust Federated Learning Framework for Undependable Devices at Scale(https://arxiv.org/abs/2412.19991)
Keywords: robust, federate
Abstract: In a federated learning (FL) system, many devices, such as smartphones, are often undependable (e.g., frequently disconnected from WiFi) during training. Existing FL frameworks always assume a dependable environment and exclude undependable devices from training, leading to poor model performance and resource wastage. In this paper, we propose FLUDE to effectively deal with undependable environments. First, FLUDE assesses the dependability of devices based on the probability distribution of their historical behaviors (e.g., the likelihood of successfully completing training). Based on this assessment, FLUDE adaptively selects devices with high dependability for training. To mitigate resource wastage during the training phase, FLUDE maintains a model cache on each device, aiming to preserve the latest training state for later use in case local training on an undependable device is interrupted. Moreover, FLUDE proposes a staleness-aware strategy to judiciously distribute the global model to a subset of devices, thus significantly reducing resource wastage while maintaining model performance. We have implemented FLUDE on two physical platforms with 120 smartphones and NVIDIA Jetson devices. Extensive experimental results demonstrate that FLUDE can effectively improve model performance and resource efficiency of FL training in undependable environments.

Title: An Ordinary Differential Equation Sampler with Stochastic Start for Diffusion Bridge Models

Authors: Yuang Wang, Pengfei Jin, Li Zhang, Quanzheng Li, Zhiqiang Chen, Dufan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19992
Pdf URL: https://arxiv.org/pdf/2412.19992
Copy Paste: [[2412.19992]] An Ordinary Differential Equation Sampler with Stochastic Start for Diffusion Bridge Models(https://arxiv.org/abs/2412.19992)
Keywords: diffusion, generative
Abstract: Diffusion bridge models have demonstrated promising performance in conditional image generation tasks, such as image restoration and translation, by initializing the generative process from corrupted images instead of pure Gaussian noise. However, existing diffusion bridge models often rely on Stochastic Differential Equation (SDE) samplers, which result in slower inference speed compared to diffusion models that employ high-order Ordinary Differential Equation (ODE) solvers for acceleration. To mitigate this gap, we propose a high-order ODE sampler with a stochastic start for diffusion bridge models. To overcome the singular behavior of the probability flow ODE (PF-ODE) at the beginning of the reverse process, a posterior sampling approach was introduced at the first reverse step. The sampling was designed to ensure a smooth transition from corrupted images to the generative trajectory while reducing discretization errors. Following this stochastic start, Heun's second-order solver is applied to solve the PF-ODE, achieving high perceptual quality with significantly reduced neural function evaluations (NFEs). Our method is fully compatible with pretrained diffusion bridge models and requires no additional training. Extensive experiments on image restoration and translation tasks, including super-resolution, JPEG restoration, Edges-to-Handbags, and DIODE-Outdoor, demonstrated that our sampler outperforms state-of-the-art methods in both visual quality and Frechet Inception Distance (FID).
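
Heun's second-order method, named above as the solver applied after the stochastic start, can be sketched for a generic scalar ODE dx/dt = f(t, x). This is only the textbook predictor-corrector scheme; the paper's PF-ODE drift, its posterior-sampling first step, and its time discretization are not reproduced, and the function names and test equation here are assumptions.

```python
import math
from typing import Callable


def heun_solve(f: Callable[[float, float], float], x0: float,
               t0: float, t1: float, n_steps: int) -> float:
    """Integrate dx/dt = f(t, x) from t0 to t1 with Heun's method.

    Each step makes two evaluations of f: an Euler predictor, then a
    trapezoidal corrector that averages the two slopes. Integrating
    backwards in time (t1 < t0) works as well, since the step size h
    is then negative.
    """
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        k1 = f(t, x)                   # slope at the current point
        x_pred = x + h * k1            # Euler predictor
        k2 = f(t + h, x_pred)          # slope at the predicted point
        x = x + 0.5 * h * (k1 + k2)    # trapezoidal corrector
        t += h
    return x


if __name__ == "__main__":
    # Toy test equation dx/dt = -x with x(0) = 1; exact solution is exp(-t).
    approx = heun_solve(lambda t, x: -x, x0=1.0, t0=0.0, t1=1.0, n_steps=100)
    print(approx, math.exp(-1.0))
```

In a diffusion sampler, `f` would wrap a neural network call, so the two evaluations per step are what the NFE count measures; a second-order solver is attractive because it can reach a given accuracy with far fewer steps than a first-order one.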

Title: Discrete Curvature Graph Information Bottleneck

Authors: Xingcheng Fu, Jian Wang, Yisen Gao, Qingyun Sun, Haonan Yuan, Jianxin Li, Xianxian Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.19993
Pdf URL: https://arxiv.org/pdf/2412.19993
Copy Paste: [[2412.19993]] Discrete Curvature Graph Information Bottleneck(https://arxiv.org/abs/2412.19993)
Keywords: interpretability
Abstract: The performance of graph neural networks (GNNs) has been shown to depend on whether effective node information is sufficiently passed. Discrete curvature (Ricci curvature) is used to study graph connectivity and information propagation efficiency from a geometric perspective, and has been used in recent years to explore the efficient message-passing structure of GNNs. However, most empirical studies are based on directly observed graph structures or heuristic topological assumptions and lack in-depth exploration of underlying optimal information transport structures for downstream tasks. We suggest that graph curvature optimization is more in-depth and essential than directly rewiring or learning for graph structure with richer message-passing characterization and better information transport interpretability. From both graph geometry and information theory perspectives, we propose the novel Discrete Curvature Graph Information Bottleneck (CurvGIB) framework to optimize the information transport structure and learn better node representations simultaneously. CurvGIB advances the Variational Information Bottleneck (VIB) principle for Ricci curvature optimization to learn the optimal information transport pattern for specific downstream tasks. The learned Ricci curvature is used to refine the optimal transport structure of the graph, and the node representation is fully and efficiently learned. Moreover, for the computational complexity of Ricci curvature differentiation, we combine Ricci flow and VIB to deduce a curvature optimization approximation to form a tractable IB objective function. Extensive experiments on various datasets demonstrate the superior effectiveness and interpretability of CurvGIB.

Title: Comprehensive Review of EEG-to-Output Research: Decoding Neural Signals into Images, Videos, and Audio

Authors: Yashvir Sabharwal, Balaji Rama
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2412.19999
Pdf URL: https://arxiv.org/pdf/2412.19999
Copy Paste: [[2412.19999]] Comprehensive Review of EEG-to-Output Research: Decoding Neural Signals into Images, Videos, and Audio(https://arxiv.org/abs/2412.19999)
Keywords: transformer, generative
Abstract: Electroencephalography (EEG) is an invaluable tool in neuroscience, offering insights into brain activity with high temporal resolution. Recent advancements in machine learning and generative modeling have catalyzed the application of EEG in reconstructing perceptual experiences, including images, videos, and audio. This paper systematically reviews EEG-to-output research, focusing on state-of-the-art generative methods, evaluation metrics, and data challenges. Using PRISMA guidelines, we analyze 1800 studies and identify key trends, challenges, and opportunities in the field. The findings emphasize the potential of advanced models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformers, while highlighting the pressing need for standardized datasets and cross-subject generalization. A roadmap for future research is proposed that aims to improve decoding accuracy and broaden real-world applications.

Title: Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking

Authors: You Wu, Yongxin Li, Mengyuan Liu, Xucheng Wang, Xiangyang Yang, Hengzhou Ye, Dan Zeng, Qijun Zhao, Shuiwang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20002
Pdf URL: https://arxiv.org/pdf/2412.20002
Copy Paste: [[2412.20002]] Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking(https://arxiv.org/abs/2412.20002)
Keywords: transformer
Abstract: Visual tracking has made significant strides due to the adoption of transformer-based models. Most state-of-the-art trackers struggle to meet real-time processing demands on mobile platforms with constrained computing resources, particularly for real-time unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we introduce AVTrack, an adaptive computation framework designed to selectively activate transformer blocks for real-time UAV tracking. The proposed Activation Module (AM) dynamically optimizes the ViT architecture by selectively engaging relevant components, thereby enhancing inference efficiency without significant compromise to tracking performance. Furthermore, to tackle the challenges posed by extreme changes in viewing angles often encountered in UAV tracking, the proposed method enhances ViTs' effectiveness by learning view-invariant representations through mutual information (MI) maximization. Two effective design principles are proposed in the AVTrack. Building on it, we propose an improved tracker, dubbed AVTrack-MD, which introduces the novel MI maximization-based multi-teacher knowledge distillation (MD) framework. It harnesses the benefits of multiple teachers, specifically the off-the-shelf tracking models from the AVTrack, by integrating and refining their outputs, thereby guiding the learning process of the compact student network. Specifically, we maximize the MI between the softened feature representations from the multi-teacher models and the student model, leading to improved generalization and performance of the student model, particularly in noisy conditions. Extensive experiments on multiple UAV tracking benchmarks demonstrate that AVTrack-MD not only achieves performance comparable to the AVTrack baseline but also reduces model complexity, resulting in a significant 17% increase in average tracking speed.

Title: OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

Authors: Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen
Subjects: cs.CL, cs.AI, cs.DB, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20005
Pdf URL: https://arxiv.org/pdf/2412.20005
Copy Paste: [[2412.20005]] OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System(https://arxiv.org/abs/2412.20005)
Keywords: extraction
Abstract: We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configure knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configure knowledge base facilitates schema configuration, error case debugging and correction, further improving the performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the Code at this https URL and released a Video at this http URL.

Title: Adversarial Robustness for Deep Learning-based Wildfire Detection Models

Authors: Ryo Ide, Lei Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20006
Pdf URL: https://arxiv.org/pdf/2412.20006
Copy Paste: [[2412.20006]] Adversarial Robustness for Deep Learning-based Wildfire Detection Models(https://arxiv.org/abs/2412.20006)
Keywords: attack, robust, transformer
Abstract: Smoke detection using Deep Neural Networks (DNNs) is an effective approach for early wildfire detection. However, because smoke is temporally and spatially anomalous, there are limitations in collecting sufficient training data. This raises overfitting and bias concerns in existing DNN-based wildfire detection models. Thus, we introduce WARP (Wildfire Adversarial Robustness Procedure), the first model-agnostic framework for evaluating the adversarial robustness of DNN-based wildfire detection models. WARP addresses limitations in smoke image diversity using global and local adversarial attack methods. The global attack method uses image-contextualized Gaussian noise, while the local attack method uses patch noise injection, tailored to address critical aspects of wildfire detection. Leveraging WARP's model-agnostic capabilities, we assess the adversarial robustness of real-time Convolutional Neural Networks (CNNs) and Transformers. The analysis revealed valuable insights into the models' limitations. Specifically, the global attack method demonstrates that the Transformer model has more than 70% precision degradation than the CNN against global noise. In contrast, the local attack method shows that both models are susceptible to cloud image injections when detecting smoke-positive instances, suggesting a need for model improvements through data augmentation. WARP's comprehensive robustness analysis contributed to the development of wildfire-specific data augmentation strategies, marking a step toward practicality.

Title: Calibre: Towards Fair and Accurate Personalized Federated Learning with Self-Supervised Learning

Authors: Sijia Chen, Ningxin Su, Baochun Li
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2412.20020
Pdf URL: https://arxiv.org/pdf/2412.20020
Copy Paste: [[2412.20020]] Calibre: Towards Fair and Accurate Personalized Federated Learning with Self-Supervised Learning(https://arxiv.org/abs/2412.20020)
Keywords: federate, fair
Abstract: In the context of personalized federated learning, existing approaches train a global model to extract transferable representations, based on which any client could train personalized models with a limited number of data samples. Self-supervised learning is considered a promising direction as the global model it produces is generic and facilitates personalization for all clients fairly. However, when data is heterogeneous across clients, the global model trained using SSL is unable to learn high-quality personalized models. In this paper, we show that when the global model is trained with SSL without modifications, its produced representations have fuzzy class boundaries. As a result, personalized learning within each client produces models with low accuracy. In order to improve SSL towards better accuracy without sacrificing its advantage in fairness, we propose Calibre, a new personalized federated learning framework designed to calibrate SSL representations by maintaining a suitable balance between more generic and more client-specific representations. Calibre is designed based on theoretically-sound properties, and introduces (1) a client-specific prototype loss as an auxiliary training objective; and (2) an aggregation algorithm guided by such prototypes across clients. Our experimental results in an extensive array of non-i.i.d. settings show that Calibre achieves state-of-the-art performance in terms of both mean accuracy and fairness across clients. Code repo: this https URL.

Title: A Robust Adversarial Ensemble with Causal (Feature Interaction) Interpretations for Image Classification

Authors: Chunheng Zhao, Pierluigi Pisu, Gurcan Comert, Negash Begashaw, Varghese Vaidyan, Nina Christine Hubig
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20025
Pdf URL: https://arxiv.org/pdf/2412.20025
Copy Paste: [[2412.20025]] A Robust Adversarial Ensemble with Causal (Feature Interaction) Interpretations for Image Classification(https://arxiv.org/abs/2412.20025)
Keywords: attack, robust, extraction, interpretability, generative
Abstract: Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. We present a deep ensemble model that combines discriminative features with generative models to achieve both high accuracy and adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against white-box adversarial attacks without adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model's superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Additionally, preliminary results on Tiny-ImageNet validate our approach's scalability to more complex datasets, offering a practical solution for developing robust image classification models.

Title: STAYKATE: Hybrid In-Context Example Selection Combining Representativeness Sampling and Retrieval-based Approach -- A Case Study on Science Domains

Authors: Chencheng Zhu, Kazutaka Shimada, Tomoki Taniguchi, Tomoko Ohkuma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20043
Pdf URL: https://arxiv.org/pdf/2412.20043
Copy Paste: [[2412.20043]] STAYKATE: Hybrid In-Context Example Selection Combining Representativeness Sampling and Retrieval-based Approach -- A Case Study on Science Domains(https://arxiv.org/abs/2412.20043)
Keywords: extraction, large language model
Abstract: Large language models (LLMs) demonstrate the ability to learn in-context, offering a potential solution for scientific information extraction, which often contends with challenges such as insufficient training data and the high cost of annotation processes. Given that the selection of in-context examples can significantly impact performance, it is crucial to design a proper method to sample the efficient ones. In this paper, we propose STAYKATE, a static-dynamic hybrid selection method that combines the principles of representativeness sampling from active learning with the prevalent retrieval-based approach. The results across three domain-specific datasets indicate that STAYKATE outperforms both the traditional supervised methods and existing selection methods. The enhancement in performance is particularly pronounced for entity types that pose challenges for other methods.

Title: Enhancing Diffusion Models for Inverse Problems with Covariance-Aware Posterior Sampling

Authors: Shayan Mohajer Hamidi, En-Hui Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20045
Pdf URL: https://arxiv.org/pdf/2412.20045
Copy Paste: [[2412.20045]] Enhancing Diffusion Models for Inverse Problems with Covariance-Aware Posterior Sampling(https://arxiv.org/abs/2412.20045)
Keywords: diffusion
Abstract: Inverse problems exist in many disciplines of science and engineering. In computer vision, for example, tasks such as inpainting, deblurring, and super resolution can be effectively modeled as inverse problems. Recently, denoising diffusion probabilistic models (DDPMs) have been shown to provide a promising solution to noisy linear inverse problems without the need for additional task-specific training. Specifically, with the prior provided by DDPMs, one can sample from the posterior by approximating the likelihood. In the literature, approximations of the likelihood are often based on the mean of conditional densities of the reverse process, which can be obtained using Tweedie's formula. To obtain a better approximation to the likelihood, in this paper we first derive a closed-form formula for the covariance of the reverse process. Then, we propose a method based on the finite difference method to approximate this covariance such that it can be readily obtained from the existing pretrained DDPMs, thereby not increasing the complexity compared to existing approaches. Finally, based on the mean and approximated covariance of the reverse process, we present a new approximation to the likelihood. We refer to this method as covariance-aware diffusion posterior sampling (CA-DPS). Experimental results show that CA-DPS significantly improves reconstruction performance without requiring hyperparameter tuning. The code for the paper is put in the supplementary materials.
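For orientation, the posterior mean that the abstract attributes to Tweedie's formula can be written as below. This is the standard DDPM form, not an equation taken from the paper: the notation ($\bar{\alpha}_t$ for the cumulative noise schedule, $p_t$ for the marginal density of $x_t$) is an assumption, and the paper's covariance formula is not reproduced here.

```latex
% Assumed forward process (standard DDPM notation, not from the abstract):
%   x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
%   \epsilon \sim \mathcal{N}(0, I)
% Tweedie's formula then gives the posterior mean of x_0 given x_t:
\mathbb{E}[x_0 \mid x_t]
  = \frac{1}{\sqrt{\bar{\alpha}_t}}
    \Bigl( x_t + (1 - \bar{\alpha}_t)\, \nabla_{x_t} \log p_t(x_t) \Bigr)
```

In practice the score $\nabla_{x_t} \log p_t(x_t)$ is replaced by the output of the pretrained denoising network.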

Title: GSplatLoc: Ultra-Precise Camera Localization via 3D Gaussian Splatting

Authors: Atticus J. Zeller (Southeast University Chengxian College, Nanjing, China)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20056
Pdf URL: https://arxiv.org/pdf/2412.20056
Copy Paste: [[2412.20056]] GSplatLoc: Ultra-Precise Camera Localization via 3D Gaussian Splatting(https://arxiv.org/abs/2412.20056)
Keywords: robust
Abstract: We present GSplatLoc, a camera localization method that leverages the differentiable rendering capabilities of 3D Gaussian splatting for ultra-precise pose estimation. By formulating pose estimation as a gradient-based optimization problem that minimizes discrepancies between rendered depth maps from a pre-existing 3D Gaussian scene and observed depth images, GSplatLoc achieves translational errors within 0.01 cm and near-zero rotational errors on the Replica dataset - significantly outperforming existing methods. Evaluations on the Replica and TUM RGB-D datasets demonstrate the method's robustness in challenging indoor environments with complex camera motions. GSplatLoc sets a new benchmark for localization in dense mapping, with important implications for applications requiring accurate real-time localization, such as robotics and augmented reality.

Title: "My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise

Authors: Sharath Naganna, Saprativa Bhattacharjee, Pushpak Bhattacharyya, Biplab Banerjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20057
Pdf URL: https://arxiv.org/pdf/2412.20057
Copy Paste: [[2412.20057]] "My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise(https://arxiv.org/abs/2412.20057)
Keywords: large language model
Abstract: Humblebragging is a phenomenon where individuals present self-promotional statements under the guise of modesty or complaints. For example, a statement like, "Ugh, I can't believe I got promoted to lead the entire team. So stressful!", subtly highlights an achievement while pretending to be complaining. Detecting humblebragging is important for machines to better understand the nuances of human language, especially in tasks like sentiment analysis and intent recognition. However, this topic has not yet been studied in computational linguistics. For the first time, we introduce the task of automatically detecting humblebragging in text. We formalize the task by proposing a 4-tuple definition of humblebragging and evaluate machine learning, deep learning, and large language models (LLMs) on this task, comparing their performance with humans. We also create and release a dataset called HB24, containing 3,340 humblebrags generated using GPT-4o. Our experiments show that detecting humblebragging is non-trivial, even for humans. Our best model achieves an F1-score of 0.88. This work lays the foundation for further exploration of this nuanced linguistic phenomenon and its integration into broader natural language understanding systems.

Title: Comparative Analysis of Listwise Reranking with Large Language Models in Limited-Resource Language Contexts

Authors: Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du, Yiyi Tao, Yixian Shen, Hang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20061
Pdf URL: https://arxiv.org/pdf/2412.20061
Copy Paste: [[2412.20061]] Comparative Analysis of Listwise Reranking with Large Language Models in Limited-Resource Language Contexts(https://arxiv.org/abs/2412.20061)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated significant effectiveness across various NLP tasks, including text ranking. This study assesses the performance of large language models (LLMs) in listwise reranking for limited-resource African languages. We compare proprietary models RankGPT3.5, Rank4o-mini, RankGPTo1-mini and RankClaude-sonnet in cross-lingual contexts. Results indicate that these LLMs significantly outperform traditional baseline methods such as BM25-DT in most evaluation metrics, particularly in nDCG@10 and MRR@100. These findings highlight the potential of LLMs in enhancing reranking tasks for low-resource languages and offer insights into cost-effective solutions.
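The two metrics named in the abstract, nDCG@10 and MRR@100, can be sketched as follows. This is a minimal illustration with hypothetical function names, not the study's evaluation code; it assumes linear gain and computes the ideal ranking from the retrieved list itself, which is a common simplification.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering.

    Assumption: the ideal ordering is built from the same list of labels.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mrr_at_k(relevances_per_query, k):
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for relevances in relevances_per_query:
        for rank, rel in enumerate(relevances[:k], start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(relevances_per_query)


# One ranked list per query; each entry is the relevance label of that hit.
ranked = [[0, 1, 0], [1, 0, 0]]
print(ndcg_at_k(ranked[0], 10), mrr_at_k(ranked, 100))
```

A reranker improves nDCG@10 by moving highly relevant passages toward the top of the list, while MRR@100 only rewards the position of the first relevant hit.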

Title: MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion

Authors: Zechao Zhan, Dehong Gao, Jinxia Zhang, Jiale Huang, Yang Hu, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20062
Pdf URL: https://arxiv.org/pdf/2412.20062
Copy Paste: [[2412.20062]] MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion(https://arxiv.org/abs/2412.20062)
Keywords: diffusion, large language model
Abstract: Text-guided image editing models have achieved great success in the general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) inaccurate localization of the editing region; (2) weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to more accurately identify the editing region, the MaskNet is proposed, in which the foreground region, densepose and mask prompts from a large language model are fed into a lightweight UNet to predict the mask for the editing region. To strengthen the editing magnitude, the Attention-Enhanced Diffusion Model is proposed, where the noise map, attention map, and the mask from MaskNet are fed into the proposed Attention Processor to produce a refined noise map. By integrating the refined noise map into the diffusion model, the edited image can better align with the target prompt. Given the absence of benchmarks in fashion image editing, we constructed a dataset named Fashion-E, comprising 28390 image-text pairs in the training set, and 2639 image-text pairs for four types of fashion tasks in the evaluation set. Extensive experiments on Fashion-E demonstrate that our proposed method can accurately predict the mask of the editing region and significantly enhance editing magnitude in fashion image editing compared to the state-of-the-art methods.

Title: VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition

Authors: Lan Chen, Haoxiang Yang, Pengpeng Shao, Haoyu Song, Xiao Wang, Zhicheng Zhao, Yaowei Wang, Yonghong Tian
Subjects: cs.CV, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2412.20064
Pdf URL: https://arxiv.org/pdf/2412.20064
Copy Paste: [[2412.20064]] VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition(https://arxiv.org/abs/2412.20064)
Keywords: transformer
Abstract: Pattern recognition leveraging both RGB and Event cameras can significantly enhance performance by deploying deep neural networks that utilize a fine-tuning strategy. Inspired by the successful application of large models, the introduction of such large models can also be considered to further enhance the performance of multi-modal tasks. However, fully fine-tuning these models leads to inefficiency and lightweight fine-tuning methods such as LoRA and Adapter have been proposed to achieve a better balance between efficiency and performance. To our knowledge, there is currently no work that has conducted parameter-efficient fine-tuning (PEFT) for RGB-Event recognition based on pre-trained foundation models. To address this issue, this paper proposes a novel PEFT strategy to adapt the pre-trained foundation vision models for the RGB-Event-based classification. Specifically, given the RGB frames and event streams, we extract the RGB and event features based on the vision foundation model ViT with a modality-specific LoRA tuning strategy. The frame difference of the dual modalities is also considered to capture the motion cues via the frame difference backbone network. These features are concatenated and fed into high-level Transformer layers for efficient multi-modal feature learning via modality-shared LoRA tuning. Finally, we concatenate these features and feed them into a classification head to achieve efficient fine-tuning. The source code and pre-trained models will be released on this https URL.

Title: On the Compositional Generalization of Multimodal LLMs for Medical Imaging

Authors: Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20070
Pdf URL: https://arxiv.org/pdf/2412.20070
Copy Paste: [[2412.20070]] On the Compositional Generalization of Multimodal LLMs for Medical Imaging(https://arxiv.org/abs/2412.20070)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need for understanding what kinds of images can be used by MLLMs for generalization. Current research suggests that multi-task training outperforms single-task as different tasks can benefit each other, but they often overlook the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we attempted to employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Medical images can be precisely defined by Modality, Anatomical area, and Task, which naturally provides an environment for exploring CG. Therefore, we assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at this https URL.

Title: Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset

Authors: Chongjian Yue, Xinrun Xu, Xiaojun Ma, Lun Du, Zhiming Ding, Shi Han, Dongmei Zhang, Qi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20072
Pdf URL: https://arxiv.org/pdf/2412.20072
Copy Paste: [[2412.20072]] Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset(https://arxiv.org/abs/2412.20072)
Keywords: extraction, large language model
Abstract: Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, containing textual and tabular data, remains unexplored. The hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction framework (AIE) to enable LLMs to process the HLDs and carry out experiments to analyse four important aspects of information extraction from HLDs. Our findings are as follows: 1) There is an effective way to select and summarize the useful part of an HLD. 2) A simple table serialization is enough for LLMs to understand tables. 3) The naive AIE is adaptable to many complex scenarios. 4) Useful prompt engineering can enhance LLMs on HLDs. To address the issue of dataset scarcity in HLDs and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.

Title: MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing

Authors: Shuo Wang, Wanting Li, Yongcai Wang, Zhaoxin Fan, Zhe Huang, Xudong Cai, Jian Zhao, Deying Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20082
Pdf URL: https://arxiv.org/pdf/2412.20082
Copy Paste: [[2412.20082]] MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing(https://arxiv.org/abs/2412.20082)
Keywords: robust
Abstract: Deep visual odometry has demonstrated great advancements by learning-to-optimize technology. This approach heavily relies on the visual matching across frames. However, ambiguous matching in challenging scenarios leads to significant errors in geometric modeling and bundle adjustment optimization, which undermines the accuracy and robustness of pose estimation. To address this challenge, this paper proposes MambaVO, which conducts robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance the matching quality and improve the pose estimation in deep visual odometry. Specifically, when a new frame is received, it is matched with the closest keyframe in the maintained Point-Frame Graph (PFG) via the semi-dense based Geometric Initialization Module (GIM). Then the initialized PFG is processed by a proposed Geometric Mamba Module (GMM), which exploits the matching features to refine the overall inter-frame pixel-to-pixel matching. The refined PFG is finally processed by deep BA to optimize the poses and the map. To deal with the gradient variance, a Trending-Aware Penalty (TAP) is proposed to smooth training by balancing the pose loss and the matching loss to enhance convergence and stability. A loop closure module is finally applied to enable MambaVO++. On public benchmarks, MambaVO and MambaVO++ demonstrate SOTA accuracy performance, while ensuring real-time running performance with low GPU memory requirement. Codes will be publicly available.

Title: STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection

Authors: Zhangxun Li, Mengyang Zhao, Xuan Yang, Yang Liu, Jiamu Sheng, Xinhua Zeng, Tian Wang, Kewei Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20084
Pdf URL: https://arxiv.org/pdf/2412.20084
Copy Paste: [[2412.20084]] STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection(https://arxiv.org/abs/2412.20084)
Keywords: transformer
Abstract: Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems. However, most existing methods based on CNNs and transformers still suffer from substantial computational burdens and have room for improvement in learning spatial-temporal normality. Recently, Mamba has shown great potential for modeling long-range dependencies with linear complexity, providing an effective solution to the above dilemma. To this end, we propose a lightweight and effective Mamba-based network named STNMamba, which incorporates carefully designed Mamba modules to enhance the learning of spatial-temporal normality. Firstly, we develop a dual-encoder architecture, where the spatial encoder equipped with Multi-Scale Vision Space State Blocks (MS-VSSB) extracts multi-scale appearance features, and the temporal encoder employs Channel-Aware Vision Space State Blocks (CA-VSSB) to capture significant motion patterns. Secondly, a Spatial-Temporal Interaction Module (STIM) is introduced to integrate spatial and temporal information across multiple levels, enabling effective modeling of intrinsic spatial-temporal consistency. Within this module, the Spatial-Temporal Fusion Block (STFB) is proposed to fuse the spatial and temporal features into a unified feature space, and the memory bank is utilized to store spatial-temporal prototypes of normal patterns, restricting the model's ability to represent anomalies. Extensive experiments on three benchmark datasets demonstrate that our STNMamba achieves competitive performance with fewer parameters and lower computational costs than existing methods.

Title: Enhancing Marine Debris Acoustic Monitoring by Optical Flow-Based Motion Vector Analysis

Authors: Xiaoteng Zhou, Katsunori Mizuno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20085
Pdf URL: https://arxiv.org/pdf/2412.20085
Copy Paste: [[2412.20085]] Enhancing Marine Debris Acoustic Monitoring by Optical Flow-Based Motion Vector Analysis(https://arxiv.org/abs/2412.20085)
Keywords: robust
Abstract: With the development of coastal construction, a large amount of human-generated waste, particularly plastic debris, is continuously entering the ocean, posing a severe threat to marine ecosystems. The key to effectively addressing plastic pollution lies in the ability to autonomously monitor such debris. Currently, marine debris monitoring primarily relies on optical sensors, but these methods are limited in their applicability to underwater and seafloor areas due to low-visibility constraints. The acoustic camera, also known as high-resolution forward-looking sonar (FLS), has demonstrated considerable potential in the autonomous monitoring of marine debris, as it is unaffected by water turbidity and dark environments. The appearance of targets in sonar images changes with variations in the imaging viewpoint, while challenges such as low signal-to-noise ratio, weak textures, and imaging distortions in sonar imagery present significant obstacles to debris monitoring based on prior class labels. This paper proposes an optical flow-based method for marine debris monitoring, aiming to fully utilize the time series information captured by the acoustic camera to enhance the performance of marine debris monitoring without relying on prior category labels of the targets. The proposed method was validated through experiments conducted in a circulating water tank, demonstrating its feasibility and robustness. This approach holds promise for providing novel insights into the spatial and temporal distribution of debris.

Title: MAFT: Efficient Model-Agnostic Fairness Testing for Deep Neural Networks via Zero-Order Gradient Search

Authors: Zhaohui Wang, Min Zhang, Jingran Yang, Bojie Shao, Min Zhang
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2412.20086
Pdf URL: https://arxiv.org/pdf/2412.20086
Copy Paste: [[2412.20086]] MAFT: Efficient Model-Agnostic Fairness Testing for Deep Neural Networks via Zero-Order Gradient Search(https://arxiv.org/abs/2412.20086)
Keywords: fair
Abstract: Deep neural networks (DNNs) have shown powerful performance in various applications and are increasingly being used in decision-making systems. However, concerns about fairness in DNNs always persist. Some efficient white-box fairness testing methods for individual fairness have been proposed. Nevertheless, the development of black-box methods has stagnated, and the performance of existing methods is far behind that of white-box methods. In this paper, we propose a novel black-box individual fairness testing method called Model-Agnostic Fairness Testing (MAFT). By leveraging MAFT, practitioners can effectively identify and address discrimination in DL models, regardless of the specific algorithm or architecture employed. Our approach adopts lightweight procedures such as gradient estimation and attribute perturbation rather than non-trivial procedures like symbolic execution, rendering it significantly more scalable and applicable than existing methods. We demonstrate that MAFT achieves the same effectiveness as state-of-the-art white-box methods whilst improving the applicability to large-scale networks. Compared to existing black-box approaches, our approach demonstrates distinguished performance in discovering fairness violations w.r.t. effectiveness (approximately 14.69 times) and efficiency (approximately 32.58 times).
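
The gradient estimation step mentioned in this abstract can be illustrated with a generic finite-difference estimator. This is a minimal sketch of zero-order (query-only) gradient estimation in general, not the MAFT implementation; `estimate_gradient`, `toy_model`, and the step size `h` are hypothetical names and values chosen for illustration.

```python
from typing import Callable, List


def estimate_gradient(f: Callable[[List[float]], float],
                      x: List[float],
                      h: float = 1e-4) -> List[float]:
    """Approximate the gradient of a black-box scalar function f at x.

    Central differences are used, so only function queries are needed
    and no access to the model internals is assumed.
    """
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad


def toy_model(x: List[float]) -> float:
    # Hypothetical stand-in for a model's output score.
    return 3.0 * x[0] ** 2 + 2.0 * x[1]


# The analytic gradient of toy_model at (1, 5) is (6, 2).
grad = estimate_gradient(toy_model, [1.0, 5.0])
```

Each coordinate costs two model queries, so the estimate scales linearly with the number of input attributes; a black-box tester would typically use the sign or magnitude of such an estimate to decide which attributes to perturb.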

Title: On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs

Authors: Atmane Ayoub Mansour Bahar, Ahmad Samer Wazan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20087
Pdf URL: https://arxiv.org/pdf/2412.20087
Copy Paste: [[2412.20087]] On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs(https://arxiv.org/abs/2412.20087)
Keywords: attack, large language model
Abstract: This research investigates the effectiveness of established vulnerability metrics, such as the Common Vulnerability Scoring System (CVSS), in evaluating attacks against Large Language Models (LLMs), with a focus on Adversarial Attacks (AAs). The study explores the influence of both general and specific metric factors in determining vulnerability scores, providing new perspectives on potential enhancements to these metrics. This study adopts a quantitative approach, calculating and comparing the coefficient of variation of vulnerability scores across 56 adversarial attacks on LLMs. The attacks, sourced from various research papers and obtained through online databases, were evaluated using multiple vulnerability metrics. Scores were determined by averaging the values assessed by three distinct LLMs. The results indicate that existing scoring systems yield vulnerability scores with minimal variation across different attacks, suggesting that many of the metric factors are inadequate for assessing adversarial attacks on LLMs. This is particularly true for context-specific factors or those with predefined value sets, such as those in CVSS. These findings support the hypothesis that current vulnerability metrics, especially those with rigid values, are limited in evaluating AAs on LLMs, highlighting the need for the development of more flexible, generalized metrics tailored to such attacks. This research offers a fresh analysis of the effectiveness and applicability of established vulnerability metrics, particularly in the context of Adversarial Attacks on Large Language Models, both of which have gained significant attention in recent years. Through extensive testing and calculations, the study underscores the limitations of these metrics and opens up new avenues for improving and refining vulnerability assessment frameworks specifically tailored for LLMs.
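
The coefficient of variation used in this abstract is the standard deviation divided by the mean. The sketch below shows the computation on made-up scores; the score lists are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the abstract does not say which the authors used.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Return stdev / mean for a set of scores (population stdev assumed)."""
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return statistics.pstdev(scores) / mean


# Hypothetical scores: a metric that assigns every attack the same value
# has zero variation and cannot tell the attacks apart.
cv_flat = coefficient_of_variation([7.5, 7.5, 7.5, 7.5])
cv_spread = coefficient_of_variation([2.0, 4.0, 6.0, 8.0])
```

A value near zero means the metric barely distinguishes between attacks, which is the pattern the abstract reports for existing scoring systems.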

Title: SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

Authors: Wenkun He, Yun Liu, Ruitao Liu, Li Yi
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.20104
Pdf URL: https://arxiv.org/pdf/2412.20104
Copy Paste: [[2412.20104]] SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis(https://arxiv.org/abs/2412.20104)
Keywords: diffusion
Abstract: Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.

Title: ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

Authors: Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20105
Pdf URL: https://arxiv.org/pdf/2412.20105
Copy Paste: [[2412.20105]] ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming(https://arxiv.org/abs/2412.20105)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analysis of the MLLM attention mechanisms remains shallow, leading to coarse-grain token pruning strategies that fail to effectively balance speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming ($\textbf{ST}^{3}$), a framework designed to accelerate MLLM inference without retraining. $\textbf{ST}^{3}$ consists of two primary components: 1) Progressive Visual Token Pruning (\textbf{PVTP}), which eliminates inattentive visual tokens across layers, and 2) Visual Token Annealing (\textbf{VTA}), which dynamically reduces the number of visual tokens in each layer as the generated tokens grow. Together, these techniques deliver around $\mathbf{2\times}$ faster inference with only about $\mathbf{30\%}$ KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Crucially, $\textbf{ST}^{3}$ can be seamlessly integrated into existing pre-trained MLLMs, providing a plug-and-play solution for efficient inference.
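
Attention-based token pruning of the kind described in this abstract can be sketched as keeping only the visual tokens that receive the most attention. This is a generic top-k illustration, not the ST$^3$ algorithm; `prune_tokens`, the attention scores, and the keep ratio are hypothetical.

```python
from typing import List, Sequence


def prune_tokens(tokens: Sequence[str],
                 attention: Sequence[float],
                 keep_ratio: float) -> List[str]:
    """Keep the most-attended tokens, preserving their original order."""
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]


# Hypothetical visual tokens and the attention each one receives.
visual_tokens = ["v0", "v1", "v2", "v3", "v4", "v5"]
scores = [0.30, 0.02, 0.25, 0.01, 0.40, 0.02]
kept = prune_tokens(visual_tokens, scores, keep_ratio=0.5)
```

Applying such a step repeatedly with a shrinking `keep_ratio` at deeper layers would give a progressive schedule, though the schedule the authors use is not specified in the abstract.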

Title: M-MAD: Multidimensional Multi-Agent Debate Framework for Fine-grained Machine Translation Evaluation

Authors: Zhaopeng Feng, Jiayuan Su, Jiamei Zheng, Jiahan Ren, Yan Zhang, Jian Wu, Hongwei Wang, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20127
Pdf URL: https://arxiv.org/pdf/2412.20127
Copy Paste: [[2412.20127]] M-MAD: Multidimensional Multi-Agent Debate Framework for Fine-grained Machine Translation Evaluation(https://arxiv.org/abs/2412.20127)
Keywords: robust, large language model
Abstract: Recent advancements in large language models (LLMs) have given rise to the LLM-as-a-judge paradigm, showcasing their potential to deliver human-like judgments. However, in the field of machine translation (MT) evaluation, current LLM-as-a-judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our findings demonstrate that M-MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine-grained assessments; (2) employing multi-agent debates to harness the collaborative reasoning capabilities of LLMs; (3) synthesizing dimension-specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M-MAD not only outperforms all existing LLM-as-a-judge methods but also competes with state-of-the-art reference-based automatic metrics, even when powered by a suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective for LLM-as-a-judge paradigm. Our code and data are publicly available at this https URL.

Title: Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering

Authors: Wei Zhou, Mohsen Mesgar, Annemarie Friedrich, Heike Adel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20145
Pdf URL: https://arxiv.org/pdf/2412.20145
Copy Paste: [[2412.20145]] Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering(https://arxiv.org/abs/2412.20145)
Keywords: large language model
Abstract: Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous SoTA systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi-agent collaboration in TQA.

Title: Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection

Authors: Yaning Zhang, Qiufu Li, Zitong Yu, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20156
Pdf URL: https://arxiv.org/pdf/2412.20156
Copy Paste: [[2412.20156]] Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection(https://arxiv.org/abs/2412.20156)
Keywords: robust, transformer
Abstract: Face forgery detection (FFD) is devoted to detecting the authenticity of face images. Although current CNN-based works achieve outstanding performance in FFD, they are susceptible to capturing local forgery patterns generated by various manipulation methods. Though transformer-based detectors exhibit improvements in modeling global dependencies, they are not good at exploring local forgery artifacts. Hybrid transformer-based networks are designed to capture local and global manipulated traces, but they tend to suffer from the attention collapse issue as the transformer block goes deeper. Besides, soft labels are rarely available. In this paper, we propose a distilled transformer network (DTN) to capture both rich local and global forgery traces and learn general and common representations for different forgery faces. Specifically, we design a mixture-of-experts (MoE) module to mine various robust forgery embeddings. Moreover, a locally-enhanced vision transformer (LEVT) module is proposed to learn locally-enhanced global representations. We design a lightweight multi-attention scaling (MAS) module to avoid attention collapse, which can be plugged into any transformer-based model with only a slight increase in computational cost. In addition, we propose a deepfake self-distillation (DSD) scheme to provide the model with abundant soft label information. Extensive experiments show that the proposed method surpasses the state of the art on five deepfake datasets.

Title: UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity

Authors: Jingbo Lin, Zhilu Zhang, Wenbo Li, Renjing Pei, Hang Xu, Hongzhi Zhang, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20157
Pdf URL: https://arxiv.org/pdf/2412.20157
Copy Paste: [[2412.20157]] UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity(https://arxiv.org/abs/2412.20157)
Keywords: robust
Abstract: Recently, considerable progress has been made in all-in-one image restoration. Generally, existing methods can be degradation-agnostic or degradation-aware. However, the former are limited in leveraging degradation-specific restoration, and the latter suffer from the inevitable error in degradation estimation. Consequently, the performance of existing methods has a large gap compared to specific single-task models. In this work, we make a step forward in this topic, and present our UniRestorer with improved restoration performance. Specifically, we perform hierarchical clustering on degradation space, and train a multi-granularity mixture-of-experts (MoE) restoration model. Then, UniRestorer adopts both degradation and granularity estimation to adaptively select an appropriate expert for image restoration. In contrast to existing degradation-agnostic and -aware methods, UniRestorer can leverage degradation estimation to benefit degradation-specific restoration, and use granularity estimation to make the model robust to degradation estimation error. Experimental results show that our UniRestorer outperforms state-of-the-art all-in-one methods by a large margin, and is promising in closing the performance gap to specific single-task models. The code and pre-trained models will be publicly available at this https URL.

Title: Multi-Modality Driven LoRA for Adverse Condition Depth Estimation

Authors: Guanglei Yang, Rui Tian, Yongqiang Zhang, Zhun Zhong, Yongqiang Li, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20162
Pdf URL: https://arxiv.org/pdf/2412.20162
Copy Paste: [[2412.20162]] Multi-Modality Driven LoRA for Adverse Condition Depth Estimation(https://arxiv.org/abs/2412.20162)
Keywords: robust, diffusion, generative
Abstract: The autonomous driving community is increasingly focused on addressing corner case problems, particularly those related to ensuring driving safety under adverse conditions (e.g., nighttime, fog, rain). To this end, the task of Adverse Condition Depth Estimation (ACDE) has gained significant attention. Previous approaches in ACDE have primarily relied on generative models, which necessitate additional target images to convert the sunny condition into adverse weather, or learnable parameters for feature augmentation to adapt domain gaps, resulting in increased model complexity and tuning efforts. Furthermore, unlike CLIP-based methods where textual and visual features have been pre-aligned, depth estimation models lack sufficient alignment between multimodal features, hindering coherent understanding under adverse conditions. To address these limitations, we propose Multi-Modality Driven LoRA (MMD-LoRA), which leverages low-rank adaptation matrices for efficient fine-tuning from source-domain to target-domain. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). During PDDA, the image encoder with MMD-LoRA generates target-domain visual representations, supervised by alignment loss that the source-target difference between language and image should be equal. Meanwhile, VTCCL bridges the gap between textual features from CLIP and visual features from diffusion model, pushing apart different weather representations (vision and text) and bringing together similar ones. Through extensive experiments, the proposed method achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets, underscoring robustness and efficiency in adapting to varied adverse environments.

Title: StyleAutoEncoder for manipulating image attributes using pre-trained StyleGAN

Authors: Andrzej Bedychaj, Jacek Tabor, Marek Śmieja
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20164
Pdf URL: https://arxiv.org/pdf/2412.20164
Copy Paste: [[2412.20164]] StyleAutoEncoder for manipulating image attributes using pre-trained StyleGAN(https://arxiv.org/abs/2412.20164)
Keywords: generative
Abstract: Deep conditional generative models are excellent tools for creating high-quality images and editing their attributes. However, training modern generative models from scratch is very expensive and requires large computational resources. In this paper, we introduce StyleAutoEncoder (StyleAE), a lightweight AutoEncoder module, which works as a plugin for pre-trained generative models and allows for manipulating the requested attributes of images. The proposed method offers a cost-effective solution for training deep generative models with limited computational resources, making it a promising technique for a wide range of applications. We evaluate StyleAutoEncoder by combining it with StyleGAN, which is currently one of the top generative models. Our experiments demonstrate that StyleAutoEncoder is at least as effective in manipulating image attributes as the state-of-the-art algorithms based on invertible normalizing flows. However, it is simpler, faster, and gives more freedom in designing neural

Title: Real-time Calibration Model for Low-cost Sensor in Fine-grained Time series

Authors: Seokho Ahn, Hyungjin Kim, Sungbok Shin, Young-Duk Seo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20170
Pdf URL: https://arxiv.org/pdf/2412.20170
Copy Paste: [[2412.20170]] Real-time Calibration Model for Low-cost Sensor in Fine-grained Time series(https://arxiv.org/abs/2412.20170)
Keywords: transformer
Abstract: Precise measurements from sensors are crucial, but data is usually collected from low-cost, low-tech systems, which are often inaccurate. Thus, they require further calibrations. To that end, we first identify three requirements for effective calibration under practical low-tech sensor conditions. Based on the requirements, we develop a model called TESLA, Transformer for effective sensor calibration utilizing logarithmic-binned attention. TESLA uses a high-performance deep learning model, Transformers, to calibrate and capture non-linear components. At its core, it employs logarithmic binning to minimize attention complexity. TESLA achieves consistent real-time calibration, even with longer sequences and finer-grained time series in hardware-constrained systems. Experiments show that TESLA outperforms existing novel deep learning and newly crafted linear models in accuracy, calibration speed, and energy efficiency.

Title: Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation

Authors: Guanglei Yang, Yongqiang Zhang, Wanlong Li, Yu Tang, Weize Shang, Feng Wen, Hongbo Zhang, Mingli Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20171
Pdf URL: https://arxiv.org/pdf/2412.20171
Copy Paste: [[2412.20171]] Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation(https://arxiv.org/abs/2412.20171)
Keywords: transformer, segmentation
Abstract: Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks; however, they inherently struggle to model long-range dependencies explicitly due to the localized nature of convolution operations. Although Transformers have addressed limitations in long-range dependencies for the spatial dimension, the temporal dimension remains underexplored. In this paper, we first highlight that 3D CNNs exhibit limitations in capturing long-range temporal dependencies. Though Transformers mitigate spatial dimension issues, they result in a considerable increase in parameters and a reduction in processing speed. To overcome these challenges, we introduce a simple yet effective module, Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird's-Eye View segmentation. Specifically, we substitute the 3D CNN layers with ConvGRU in the temporal module to bolster the capacity of networks for handling temporal dependencies. Additionally, we integrate a geographical mask into the Convolutional Gated Recurrent Unit to suppress noise introduced by the temporal module. Comprehensive experiments conducted on the NuScenes dataset substantiate the merits of the proposed Geo-ConvGRU, revealing that our approach attains state-of-the-art performance in Bird's-Eye View segmentation.

Title: Pushing the Envelope of Low-Bit LLM via Dynamic Error Compensation

Authors: Yeonhong Park, Jake Hyun, Hojoon Kim, Jae W. Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20185
Pdf URL: https://arxiv.org/pdf/2412.20185
Copy Paste: [[2412.20185]] Pushing the Envelope of Low-Bit LLM via Dynamic Error Compensation(https://arxiv.org/abs/2412.20185)
Keywords: large language model
Abstract: Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose QDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and inference latency reduction. QDEC stores the residual matrix -- the difference between full-precision and quantized weights -- in CPU, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, with the fetched residuals helping to correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by analyzing the input activations -- this allows for the adaptation to the dynamic nature of activation distribution, and thus maximizes the effectiveness of error compensation. We demonstrate the effectiveness of QDEC by augmenting state-of-the-art quantization methods. For example, QDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model from 10.15 to 9.12 -- outperforming its 3.5-bit counterpart -- while adding less than 0.0003\% to GPU memory usage and incurring only a 1.7\% inference slowdown on NVIDIA RTX 4050 Mobile GPU. The code will be publicly available soon.
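
The idea of correcting quantization error only in salient input channels can be sketched in a few lines. This is a toy illustration under stated assumptions, not the QDEC implementation: the uniform rounding quantizer, the top-k rule on absolute activation, and all names (`quantize`, `salient_channels`, `compensated_matvec`) are hypothetical, and the CPU-to-GPU residual fetch is not modelled.

```python
from typing import List

Matrix = List[List[float]]


def quantize(w: Matrix, step: float) -> Matrix:
    """Toy uniform quantizer: round each weight to a multiple of step."""
    return [[round(v / step) * step for v in row] for row in w]


def salient_channels(x: List[float], k: int) -> List[int]:
    """Indices of the k input channels with the largest |activation|."""
    return sorted(range(len(x)), key=lambda j: abs(x[j]), reverse=True)[:k]


def compensated_matvec(w_q: Matrix, residual: Matrix,
                       x: List[float], k: int) -> List[float]:
    """Compute w_q @ x, then add residual terms for the k salient channels."""
    channels = salient_channels(x, k)
    out = []
    for q_row, r_row in zip(w_q, residual):
        y = sum(q * xj for q, xj in zip(q_row, x))
        y += sum(r_row[j] * x[j] for j in channels)
        out.append(y)
    return out


# Hypothetical weights and an activation vector with one outlier channel.
w = [[0.26, -0.61, 0.12], [0.49, 0.33, -0.87]]
x = [0.1, 8.0, -0.2]
w_q = quantize(w, step=0.25)
residual = [[a - b for a, b in zip(wr, qr)] for wr, qr in zip(w, w_q)]

exact = [sum(a * xj for a, xj in zip(row, x)) for row in w]
plain = compensated_matvec(w_q, residual, x, k=0)
fixed = compensated_matvec(w_q, residual, x, k=1)
```

Because the outlier channel dominates the output, restoring the residual for that single column removes most of the error while touching only a small fraction of the residual matrix.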

Title: Lower bounds on transformers with infinite precision

Authors: Alexander Kozachinskiy
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.20195
Pdf URL: https://arxiv.org/pdf/2412.20195
Copy Paste: [[2412.20195]] Lower bounds on transformers with infinite precision(https://arxiv.org/abs/2412.20195)
Keywords: transformer
Abstract: In this note, we use the VC dimension technique to prove the first lower bound against one-layer softmax transformers with infinite precision. We do so for two tasks: function composition, considered by Peng, Narayanan, and Papadimitriou, and the SUM$_2$ task, considered by Sanford, Hsu, and Telgarsky.

Title: Federated Unlearning with Gradient Descent and Conflict Mitigation

Authors: Zibin Pan, Zhichao Wang, Chi Li, Kaiyan Zheng, Boqi Wang, Xiaoying Tang, Junhua Zhao
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2412.20200
Pdf URL: https://arxiv.org/pdf/2412.20200
Copy Paste: [[2412.20200]] Federated Unlearning with Gradient Descent and Conflict Mitigation(https://arxiv.org/abs/2412.20200)
Keywords: privacy, federate
Abstract: Federated Learning (FL) has received much attention in recent years. However, although clients are not required to share their data in FL, the global model itself can implicitly remember clients' local data. Therefore, it's necessary to effectively remove the target client's data from the FL global model to ease the risk of privacy leakage and implement "the right to be forgotten". Federated Unlearning (FU) has been considered a promising way to remove data without full retraining. But the model utility easily suffers significant reduction during unlearning due to the gradient conflicts. Furthermore, when conducting the post-training to recover the model utility, the model is prone to move back and revert what has already been unlearned. To address these issues, we propose Federated Unlearning with Orthogonal Steepest Descent (FedOSD). We first design an unlearning Cross-Entropy loss to overcome the convergence issue of the gradient ascent. A steepest descent direction for unlearning is then calculated under the condition that it does not conflict with other clients' gradients and is closest to the target client's gradient. This helps to unlearn efficiently and mitigates the reduction in model utility. After unlearning, we recover the model utility while maintaining the achievement of unlearning. Finally, extensive experiments in several FL scenarios verify that FedOSD outperforms the SOTA FU algorithms in terms of unlearning and model utility.

Title: Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems

Authors: Wen-Dong Jiang, Chih-Yung Chang, Hsiang-Chuan Chang, Ji-Yuan Chen, Diptendu Sinha Roy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20201
Pdf URL: https://arxiv.org/pdf/2412.20201
Copy Paste: [[2412.20201]] Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems(https://arxiv.org/abs/2412.20201)
Keywords: interpretability, explainability
Abstract: Weakly Supervised Monitoring Anomaly Detection (WSMAD) utilizes weak supervision learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. It operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.

Title: Towards Visual Grounding: A Survey

Authors: Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20206
Pdf URL: https://arxiv.org/pdf/2412.20206
Copy Paste: [[2412.20206]] Towards Visual Grounding: A Survey(https://arxiv.org/abs/2412.20206)
Keywords: fair
Abstract: Visual Grounding is also known as Referring Expression Comprehension and Phrase Grounding. It involves localizing a natural number of specific regions within an image based on a given textual description. The objective of this task is to emulate the prevalent referential relationships in social conversations, equipping machines with human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. However, since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding, which have brought numerous new challenges. In this survey, we initially examine the developmental history of visual grounding and provide an overview of essential background knowledge. We systematically track and summarize the advancements and meticulously organize the various settings in visual grounding, thereby establishing precise definitions of these settings to standardize future research and ensure a fair comparison. Additionally, we delve into several advanced topics and highlight numerous applications of visual grounding. Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey encompasses the representative works in each subtopic over the past decade. To the best of our knowledge, this paper presents the most comprehensive overview currently available in the field of grounding. This survey is designed to be suitable for both beginners and experienced researchers, serving as an invaluable resource for understanding key concepts and tracking the latest research developments. We keep tracing related works at this https URL.

Title: Towards Real-Time 2D Mapping: Harnessing Drones, AI, and Computer Vision for Advanced Insights

Authors: Bharath Kumar Agnur
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20210
Pdf URL: https://arxiv.org/pdf/2412.20210
Copy Paste: [[2412.20210]] Towards Real-Time 2D Mapping: Harnessing Drones, AI, and Computer Vision for Advanced Insights(https://arxiv.org/abs/2412.20210)
Keywords: defense
Abstract: Real-time 2D mapping is a vital tool in aerospace and defense, where accurate and timely geographic data is essential for operations like surveillance, reconnaissance, and target tracking. This project introduces a cutting-edge mapping system that integrates drone imagery with machine learning and computer vision to address challenges in processing speed, accuracy, and adaptability to diverse terrains. By automating feature detection, image matching, and stitching, the system generates seamless, high-resolution maps with minimal delay, providing strategic advantages in defense operations. Implemented in Python, the system leverages OpenCV for image processing, NumPy for efficient computations, and this http URL for parallel processing. ORB (Oriented FAST and Rotated BRIEF) handles feature detection, while FLANN (Fast Library for Approximate Nearest Neighbors) ensures precise keypoint matching. Homography transformations align overlapping images, creating distortion-free maps in real time. This automated approach eliminates manual intervention, enabling live updates critical in dynamic environments. Designed for adaptability, the system performs well under varying light conditions and rugged terrains, making it highly effective in aerospace and defense scenarios. Testing demonstrates significant improvements in speed and accuracy compared to traditional methods, enhancing situational awareness and decision-making. This scalable solution leverages advanced technologies to deliver reliable, actionable data for mission-critical operations.

Title: Generative Regression Based Watch Time Prediction for Video Recommendation: Model and Performance

Authors: Hongxu Ma, Kai Tian, Tao Zhang, Xuefeng Zhang, Chunjie Chen, Han Li, Jihong Guan, Shuigeng Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20211
Pdf URL: https://arxiv.org/pdf/2412.20211
Copy Paste: [[2412.20211]] Generative Regression Based Watch Time Prediction for Video Recommendation: Model and Performance(https://arxiv.org/abs/2412.20211)
Keywords: generative
Abstract: Watch time prediction (WTP) has emerged as a pivotal task in short video recommendation systems, designed to encapsulate user interests. Predicting users' watch times on videos often encounters challenges, including wide value ranges and imbalanced data distributions, which can lead to significant bias when directly regressing watch time. Recent studies have tried to tackle these issues by converting the continuous watch time estimation into an ordinal classification task. While these methods are somewhat effective, they exhibit notable limitations. Inspired by language modeling, we propose a novel Generative Regression (GR) paradigm for WTP based on sequence generation. This approach employs structural discretization to enable the lossless reconstruction of original values while maintaining prediction fidelity. By formulating the prediction problem as a numerical-to-sequence mapping, and with meticulously designed vocabulary and label encodings, each watch time is transformed into a sequence of tokens. To expedite model training, we introduce the curriculum learning with an embedding mixup strategy which can mitigate training-and-inference inconsistency associated with teacher forcing. We evaluate our method against state-of-the-art approaches on four public datasets and one industrial dataset. We also perform online A/B testing on Kuaishou, a leading video app with about 400 million DAUs, to demonstrate the real-world efficacy of our method. The results conclusively show that GR outperforms existing techniques significantly. Furthermore, we successfully apply GR to another regression task in recommendation systems, i.e., Lifetime Value (LTV) prediction, which highlights its potential as a novel and effective solution to general regression challenges.

Title: Building a Rich Dataset to Empower the Persian Question Answering Systems

Authors: Mohsen Yazdinejad, Marjan Kaedi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20212
Pdf URL: https://arxiv.org/pdf/2412.20212
Copy Paste: [[2412.20212]] Building a Rich Dataset to Empower the Persian Question Answering Systems(https://arxiv.org/abs/2412.20212)
Keywords: robust
Abstract: Question answering systems provide short, precise, and specific answers to questions. So far, many robust question answering systems have been developed for English, while some lower-resource languages, like Persian, have few standard datasets. In this study, a comprehensive open-domain dataset is presented for Persian. This dataset is called NextQuAD and has 7,515 contexts, including 23,918 questions and answers. Then, a BERT-based question answering model has been applied to this dataset using two pre-trained language models, ParsBERT and XLM-RoBERTa. The results of these two models have been ensembled using mean logits. Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1 score. Also, to compare NextQuAD with other Persian datasets, our model trained on NextQuAD is evaluated on two other datasets, PersianQA and ParSQuAD. Comparisons show that the proposed model increased EM by 0.39 and 0.14 on PersianQA and ParSQuAD-manual, respectively, while a slight EM decline of 0.007 occurred on ParSQuAD-automatic.
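
Note: the mean-logit ensembling mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The function names are hypothetical, and it assumes the start and end logits of the two models have already been aligned to the same token positions, which is not automatic because ParsBERT and XLM-RoBERTa use different tokenizers.

```python
def mean_logit_ensemble(logits_a, logits_b):
    """Average two equally long logit lists position by position."""
    if len(logits_a) != len(logits_b):
        raise ValueError("logit lists must be aligned to the same positions")
    return [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]


def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair with the highest summed logit.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy logits for a three-token context (illustrative values only).
start = mean_logit_ensemble([0.1, 2.0, 0.3], [0.5, 1.0, 0.1])
end = mean_logit_ensemble([0.0, 0.2, 3.0], [0.2, 0.4, 1.0])
span = best_span(start, end)
```

Averaging logits rather than final answers lets a confident model outweigh an uncertain one at each position, which is one plausible reason to prefer it over majority voting with only two models.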

Title: IMSSA: Deploying modern state-space models on memristive in-memory compute hardware

Authors: Sebastian Siegel, Ming-Jay Yang, John-Paul Strachan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20215
Pdf URL: https://arxiv.org/pdf/2412.20215
Copy Paste: [[2412.20215]] IMSSA: Deploying modern state-space models on memristive in-memory compute hardware(https://arxiv.org/abs/2412.20215)
Keywords: transformer
Abstract: Processing long temporal sequences is a key challenge in deep learning. In recent years, Transformers have become state-of-the-art for this task, but suffer from excessive memory requirements due to the need to explicitly store the sequences. To address this issue, structured state-space sequential (S4) models recently emerged, offering a fixed memory state while still enabling the processing of very long sequence contexts. The recurrent linear update of the state in these models makes them highly efficient on modern graphics processing units (GPUs) by unrolling the recurrence into a convolution. However, this approach demands significant memory and massively parallel computation, which is only available on the latest GPUs. In this work, we aim to bring the power of S4 models to edge hardware by significantly reducing the size and computational demand of an S4D model through quantization-aware training, even achieving ternary weights for a simple real-world task. To this end, we extend conventional quantization-aware training to tailor it for analog in-memory compute hardware. We then demonstrate the deployment of recurrent S4D kernels on memristive crossbar arrays, enabling their computation in an in-memory compute fashion. To our knowledge, this is the first implementation of S4 kernels on in-memory compute hardware.

Title: YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

Authors: Akindele Michael Olawole, Jesujoba O. Alabi, Aderonke Busayo Sakpere, David I. Adelani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20218
Pdf URL: https://arxiv.org/pdf/2412.20218
Copy Paste: [[2412.20218]] YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text(https://arxiv.org/abs/2412.20218)
Keywords: transformer
Abstract: In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer, the T5 model, for Yorùbá and show that this model outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models are better at diacritization for Yorùbá.

Title: LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning

Authors: Shuguang Chen, Guang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20227
Pdf URL: https://arxiv.org/pdf/2412.20227
Copy Paste: [[2412.20227]] LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning(https://arxiv.org/abs/2412.20227)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks but face challenges in mathematical reasoning, where complex problem-solving requires both linguistic understanding and mathematical reasoning skills. Existing approaches to address this challenge often rely on ensemble methods and suffer from the problem of data scarcity in target domains. In this work, we present a novel method to enhance LLMs' capabilities in mathematical reasoning tasks. Motivated by the need to bridge this gap, our approach incorporates a question paraphrase strategy, which aims at diversifying the linguistic forms of mathematical questions to improve generalization. Additionally, specialized training objectives are employed to guide the model's learning process, focusing on enhancing its understanding of mathematical concepts and reasoning processes. We conduct experiments on four datasets using different LLMs, and demonstrate the effectiveness of our approach in improving LLMs' performance on mathematical reasoning tasks. Our findings underscore the significance of our methodology in the advancement of large language models and its potential implications for real-world applications that require mathematical reasoning abilities.

Title: How To Think About End-To-End Encryption and AI: Training, Processing, Disclosure, and Consent

Authors: Mallory Knodel, Andrés Fábrega, Daniella Ferrari, Jacob Leiken, Betty Li Hou, Derek Yen, Sam de Alfaro, Kyunghyun Cho, Sunoo Park
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20231
Pdf URL: https://arxiv.org/pdf/2412.20231
Copy Paste: [[2412.20231]] How To Think About End-To-End Encryption and AI: Training, Processing, Disclosure, and Consent(https://arxiv.org/abs/2412.20231)
Keywords: security, privacy
Abstract: End-to-end encryption (E2EE) has become the gold standard for securing communications, bringing strong confidentiality and privacy guarantees to billions of users worldwide. However, the current push towards widespread integration of artificial intelligence (AI) models, including in E2EE systems, raises some serious security concerns. This work performs a critical examination of the (in)compatibility of AI models and E2EE applications. We explore this on two fronts: (1) the integration of AI "assistants" within E2EE applications, and (2) the use of E2EE data for training AI models. We analyze the potential security implications of each, and identify conflicts with the security guarantees of E2EE. Then, we analyze legal implications of integrating AI models in E2EE applications, given how AI integration can undermine the confidentiality that E2EE promises. Finally, we offer a list of detailed recommendations based on our technical and legal analyses, including: technical design choices that must be prioritized to uphold E2EE security; how service providers must accurately represent E2EE security; and best practices for the default behavior of AI features and for requesting user consent. We hope this paper catalyzes an informed conversation on the tensions that arise between the brisk deployment of AI and the security offered by E2EE, and guides the responsible development of new AI features.

Title: Recommender Engine Driven Client Selection in Federated Brain Tumor Segmentation

Authors: Muhammad Irfan Khan, Elina Kontio, Suleiman A. Khan, Mojtaba Jafaritadi
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.20250
Pdf URL: https://arxiv.org/pdf/2412.20250
Copy Paste: [[2412.20250]] Recommender Engine Driven Client Selection in Federated Brain Tumor Segmentation(https://arxiv.org/abs/2412.20250)
Keywords: robust, federate, segmentation
Abstract: This study presents a robust and efficient client selection protocol designed to optimize the Federated Learning (FL) process for the Federated Tumor Segmentation Challenge (FeTS 2024). In the evolving landscape of FL, the judicious selection of collaborators emerges as a critical determinant for the success and efficiency of collective learning endeavors, particularly in domains requiring high precision. This work introduces a recommender engine framework based on non-negative matrix factorization (NNMF) and a hybrid aggregation approach that blends content-based and collaborative filtering. This method intelligently analyzes historical performance, expertise, and other relevant metrics to identify the most suitable collaborators. This approach not only addresses the cold start problem where new or inactive collaborators pose selection challenges due to limited data but also significantly improves the precision and efficiency of the FL process. Additionally, we propose harmonic similarity weight aggregation (HSimAgg) for adaptive aggregation of model parameters. We utilized a dataset comprising 1,251 multi-parametric magnetic resonance imaging (mpMRI) scans from individuals diagnosed with glioblastoma (GBM) for training purposes and an additional 219 mpMRI scans for external evaluations. Our federated tumor segmentation approach achieved dice scores of 0.7298, 0.7424, and 0.8218 for enhancing tumor (ET), tumor core (TC), and whole tumor (WT) segmentation tasks respectively on the external validation set. In conclusion, this research demonstrates that selecting collaborators with expertise aligned to specific tasks, like brain tumor segmentation, improves the effectiveness of FL networks.
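
Note: the Dice scores reported in the abstract measure overlap between a predicted and a reference segmentation mask. The sketch below shows the standard Dice coefficient on flattened binary masks. It is an illustration only: the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Toy four-voxel masks: one voxel overlaps out of two predicted and two true.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground voxel, so the reported 0.8218 for whole tumor indicates substantially more overlap than the 0.7298 for enhancing tumor.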

Title: ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty

Authors: Qing Zong, Zhaowei Wang, Tianshi Zheng, Xiyu Ren, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20251
Pdf URL: https://arxiv.org/pdf/2412.20251
Copy Paste: [[2412.20251]] ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty(https://arxiv.org/abs/2412.20251)
Keywords: robust
Abstract: The rapid development of LLMs has sparked extensive research into their factual knowledge. Current works claim that LLMs fall short on questions requiring less frequent knowledge. However, their proof is incomplete since they only study the influence of entity frequency, which cannot fully represent knowledge frequency. We therefore introduce the ComparisonQA benchmark, containing 283K abstract questions, each instantiated by a pair of high-frequency and low-frequency entities. It ensures a controllable comparison because the difference in knowledge frequency between such a pair is only related to entity frequency. In addition, to avoid possible semantic shortcuts, a severe problem in current LLM studies, we design a two-round method for knowledge robustness measurement utilizing both correctness and uncertainty. Experiments reveal that LLMs exhibit particularly low robustness regarding low-frequency knowledge, and GPT-4o is even the worst under this measurement. Besides, we introduce an automatic method to filter out low-quality questions and questions with shortcuts to form ComparisonQA-Hard. We find that uncertainty effectively identifies such questions while maintaining the data size.

Title: Election of Collaborators via Reinforcement Learning for Federated Brain Tumor Segmentation

Authors: Muhammad Irfan Khan, Elina Kontio, Suleiman A. Khan, Mojtaba Jafaritadi
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.20253
Pdf URL: https://arxiv.org/pdf/2412.20253
Copy Paste: [[2412.20253]] Election of Collaborators via Reinforcement Learning for Federated Brain Tumor Segmentation(https://arxiv.org/abs/2412.20253)
Keywords: privacy, robust, federate, segmentation
Abstract: Federated learning (FL) enables collaborative model training across decentralized datasets while preserving data privacy. However, optimally selecting participating collaborators in dynamic FL environments remains challenging. We present RL-HSimAgg, a novel reinforcement learning (RL) and similarity-weighted aggregation (simAgg) algorithm using harmonic mean to manage outlier data points. This paper proposes applying multi-armed bandit algorithms to improve collaborator selection and model generalization. By balancing exploration-exploitation trade-offs, these RL methods can promote resource-efficient training with diverse datasets. We demonstrate the effectiveness of Epsilon-greedy (EG) and upper confidence bound (UCB) algorithms for federated brain lesion segmentation. In simulation experiments on internal and external validation sets, RL-HSimAgg with UCB collaborator outperformed the EG method across all metrics, achieving higher Dice scores for Enhancing Tumor (0.7334 vs 0.6797), Tumor Core (0.7432 vs 0.6821), and Whole Tumor (0.8252 vs 0.7931) segmentation. Therefore, for the Federated Tumor Segmentation Challenge (FeTS 2024), we consider UCB as our primary client selection approach in federated Glioblastoma lesion segmentation of multi-modal MRIs. In conclusion, our research demonstrates that RL-based collaborator management, e.g. using UCB, can potentially improve model robustness and flexibility in distributed learning environments, particularly in domains like brain tumor segmentation.
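
Note: the two bandit strategies compared in the abstract can be contrasted with a minimal sketch. It uses the textbook epsilon-greedy and UCB1 selection rules, not the RL-HSimAgg implementation. The function names are hypothetical, and the reward is assumed to be some per-collaborator quality signal (for example a validation Dice score), which the abstract does not specify.

```python
import math
import random


def epsilon_greedy_select(mean_rewards, epsilon, rng):
    """With probability epsilon pick a random arm, else the best mean so far."""
    if rng.random() < epsilon:
        return rng.randrange(len(mean_rewards))
    return max(range(len(mean_rewards)), key=lambda i: mean_rewards[i])


def ucb1_select(mean_rewards, counts, c=2.0):
    """Textbook UCB1: mean reward plus sqrt(c * ln(total pulls) / pulls of arm)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every collaborator once before using the bound
    total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda i: mean_rewards[i] + math.sqrt(c * math.log(total) / counts[i]),
    )


# Collaborator 1 has a slightly better mean but has been picked 100 times;
# UCB1 prefers the rarely tried collaborator 0 because its bound is wider.
ucb_choice = ucb1_select([0.5, 0.6], [1, 100])
greedy_choice = epsilon_greedy_select([0.1, 0.9, 0.3], 0.0, random.Random(0))
```

The difference is in how exploration is spent: epsilon-greedy explores uniformly at random, while UCB1 directs exploration toward collaborators whose estimates are still uncertain.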

Title: An Anomaly Detection System Based on Generative Classifiers for Controller Area Network

Authors: Chunheng Zhao, Stefano Longari, Michele Carminati, Pierluigi Pisu
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20255
Pdf URL: https://arxiv.org/pdf/2412.20255
Copy Paste: [[2412.20255]] An Anomaly Detection System Based on Generative Classifiers for Controller Area Network(https://arxiv.org/abs/2412.20255)
Keywords: security, attack, generative
Abstract: As electronic systems become increasingly complex and prevalent in modern vehicles, securing onboard networks is crucial, particularly as many of these systems are safety-critical. Researchers have demonstrated that modern vehicles are susceptible to various types of attacks, enabling attackers to gain control and compromise safety-critical electronic systems. Consequently, several Intrusion Detection Systems (IDSs) have been proposed in the literature to detect such cyber-attacks on vehicles. This paper introduces a novel generative classifier-based Intrusion Detection System (IDS) designed for anomaly detection in automotive networks, specifically focusing on the Controller Area Network (CAN). Leveraging variational Bayes, our proposed IDS utilizes a deep latent variable model to construct a causal graph for conditional probabilities. An auto-encoder architecture is utilized to build the classifier to estimate conditional probabilities, which contribute to the final prediction probabilities through Bayesian inference. Comparative evaluations against state-of-the-art IDSs on a public Car-hacking dataset highlight our proposed classifier's superior performance in improving detection accuracy and F1-score. The proposed IDS demonstrates its efficacy by outperforming existing models with limited training data, providing enhanced security assurance for automotive systems.

Title: Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues

Authors: Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20264
Pdf URL: https://arxiv.org/pdf/2412.20264
Copy Paste: [[2412.20264]] Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues(https://arxiv.org/abs/2412.20264)
Keywords: large language model
Abstract: In recent years, Large Language Models (LLMs) have become increasingly more powerful in their ability to complete complex tasks. One such task in which LLMs are often employed is scoring, i.e., assigning a numerical value from a certain scale to a subject. In this paper, we strive to understand how LLMs score, specifically in the context of empathy scoring. We develop a novel and comprehensive framework for investigating how effective LLMs are at measuring and scoring empathy of responses in dialogues, and what methods can be employed to deepen our understanding of LLM scoring. Our strategy is to approximate the performance of state-of-the-art and fine-tuned LLMs with explicit and explainable features. We train classifiers using various features of dialogues including embeddings, the Motivational Interviewing Treatment Integrity (MITI) Code, a set of explicit subfactors of empathy as proposed by LLMs, and a combination of the MITI Code and the explicit subfactors. Our results show that when only using embeddings, it is possible to achieve performance close to that of generic LLMs, and when utilizing the MITI Code and explicit subfactors scored by an LLM, the trained classifiers can closely match the performance of fine-tuned LLMs. We employ feature selection methods to derive the most crucial features in the process of empathy scoring. Our work provides a new perspective toward understanding LLM empathy scoring and helps the LLM community explore the potential of LLM scoring in social science studies.

Title: TeLU Activation Function for Fast and Stable Deep Learning

Authors: Alfredo Fernandez, Ankur Mali
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20269
Pdf URL: https://arxiv.org/pdf/2412.20269
Copy Paste: [[2412.20269]] TeLU Activation Function for Fast and Stable Deep Learning(https://arxiv.org/abs/2412.20269)
Keywords: robust, transformer
Abstract: We propose the Hyperbolic Tangent Exponential Linear Unit (TeLU), a neural network hidden activation function defined as TeLU(x) = x · tanh(exp(x)). TeLU's design is grounded in the core principles of key activation functions, achieving strong convergence by closely approximating the identity function in its active region while effectively mitigating the vanishing gradient problem in its saturating region. Its simple formulation enhances computational efficiency, leading to improvements in scalability and convergence speed. Unlike many modern activation functions, TeLU seamlessly combines the simplicity and effectiveness of ReLU with the smoothness and analytic properties essential for learning stability in deep neural networks. TeLU's ability to mimic the behavior and optimal hyperparameter settings of ReLU, while introducing the benefits of smoothness and curvature, makes it an ideal drop-in replacement. Its analytic nature positions TeLU as a powerful universal approximator, enhancing both robustness and generalization across a multitude of experiments. We rigorously validate these claims through theoretical analysis and experimental validation, demonstrating TeLU's performance across challenging benchmarks, including ResNet18 on ImageNet, Dynamic-Pooling Transformers on Text8, and Recurrent Neural Networks (RNNs) on the Penn TreeBank dataset. These results highlight TeLU's potential to set a new standard in activation functions, driving more efficient and stable learning in deep neural networks, thereby accelerating scientific discoveries across various fields.
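
Note: the definition in the abstract, x times tanh(exp(x)), is simple enough to sketch directly. The scalar version below is an illustration, not the authors' implementation. The cutoff at x > 20 is an assumption added here to avoid overflow in exp; beyond it tanh(exp(x)) is 1.0 in double precision. The derivative follows from the product rule: tanh(e^x) + x · e^x · (1 - tanh²(e^x)).

```python
import math


def telu(x):
    """TeLU(x) = x * tanh(exp(x)) for a scalar input."""
    if x > 20.0:
        # Assumed cutoff: tanh(exp(x)) rounds to 1.0 here, so TeLU(x) = x.
        return float(x)
    return x * math.tanh(math.exp(x))


def telu_grad(x):
    """d/dx TeLU(x) = tanh(e^x) + x * e^x * (1 - tanh(e^x)^2)."""
    if x > 20.0:
        return 1.0
    ex = math.exp(x)
    t = math.tanh(ex)
    return t + x * ex * (1.0 - t * t)


# Near-identity for large positive inputs, near-zero for large negative ones.
values = [telu(v) for v in (-10.0, 0.0, 10.0)]
```

For large positive x the function tracks the identity, and for large negative x it decays smoothly toward zero instead of being clipped, which matches the abstract's description of an active and a saturating region.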

Title: Transformer-Based Contrastive Meta-Learning For Low-Resource Generalizable Activity Recognition

Authors: Junyao Wang, Mohammad Abdullah Al Faruque
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20290
Pdf URL: https://arxiv.org/pdf/2412.20290
Copy Paste: [[2412.20290]] Transformer-Based Contrastive Meta-Learning For Low-Resource Generalizable Activity Recognition(https://arxiv.org/abs/2412.20290)
Keywords: transformer
Abstract: Deep learning has been widely adopted for human activity recognition (HAR), yet generalizing a trained model across diverse users and scenarios remains challenging due to distribution shifts (DS). The inherent low-resource challenge in HAR, i.e., collecting and labeling adequate human-involved data can be prohibitively costly, further raises the difficulty of tackling DS. We propose TACO, a novel transformer-based contrastive meta-learning approach for generalizable HAR. TACO addresses DS by synthesizing virtual target domains in training with explicit consideration of model generalizability. Additionally, we extract expressive features with the attention mechanism of the Transformer and incorporate the supervised contrastive loss function within our meta-optimization to enhance representation learning. Our evaluation demonstrates that TACO achieves notably better performance across various low-resource DS scenarios.

Title: An analytic theory of creativity in convolutional diffusion models

Authors: Mason Kamb, Surya Ganguli
Subjects: cs.LG, cond-mat.dis-nn, q-bio.NC, stat.ML
Abstract URL: https://arxiv.org/abs/2412.20292
Pdf URL: https://arxiv.org/pdf/2412.20292
Copy Paste: [[2412.20292]] An analytic theory of creativity in convolutional diffusion models(https://arxiv.org/abs/2412.20292)
Keywords: diffusion
Abstract: We obtain the first analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-based diffusion models can generate highly creative images that lie far from their training data. But optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial creativity by preventing optimal score-matching; (2) result in a fully analytic, completely mechanistically interpretable, equivariant local score (ELS) machine that, (3) without any training can quantitatively predict the outputs of trained convolution only diffusion models (like ResNets and UNets) with high accuracy (median $r^2$ of $0.90, 0.91, 0.94$ on CIFAR10, FashionMNIST, and MNIST). Our ELS machine reveals a locally consistent patch mosaic model of creativity, in which diffusion models create exponentially many novel images by mixing and matching different local training set patches in different image locations. Our theory also partially predicts the outputs of pre-trained self-attention enabled UNets (median $r^2 \sim 0.75$ on CIFAR10), revealing an intriguing role for attention in carving out semantic coherence from local patch mosaics.

Title: An experimental study on fairness-aware machine learning for credit scoring problem

Authors: Huyen Giang Thi Thu, Thang Viet Doan, Tai Le Quy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20298
Pdf URL: https://arxiv.org/pdf/2412.20298
Copy Paste: [[2412.20298]] An experimental study on fairness-aware machine learning for credit scoring problem(https://arxiv.org/abs/2412.20298)
Keywords: protect, fair
Abstract: Digitalization of credit scoring is an essential requirement for financial organizations and commercial banks, especially in the context of digital transformation. Machine learning techniques are commonly used to evaluate customers' creditworthiness. However, the predicted outcomes of machine learning models can be biased toward protected attributes, such as race or gender. Numerous fairness-aware machine learning models and fairness measures have been proposed. Nevertheless, their performance in the context of credit scoring has not been thoroughly investigated. In this paper, we present a comprehensive experimental study of fairness-aware machine learning in credit scoring. The study explores key aspects of credit scoring, including financial datasets, predictive models, and fairness measures. We also provide a detailed evaluation of fairness-aware predictive models and fairness measures on widely used financial datasets.

Title: EXAdam: The Power of Adaptive Cross-Moments

Authors: Ahmed M. Adly
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20302
Pdf URL: https://arxiv.org/pdf/2412.20302
Copy Paste: [[2412.20302]] EXAdam: The Power of Adaptive Cross-Moments(https://arxiv.org/abs/2412.20302)
Keywords: robust
Abstract: This paper introduces EXAdam ($\textbf{EX}$tended $\textbf{Adam}$), a novel optimization algorithm that builds upon the widely-used Adam optimizer. EXAdam incorporates three key enhancements: (1) new debiasing terms for improved moment estimation, (2) a gradient-based acceleration mechanism for increased responsiveness to the current loss landscape, and (3) a dynamic step size formula that allows for continuous growth of the learning rate throughout training. These innovations work synergistically to address limitations of the original Adam algorithm, potentially offering improved convergence properties, enhanced ability to escape saddle points, and greater robustness to hyperparameter choices. I provide a theoretical analysis of EXAdam's components and their interactions, highlighting the algorithm's potential advantages in navigating complex optimization landscapes. Empirical evaluations demonstrate EXAdam's superiority over Adam, achieving 48.07% faster convergence and yielding improvements of 4.6%, 4.13%, and 2.39% in training, validation, and testing accuracies, respectively, when applied to a CNN trained on the CIFAR-10 dataset. While these results are promising, further empirical validation across diverse tasks is essential to fully gauge EXAdam's efficacy. Nevertheless, EXAdam represents a significant advancement in adaptive optimization techniques, with promising implications for a wide range of machine learning applications. This work aims to contribute to the ongoing development of more efficient, adaptive, and universally applicable optimization methods in the field of machine learning and artificial intelligence.

Title: Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

Authors: Shintaro Ozaki, Yuta Kato, Siyuan Feng, Masayo Tomita, Kazuki Hayashi, Ryoma Obara, Masafumi Oyamada, Katsuhiko Hayashi, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20309
Pdf URL: https://arxiv.org/pdf/2412.20309
Copy Paste: [[2412.20309]] Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain(https://arxiv.org/abs/2412.20309)
Keywords: large language model
Abstract: Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. This approach is widely applied in several fields, taking advantage of its ability to inject the most up-to-date information, and researchers are focusing on understanding and improving this aspect to unlock the full potential of RAG in such high-stakes applications. However, despite the potential of RAG to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored, although the confidence of information is critical in some domains, such as finance, healthcare, and medicine. Our study focuses on the impact of RAG on confidence within the medical domain under various configurations and models. We evaluate confidence by treating the model's predicted probability as its output and calculating Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores based on the probabilities and accuracy. In addition, we analyze whether the order of retrieved documents within prompts calibrates the confidence. Our findings reveal large variation in confidence and accuracy depending on the model, settings, and the format of input prompts. These results underscore the necessity of optimizing configurations based on the specific model and conditions.
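
Note: the Expected Calibration Error mentioned in the abstract can be sketched as follows. This is the common equal-width-bin formulation, not the paper's evaluation code. The function name, the default of 10 bins, and the half-open binning [lo, hi) with 1.0 placed in the last bin are assumptions made for this illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally long, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Four answers given with 90% confidence, three of them correct:
# accuracy 0.75 against confidence 0.90 gives an ECE of 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Adaptive Calibration Error follows the same idea but, in its usual formulation, chooses bin edges so that each bin holds the same number of predictions, which makes it less sensitive to confidences that cluster near 1.0.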

Title: Motion Transfer-Driven intra-class data augmentation for Finger Vein Recognition

Authors: Xiu-Feng Huang, Lai-Man Po, Wei-Feng Ou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20327
Pdf URL: https://arxiv.org/pdf/2412.20327
Copy Paste: [[2412.20327]] Motion Transfer-Driven intra-class data augmentation for Finger Vein Recognition(https://arxiv.org/abs/2412.20327)
Keywords: secure, biometric
Abstract: Finger vein recognition (FVR) has emerged as a secure biometric technique because of the confidentiality of vascular bio-information. Recently, deep learning-based FVR has gained increased popularity and achieved promising performance. However, the limited size of public vein datasets has caused overfitting issues and greatly limits the recognition performance. Although traditional data augmentation can partially alleviate this data shortage issue, it cannot capture the real finger posture variations due to the rigid label-preserving image transformations, bringing limited performance improvement. To address this issue, we propose a novel motion transfer (MT) model for finger vein image data augmentation via modeling the actual finger posture and rotational movements. The proposed model first utilizes a key point detector to extract the key point and pose map of the source and drive finger vein images. We then utilize a dense motion module to estimate the motion optical flow, which is fed to an image generation module for generating the image with the target pose. Experiments conducted on three public finger vein databases demonstrate that the proposed motion transfer model can effectively improve recognition accuracy. Code is available at: this https URL.

Title: Dual-Level Precision Edges Guided Multi-View Stereo with Accurate Planarization

Authors: Kehua Chen, Zhenlong Yuan, Tianlu Mao, Zhaoqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20328
Pdf URL: https://arxiv.org/pdf/2412.20328
Copy Paste: [[2412.20328]] Dual-Level Precision Edges Guided Multi-View Stereo with Accurate Planarization(https://arxiv.org/abs/2412.20328)
Keywords: robust
Abstract: The reconstruction of low-textured areas is a prominent research focus in multi-view stereo (MVS). In recent years, traditional MVS methods have performed exceptionally well in reconstructing low-textured areas by constructing plane models. However, these methods often encounter issues such as crossing object boundaries and limited perception ranges, which undermine the robustness of plane model construction. Building on previous work (APD-MVS), we propose the DPE-MVS method. By introducing dual-level precision edge information, including fine and coarse edges, we enhance the robustness of plane model construction, thereby improving reconstruction accuracy in low-textured areas. Furthermore, by leveraging edge information, we refine the sampling strategy in conventional PatchMatch MVS and propose an adaptive patch size adjustment approach to optimize matching cost calculation in both stochastic and low-textured areas. This additional use of edge information allows for more precise and robust matching. Our method achieves state-of-the-art performance on the ETH3D and Tanks & Temples benchmarks. Notably, our method outperforms all published methods on the ETH3D benchmark.

Title: Asynchronous Federated Clustering with Unknown Number of Clusters

Authors: Yunfan Zhang, Yiqun Zhang, Yang Lu, Mengke Li, Xi Chen, Yiu-ming Cheung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20341
Pdf URL: https://arxiv.org/pdf/2412.20341
Copy Paste: [[2412.20341]] Asynchronous Federated Clustering with Unknown Number of Clusters(https://arxiv.org/abs/2412.20341)
Keywords: secure, privacy, federate
Abstract: Federated Clustering (FC) is crucial to mining knowledge from unlabeled non-Independent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, and then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including the heterogeneity in terms of clients' communication capacity and the unknown number of proper clusters $k^*$. To further bridge the gap between FC and real application scenarios, this paper first shows that the clients' communication asynchrony and unknown $k^*$ are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads the excessive number of seed points to the clients as a learning medium and coordinates them across the clients to form a consensus. To alleviate the distribution imbalance cumulated due to the unforeseen asynchronous uploading from the heterogeneous clients, we also design a balancing mechanism for seeds updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.

Title: Deep Learning in Image Classification: Evaluating VGG19's Performance on Complex Visual Data

Authors: Weijie He, Tong Zhou, Yanlin Xiang, Yang Lin, Jiacheng Hu, Runyuan Bao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20345
Pdf URL: https://arxiv.org/pdf/2412.20345
Copy Paste: [[2412.20345]] Deep Learning in Image Classification: Evaluating VGG19's Performance on Complex Visual Data(https://arxiv.org/abs/2412.20345)
Keywords: extraction
Abstract: This study aims to explore the automatic classification method of pneumonia X-ray images based on the VGG19 deep convolutional neural network, and evaluate its application effect in pneumonia diagnosis by comparing with classic models such as SVM, XGBoost, MLP, and ResNet50. The experimental results show that VGG19 performs well in multiple indicators such as accuracy (92%), AUC (0.95), F1 score (0.90) and recall rate (0.87), which is better than other comparison models, especially in image feature extraction and classification accuracy. Although ResNet50 performs well in some indicators, it is slightly inferior to VGG19 in recall rate and F1 score. Traditional machine learning models SVM and XGBoost are obviously limited in image classification tasks, especially in complex medical image analysis tasks, and their performance is relatively mediocre. The research results show that deep learning, especially convolutional neural networks, has significant advantages in medical image classification tasks, especially in pneumonia X-ray image analysis, and can provide efficient and accurate automatic diagnosis support. This research provides strong technical support for the early detection of pneumonia and the development of automated diagnosis systems and also lays the foundation for further promoting the application and development of automated medical image processing technology.

Title: HindiLLM: Large Language Model for Hindi

Authors: Sanjay Chouhan, Shubha Brata Nath, Aparajita Dutta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20357
Pdf URL: https://arxiv.org/pdf/2412.20357
Copy Paste: [[2412.20357]] HindiLLM: Large Language Model for Hindi(https://arxiv.org/abs/2412.20357)
Keywords: large language model
Abstract: Advancements in Large Language Models (LLMs) have helped in solving several problems related to language processing. Most research has focused on the English language only, because of its popularity and abundance on the internet. However, a high-performance language model for Hindi and other Indic languages is lacking in the literature. In this work, we have pre-trained two autoregressive LLMs for the Hindi language, namely HindiLLM-Small and HindiLLM-Medium. We use a two-step process comprising unsupervised pre-training and supervised fine-tuning. First, we create a large and high-quality text corpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding tokenizer, named the HindiLLM tokenizer, using the pre-training text data. We then perform training on the unlabeled data, known as the pre-training step, to get the HindiLLM base models. Furthermore, we perform fine-tuning of the HindiLLM base models for different tasks like sentiment analysis, text classification, natural language inference, and multiple choice question-answer on popular labeled datasets to measure the real-world performance. The evaluation shows that the HindiLLM-based fine-tuned models outperform several models in most of the language related tasks.

Title: Differential Evolution Integrated Hybrid Deep Learning Model for Object Detection in Pre-made Dishes

Authors: Lujia Lv, Di Wu, Yangyi Xia, Jia Wu, Xiaojing Liu, Yi He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20370
Pdf URL: https://arxiv.org/pdf/2412.20370
Copy Paste: [[2412.20370]] Differential Evolution Integrated Hybrid Deep Learning Model for Object Detection in Pre-made Dishes(https://arxiv.org/abs/2412.20370)
Keywords: transformer
Abstract: With the continuous improvement of people's living standards and fast-paced working conditions, pre-made dishes are becoming increasingly popular among families and restaurants due to their advantages of time-saving, convenience, variety, cost-effectiveness, standard quality, etc. Object detection is a key technology for selecting ingredients and evaluating the quality of dishes in the pre-made dishes industry. To date, many object detection approaches have been proposed. However, accurate object detection of pre-made dishes is extremely difficult because of overlapping occlusion of ingredients, similarity of ingredients, and insufficient light in the processing environment. As a result, the recognition scene is relatively complex and thus leads to poor object detection by a single model. To address this issue, this paper proposes a Differential Evolution Integrated Hybrid Deep Learning (DEIHDL) model. The main idea of DEIHDL is three-fold: 1) three YOLO-based and transformer-based base models are developed respectively to increase diversity for detecting objects of pre-made dishes, 2) the three base models are integrated by differential evolution optimized self-adjusting weights, and 3) weighted boxes fusion strategy is employed to score the confidence of the three base models during the integration. As such, DEIHDL possesses the multi-performance originating from the three base models to achieve accurate object detection in complex pre-made dish scenes. Extensive experiments on real datasets demonstrate that the proposed DEIHDL model significantly outperforms the base models in detecting objects of pre-made dishes.

Title: LLM2: Let Large Language Models Harness System 2 Reasoning

Authors: Cheng Yang, Chufan Shi, Siheng Li, Bo Shui, Yujiu Yang, Wai Lam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20372
Pdf URL: https://arxiv.org/pdf/2412.20372
Copy Paste: [[2412.20372]] LLM2: Let Large Language Models Harness System 2 Reasoning(https://arxiv.org/abs/2412.20372)
Keywords: large language model
Abstract: Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We posit that these limitations are rooted in the foundational autoregressive architecture of LLMs, which inherently lacks mechanisms for differentiating between desirable and undesirable results. Drawing inspiration from the dual-process theory of human cognition, we introduce LLM2, a novel framework that combines an LLM (System 1) with a process-based verifier (System 2). Within LLM2, the LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs. The verifier is trained with a pairwise comparison loss on synthetic process-supervision data generated through our token quality exploration strategy. Empirical results on mathematical reasoning benchmarks substantiate the efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8 (+7.5) for Llama3-1B on GSM8K. Furthermore, when combined with self-consistency, LLM2 achieves additional improvements, boosting major@20 accuracy from 56.2 to 70.2 (+14.0).
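Note: the self-consistency result quoted above is reported as majority-vote accuracy over multiple sampled answers per question. The sketch below illustrates only that voting and scoring step, under the assumption that "major@20" means a majority vote over 20 samples; it does not implement LLM2's verifier, and the function names and sample data are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the samples.

    Ties go to the answer seen first, which is how Counter.most_common
    orders equal counts in Python 3.7+.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(sampled_answers, gold_answers):
    """Fraction of questions whose majority-voted answer matches the gold one.

    sampled_answers: one list of sampled final answers per question.
    gold_answers:    one reference answer per question.
    """
    if len(sampled_answers) != len(gold_answers):
        raise ValueError("need one gold answer per question")
    if not gold_answers:
        raise ValueError("need at least one question")
    hits = sum(
        majority_vote(samples) == gold
        for samples, gold in zip(sampled_answers, gold_answers)
    )
    return hits / len(gold_answers)


if __name__ == "__main__":
    # Hypothetical data: two questions, three sampled answers each.
    samples = [["18", "18", "20"], ["5", "7", "7"]]
    gold = ["18", "5"]
    print(majority_vote_accuracy(samples, gold))
```

In the usage lines the first question is voted correctly and the second is not, so the printed accuracy is one half.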

Title: FairDiffusion: Enhancing Equity in Latent Diffusion Models via Fair Bayesian Perturbation

Authors: Yan Luo, Muhammad Osama Khan, Congcong Wen, Muhammad Muneeb Afzal, Titus Fidelis Wuermeling, Min Shi, Yu Tian, Yi Fang, Mengyu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20374
Pdf URL: https://arxiv.org/pdf/2412.20374
Copy Paste: [[2412.20374]] FairDiffusion: Enhancing Equity in Latent Diffusion Models via Fair Bayesian Perturbation(https://arxiv.org/abs/2412.20374)
Keywords: fair, diffusion, generative
Abstract: Recent progress in generative AI, especially diffusion models, has demonstrated significant utility in text-to-image synthesis. Particularly in healthcare, these models offer immense potential in generating synthetic datasets and training medical students. However, despite these strong performances, it remains uncertain if the image generation quality is consistent across different demographic subgroups. To address this critical concern, we present the first comprehensive study on the fairness of medical text-to-image diffusion models. Our extensive evaluations of the popular Stable Diffusion model reveal significant disparities across gender, race, and ethnicity. To mitigate these biases, we introduce FairDiffusion, an equity-aware latent diffusion model that enhances fairness in both image generation quality as well as the semantic correlation of clinical features. In addition, we also design and curate FairGenMed, the first dataset for studying the fairness of medical generative models. Complementing this effort, we further evaluate FairDiffusion on two widely-used external medical datasets: HAM10000 (dermatoscopic images) and CheXpert (chest X-rays) to demonstrate FairDiffusion's effectiveness in addressing fairness concerns across diverse medical imaging modalities. Together, FairDiffusion and FairGenMed significantly advance research in fair generative learning, promoting equitable benefits of generative AI in healthcare.

Title: Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

Authors: Yan Luo, Congcong Wen, Min Shi, Hao Huang, Yi Fang, Mengyu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20377
Pdf URL: https://arxiv.org/pdf/2412.20377
Copy Paste: [[2412.20377]] Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning(https://arxiv.org/abs/2412.20377)
Keywords: fair
Abstract: We present a comprehensive theoretical framework analyzing the relationship between data distributions and fairness guarantees in equitable deep learning. Our work establishes novel theoretical bounds that explicitly account for data distribution heterogeneity across demographic groups, while introducing a formal analysis framework that minimizes expected loss differences across these groups. We derive comprehensive theoretical bounds for fairness errors and convergence rates, and characterize how distributional differences between groups affect the fundamental trade-off between fairness and accuracy. Through extensive experiments on diverse datasets, including FairVision (ophthalmology), CheXpert (chest X-rays), HAM10000 (dermatology), and FairFace (facial recognition), we validate our theoretical findings and demonstrate that differences in feature distributions across demographic groups significantly impact model fairness, with performance disparities particularly pronounced in racial categories. The theoretical bounds we derive corroborate these empirical observations, providing insights into the fundamental limits of achieving fairness in deep learning models when faced with heterogeneous data distributions. This work advances our understanding of fairness in AI-based diagnosis systems and provides a theoretical foundation for developing more equitable algorithms. The code for analysis is publicly available via this https URL.

Title: Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control

Authors: Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, Yiran Zhong
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.20378
Pdf URL: https://arxiv.org/pdf/2412.20378
Copy Paste: [[2412.20378]] Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control(https://arxiv.org/abs/2412.20378)
Keywords: diffusion
Abstract: Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.

Title: Protégé: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)

Authors: Jia Wei Sii, Chee Seng Chan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.20381
Pdf URL: https://arxiv.org/pdf/2412.20381
Copy Paste: [[2412.20381]] Protégé: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)(https://arxiv.org/abs/2412.20381)
Keywords: generative
Abstract: Makeup is no longer confined to physical application; people now use mobile apps to digitally apply makeup to their photos, which they then share on social media. However, while this shift has made makeup more accessible, designing diverse makeup styles tailored to individual faces remains a challenge. This task currently must still be done manually by humans. Existing systems, such as makeup recommendation engines and makeup transfer techniques, are limited in creating innovative makeups for different individuals "intuitively": they require significant user effort and knowledge, and the makeup options available in the app are limited. Our motivation is to address this challenge by proposing Protégé, a new makeup application that leverages a recent class of generative models, GANs, to learn and automatically generate makeup styles. This is a task that existing makeup applications (i.e., makeup recommendation systems using expert systems and makeup transfer methods) are unable to perform. Extensive experiments have been conducted to demonstrate the capability of Protégé in learning and creating diverse makeups, providing a convenient and intuitive approach and marking a significant leap in digital makeup technology!

Title: Natural Language Fine-Tuning

Authors: Jia Liu, Yue Wang, Zhiqi Lin, Min Chen, Yixue Hao, Long Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20382
Pdf URL: https://arxiv.org/pdf/2412.20382
Copy Paste: [[2412.20382]] Natural Language Fine-Tuning(https://arxiv.org/abs/2412.20382)
Keywords: large language model
Abstract: Large language model fine-tuning techniques typically depend on extensive labeled data, external guidance, and feedback, such as human alignment, scalar rewards, and demonstration. However, in practical application, the scarcity of specific knowledge poses unprecedented challenges to existing fine-tuning techniques. In this paper, focusing on fine-tuning tasks in specific domains with limited data, we introduce Natural Language Fine-Tuning (NLFT), which utilizes natural language for fine-tuning for the first time. By leveraging the strong language comprehension capability of the target LM, NLFT attaches the guidance of natural language to the token-level outputs. Then, saliency tokens are identified with calculated probabilities. Since linguistic information is effectively utilized in NLFT, our proposed method significantly reduces training costs. It markedly enhances training efficiency, comprehensively outperforming reinforcement fine-tuning algorithms in accuracy, time-saving, and resource conservation. Additionally, on the macro level, NLFT can be viewed as a token-level fine-grained optimization of SFT, thereby efficiently replacing the SFT process without the need for warm-up (as opposed to ReFT requiring multiple rounds of warm-up with SFT). Compared to SFT, NLFT does not increase the algorithmic complexity, maintaining O(n). Extensive experiments on the GSM8K dataset demonstrate that NLFT, with only 50 data instances, achieves an accuracy increase that exceeds SFT by 219%. Compared to ReFT, the time complexity and space complexity of NLFT are reduced by 78.27% and 92.24%, respectively. The superior technique of NLFT is paving the way for the deployment of various innovative LLM fine-tuning applications when resources are limited at network edges. Our code has been released at this https URL.

Title: Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning

Authors: Zhifang Zhang, Shuo He, Bingquan Shen, Lei Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20392
Pdf URL: https://arxiv.org/pdf/2412.20392
Copy Paste: [[2412.20392]] Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning(https://arxiv.org/abs/2412.20392)
Keywords: defense, attack
Abstract: Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, yet they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we disclose that CLIP's vulnerabilities primarily stem from its excessive encoding of class-irrelevant features, which can compromise the model's visual feature resistivity to input perturbations, making it more susceptible to capturing the trigger patterns inserted by backdoor attacks. Inspired by this finding, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs specially designed deep visual prompt tuning and feature-repelling loss to eliminate excessive class-irrelevant features while simultaneously optimizing cross-entropy loss to maintain clean accuracy. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27% of the parameters relative to CLIP, yet it significantly outperforms state-of-the-art baselines, reducing the attack success rate from 67.53% to 2.76% against SoTA attacks and effectively generalizing its defensive capabilities across multiple datasets.

Title: Open-Sora: Democratizing Efficient Video Production for All

Authors: Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, Yang You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20404
Pdf URL: https://arxiv.org/pdf/2412.20404
Copy Paste: [[2412.20404]] Open-Sora: Democratizing Efficient Video Production for All(https://arxiv.org/abs/2412.20404)
Keywords: diffusion, transformer
Abstract: Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. All resources are publicly available at: this https URL.

Title: A Multidisciplinary Approach to Telegram Data Analysis

Authors: Velizar Varbanov, Kalin Kopanov, Tatiana Atanasova
Subjects: cs.CR, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20406
Pdf URL: https://arxiv.org/pdf/2412.20406
Copy Paste: [[2412.20406]] A Multidisciplinary Approach to Telegram Data Analysis(https://arxiv.org/abs/2412.20406)
Keywords: security, attack
Abstract: This paper presents a multidisciplinary approach to analyzing data from Telegram for early warning information regarding cyber threats. With the proliferation of hacktivist groups utilizing Telegram to disseminate information regarding future cyberattacks or to boast about successful ones, the need for effective data analysis methods is paramount. The primary challenge lies in the vast number of channels and the overwhelming volume of data, necessitating advanced techniques for discerning pertinent risks amidst the noise. To address this challenge, we employ a combination of neural network architectures and traditional machine learning algorithms. These methods are utilized to classify and identify potential cyber threats within the Telegram data. Additionally, sentiment analysis and entity recognition techniques are incorporated to provide deeper insights into the nature and context of the communicated information. The study evaluates the effectiveness of each method in detecting and categorizing cyber threats, comparing their performance and identifying areas for improvement. By leveraging these diverse analytical tools, we aim to enhance early warning systems for cyber threats, enabling more proactive responses to potential security breaches. This research contributes to the ongoing efforts to bolster cybersecurity measures in an increasingly interconnected digital landscape.

Title: Multi-Objective Large Language Model Unlearning

Authors: Zibin Pan, Shuwen Zhang, Yuesheng Zheng, Chi Li, Yuheng Cheng, Junhua Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20412
Pdf URL: https://arxiv.org/pdf/2412.20412
Copy Paste: [[2412.20412]] Multi-Objective Large Language Model Unlearning(https://arxiv.org/abs/2412.20412)
Keywords: large language model
Abstract: Machine unlearning in the domain of large language models (LLMs) has recently attracted great attention; it aims to effectively eliminate undesirable behaviors from LLMs without full retraining from scratch. In this paper, we explore the Gradient Ascent (GA) approach in LLM unlearning, which is a proactive way to decrease the prediction probability of the model on the target data in order to remove their influence. We analyze two challenges that render the process impractical: gradient explosion and catastrophic forgetting. To address these issues, we propose the Multi-Objective Large Language Model Unlearning (MOLLM) algorithm. We first formulate LLM unlearning as a multi-objective optimization problem, in which the cross-entropy loss is modified to the unlearning version to overcome the gradient explosion issue. A common descent update direction is then calculated, which enables the model to forget the target data while preserving the utility of the LLM. Our empirical results verify that MOLLM outperforms the SOTA GA-based LLM unlearning methods in terms of unlearning effect and model utility preservation.

Title: EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

Authors: Daiheng Gao, Shilin Lu, Shaw Walters, Wenbo Zhou, Jiaming Chu, Jie Zhang, Bang Zhang, Mengxi Jia, Jian Zhao, Zhaoxin Fan, Weiming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20413
Pdf URL: https://arxiv.org/pdf/2412.20413
Copy Paste: [[2412.20413]] EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers(https://arxiv.org/abs/2412.20413)
Keywords: diffusion, transformer, generative
Abstract: Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.

Title: Comparative Performance of Advanced NLP Models and LLMs in Multilingual Geo-Entity Detection

Authors: Kalin Kopanov
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.20414
Pdf URL: https://arxiv.org/pdf/2412.20414
Copy Paste: [[2412.20414]] Comparative Performance of Advanced NLP Models and LLMs in Multilingual Geo-Entity Detection(https://arxiv.org/abs/2412.20414)
Keywords: security, extraction, large language model
Abstract: The integration of advanced Natural Language Processing (NLP) methodologies and Large Language Models (LLMs) has significantly enhanced the extraction and analysis of geospatial data from multilingual texts, impacting sectors such as national and international security. This paper presents a comprehensive evaluation of leading NLP models -- SpaCy, XLM-RoBERTa, mLUKE, GeoLM -- and LLMs, specifically OpenAI's GPT 3.5 and GPT 4, within the context of multilingual geo-entity detection. Utilizing datasets from Telegram channels in English, Russian, and Arabic, we examine the performance of these models through metrics such as accuracy, precision, recall, and F1 scores, to assess their effectiveness in accurately identifying geospatial references. The analysis exposes each model's distinct advantages and challenges, underscoring the complexities involved in achieving precise geo-entity identification across varied linguistic landscapes. The conclusions drawn from this experiment aim to direct the enhancement and creation of more advanced and inclusive NLP tools, thus advancing the field of geospatial analysis and its application to global security.

Title: Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment

Authors: Shiyun Chen, Li Lin, Pujin Cheng, ZhiCheng Jin, JianJian Chen, HaiDong Zhu, Kenneth K. Y. Wong, Xiaoying Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20418
Pdf URL: https://arxiv.org/pdf/2412.20418
Copy Paste: [[2412.20418]] Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment(https://arxiv.org/abs/2412.20418)
Keywords: diffusion, segmentation
Abstract: Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality's mask, followed by its use in inpainting to obtain multimodal normal CTs without tumors; synthesis of strictly aligned multimodal CTs with tumors using the latent diffusion model based on multimodal CT features and randomly generated tumor masks; and finally, training the segmentation model, thus eliminating the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods.

Title: Bringing Objects to Life: 4D generation from 3D objects

Authors: Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20422
Pdf URL: https://arxiv.org/pdf/2412.20422
Copy Paste: [[2412.20422]] Bringing Objects to Life: 4D generation from 3D objects(https://arxiv.org/abs/2412.20422)
Keywords: diffusion, generative
Abstract: Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.

Title: Integrating Natural Language Processing Techniques of Text Mining Into Financial System: Applications and Limitations

Authors: Denisa Millo, Blerina Vika, Nevila Baci
Subjects: cs.CL, cs.AI, econ.GN
Abstract URL: https://arxiv.org/abs/2412.20438
Pdf URL: https://arxiv.org/pdf/2412.20438
Copy Paste: [[2412.20438]] Integrating Natural Language Processing Techniques of Text Mining Into Financial System: Applications and Limitations(https://arxiv.org/abs/2412.20438)
Keywords: extraction, interpretability
Abstract: The financial sector, a pivotal force in economic development, increasingly uses intelligent technologies such as natural language processing to enhance data processing and insight extraction. Through a review of literature from 2018 to 2023, this paper explores the use of text mining as a natural language processing technique in various components of the financial system, including asset pricing, corporate finance, derivatives, risk management, and public finance, and highlights, in the discussion section, the specific problems that need to be addressed. We notice that most of the reviewed research combines probabilistic with vector-space models, and text data with numerical data. The most used information processing technique is information classification, and the most used algorithms include long short-term memory and bidirectional encoder models. We also observe that new, specialized algorithms are being developed and that the focus within the financial system is mainly on the asset pricing component. The paper also proposes a path, from an engineering perspective, for researchers who need to analyze financial text. Text mining challenges such as data quality, context adaptation, and model interpretability need to be solved in order to integrate advanced natural language processing models and techniques into financial analysis and prediction. Keywords: Financial System (FS), Natural Language Processing (NLP), Software and Text Engineering, Probabilistic, Vector-Space, Models, Techniques, TextData, Financial Analysis.

Title: Image Augmentation Agent for Weakly Supervised Semantic Segmentation

Authors: Wangyu Wu, Xianglin Qiu, Siqi Song, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20439
Pdf URL: https://arxiv.org/pdf/2412.20439
Copy Paste: [[2412.20439]] Image Augmentation Agent for Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2412.20439)
Keywords: diffusion, large language model, segmentation
Abstract: Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse training images provide WSSS with richer information and help the model understand more comprehensive semantic patterns. Therefore, in this paper, we introduce a novel approach called Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from a data generation perspective. IAA mainly designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability in prompt generation by LLMs, we develop a prompt self-refinement mechanism. It allows LLMs to re-evaluate the rationality of generated prompts to produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.

Title: Enhancing Entertainment Translation for Indian Languages using Adaptive Context, Style and LLMs

Authors: Pratik Rakesh Singh, Mohammadi Zaki, Pankaj Wasnik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20440
Pdf URL: https://arxiv.org/pdf/2412.20440
Copy Paste: [[2412.20440]] Enhancing Entertainment Translation for Indian Languages using Adaptive Context, Style and LLMs(https://arxiv.org/abs/2412.20440)
Keywords: large language model
Abstract: We address the challenging task of neural machine translation (NMT) in the entertainment domain, where the objective is to automatically translate a given dialogue from a source language content to a target language. This task has various applications, particularly in automatic dubbing, subtitling, and other content localization tasks, enabling source content to reach a wider audience. Traditional NMT systems typically translate individual sentences in isolation, without facilitating knowledge transfer of crucial elements such as the context and style from previously encountered sentences. In this work, we emphasize the significance of these fundamental aspects in producing pertinent and captivating translations. We demonstrate their significance through several examples and propose a novel framework for entertainment translation, which, to our knowledge, is the first of its kind. Furthermore, we introduce an algorithm to estimate the context and style of the current session and use these estimations to generate a prompt that guides a Large Language Model (LLM) to generate high-quality translations. Our method is both language and LLM-agnostic, making it a general-purpose tool. We demonstrate the effectiveness of our algorithm through various numerical studies and observe significant improvement in the COMET scores over various state-of-the-art LLMs. Moreover, our proposed method consistently outperforms baseline LLMs in terms of win-ratio.

Title: Sub-optimal Learning in Meta-Classifier Attacks: A Study of Membership Inference on Differentially Private Location Aggregates

Authors: Yuhan Liu, Florent Guepin, Igor Shilov, Yves-Alexandre De Montjoye
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.20456
Pdf URL: https://arxiv.org/pdf/2412.20456
Copy Paste: [[2412.20456]] Sub-optimal Learning in Meta-Classifier Attacks: A Study of Membership Inference on Differentially Private Location Aggregates(https://arxiv.org/abs/2412.20456)
Keywords: privacy, protect, attack, membership infer
Abstract: The widespread collection and sharing of location data, even in aggregated form, raises major privacy concerns. Previous studies used meta-classifier-based membership inference attacks (MIAs) with multi-layer perceptrons (MLPs) to estimate privacy risks in location data, including when protected by differential privacy (DP). In this work, however, we show that a significant gap exists between the expected attack accuracy given by DP and the empirical attack accuracy even with informed attackers (also known as DP attackers), indicating a potential underestimation of the privacy risk. To explore the potential causes for the observed gap, we first propose two new metric-based MIAs: the one-threshold attack and the two-threshold attack. We evaluate their performance on real-world location data and find that different data distributions require different attack strategies for optimal performance: the one-threshold attack is more effective with Gaussian DP noise, while the two-threshold attack performs better with Laplace DP noise. Comparing their performance with one of the MLP-based attack models in previous works shows that the MLP only learns the one-threshold rule, leading to suboptimal performance under Laplace DP noise and an underestimation of the privacy risk. Second, we theoretically prove that MLPs can encode complex rules (e.g., the two-threshold attack rule), which can be learned when given a substantial amount of training data. We conclude by discussing the implications of our findings in practice, including broader applications extending beyond location aggregates to any differentially private datasets containing multiple observations per individual and how techniques such as synthetic data generation and pre-training might enable MLPs to learn more complex optimal rules.

Title: Single-image reflection removal via self-supervised diffusion models

Authors: Zhengyang Lu, Weifan Wang, Tianhao Guo, Feng Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.20466
Pdf URL: https://arxiv.org/pdf/2412.20466
Copy Paste: [[2412.20466]] Single-image reflection removal via self-supervised diffusion models(https://arxiv.org/abs/2412.20466)
Keywords: diffusion
Abstract: Reflections often degrade the visual quality of images captured through transparent surfaces, and reflection removal methods suffer from the shortage of paired real-world data. This paper proposes a hybrid approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) to effectively remove reflections from single images without requiring paired training data. The method introduces a Reflective Removal Network (RRN) that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network (RSN) that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. Experimental results demonstrate the effectiveness of the proposed method on the SIR$^2$, Flash-Based Reflection Removal (FRR) Dataset, and a newly introduced Museum Reflection Removal (MRR) dataset, showing superior performance compared to state-of-the-art methods.

Title: Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding

Authors: Alexander Blatt, Dietrich Klakow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20467
Pdf URL: https://arxiv.org/pdf/2412.20467
Copy Paste: [[2412.20467]] Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding(https://arxiv.org/abs/2412.20467)
Keywords: robust
Abstract: Operational machine-learning based assistant systems must be robust in a wide range of scenarios. This holds especially true for the air-traffic control (ATC) domain. The robustness of an architecture is particularly evident in edge cases, such as high word error rate (WER) transcripts resulting from noisy ATC recordings or partial transcripts due to clipped recordings. To increase the edge-case robustness of call-sign recognition and understanding (CRU), a core task in ATC speech processing, we propose the multimodal call-sign-command recovery model (CCR). The CCR architecture leads to an increase in the edge case performance of up to 15%. We demonstrate this on our second proposed architecture, CallSBERT, a CRU model that has fewer parameters, can be fine-tuned noticeably faster, and is more robust during fine-tuning than the state of the art for CRU. Furthermore, we demonstrate that optimizing for edge cases leads to a significantly higher accuracy across a wide operational range.

Title: JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling

Authors: Haorui Ji, Rong Wang, Taojun Lin, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20470
Pdf URL: https://arxiv.org/pdf/2412.20470
Copy Paste: [[2412.20470]] JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling(https://arxiv.org/abs/2412.20470)
Keywords: diffusion, generative
Abstract: Generative modeling of 3D human bodies has been studied extensively in computer vision. The core is to design a compact latent representation that is both expressive and semantically interpretable, yet existing approaches struggle to achieve both requirements. In this work, we introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control. Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures, modeled by joint positions, and local surface geometries, characterized by features attached to each joint. This disentangled latent space design enables geometric and semantic interpretation, giving users flexible controllability. To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline where two diffusions are employed to model the distribution of skeleton structures and local surface geometries respectively. Extensive experiments are conducted on public datasets, where we demonstrate the effectiveness of the JADE framework in multiple tasks in terms of autoencoding reconstruction accuracy, editing controllability and generation quality compared with existing methods.

Title: Cut the Deadwood Out: Post-Training Model Purification with Selective Module Substitution

Authors: Yao Tong, Weijun Li, Xuanli He, Haolan Zhan, Qiongkai Xu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.20476
Pdf URL: https://arxiv.org/pdf/2412.20476
Copy Paste: [[2412.20476]] Cut the Deadwood Out: Post-Training Model Purification with Selective Module Substitution(https://arxiv.org/abs/2412.20476)
Keywords: defense, attack
Abstract: The success of DNNs often depends on training with large-scale datasets, but building such datasets is both expensive and challenging. Consequently, public datasets from open-source platforms like HuggingFace have become popular, posing significant risks of data poisoning attacks. Existing backdoor defenses in NLP primarily focus on identifying and removing poisoned samples; however, purifying a backdoored model with these sample-cleaning approaches typically requires expensive retraining. Therefore, we propose Greedy Module Substitution (GMS), which identifies and substitutes "deadwood" modules (i.e., components critical to backdoor pathways) in a backdoored model to purify it. Our method relaxes the common dependency of prior model purification methods on clean datasets or clean auxiliary models. When applied to RoBERTa-large under backdoor attacks, GMS demonstrates strong effectiveness across various settings, particularly against widely recognized challenging attacks like LWS, achieving a post-purification attack success rate (ASR) of 9.7% on SST-2 compared to 58.8% for the best baseline approach.

Title: MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation

Authors: Minjae Seong, Jisong Kim, Geonho Bang, Hawook Jeong, Jun Won Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20480
Pdf URL: https://arxiv.org/pdf/2412.20480
Copy Paste: [[2412.20480]] MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation(https://arxiv.org/abs/2412.20480)
Keywords: robust
Abstract: Accurate 3D perception is essential for understanding the environment in autonomous driving. Recent advancements in 3D semantic occupancy prediction have leveraged camera-LiDAR fusion to improve robustness and accuracy. However, current methods allocate computational resources uniformly across all voxels, leading to inefficiency, and they also fail to adequately address occlusions, resulting in reduced accuracy in challenging scenarios. We propose MR-Occ, a novel approach for camera-LiDAR fusion-based 3D semantic occupancy prediction, addressing these challenges through three key components: Hierarchical Voxel Feature Refinement (HVFR), Multi-scale Occupancy Decoder (MOD), and Pixel to Voxel Fusion Network (PVF-Net). HVFR improves performance by enhancing features for critical voxels, reducing computational cost. MOD introduces an 'occluded' class to better handle regions obscured from sensor view, improving accuracy. PVF-Net leverages densified LiDAR features to effectively fuse camera and LiDAR data through a deformable attention mechanism. Extensive experiments demonstrate that MR-Occ achieves state-of-the-art performance on the nuScenes-Occupancy dataset, surpassing previous approaches by +5.2% in IoU and +5.3% in mIoU while using fewer parameters and FLOPs. Moreover, MR-Occ demonstrates superior performance on the SemanticKITTI dataset, further validating its effectiveness and generalizability across diverse 3D semantic occupancy benchmarks.

Title: Multimodal Variational Autoencoder: a Barycentric View

Authors: Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Xiaotong Sun, Jin Yang, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras
Subjects: cs.LG, cs.CV, cs.IT
Abstract URL: https://arxiv.org/abs/2412.20487
Pdf URL: https://arxiv.org/pdf/2412.20487
Copy Paste: [[2412.20487]] Multimodal Variational Autoencoder: a Barycentric View(https://arxiv.org/abs/2412.20487)
Keywords: generative
Abstract: Multiple signal modalities, such as vision and sounds, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular the variational autoencoder (VAE), for multimodal representation learning, especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.
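
Digest note (an illustrative sketch, not taken from the paper): the difference between a product of experts and a 2-Wasserstein barycenter is easy to see for one-dimensional Gaussian experts, where both have well-known closed forms. The function names and the `(mean, std)` tuple representation below are assumptions made for this example only; the sketch omits the prior expert that multimodal VAEs usually include and says nothing about how the authors implement their method.

```python
from typing import List, Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation) of a 1-D Gaussian


def product_of_experts(experts: List[Gaussian]) -> Gaussian:
    """Normalized product of 1-D Gaussian densities.

    Precisions (1 / variance) add up, and the mean is the
    precision-weighted average of the expert means.
    """
    precisions = [1.0 / (std * std) for _, std in experts]
    total_precision = sum(precisions)
    mean = sum(m * p for (m, _), p in zip(experts, precisions)) / total_precision
    return mean, (1.0 / total_precision) ** 0.5


def wasserstein2_barycenter(experts: List[Gaussian], weights: List[float]) -> Gaussian:
    """2-Wasserstein barycenter of 1-D Gaussians (weights must sum to 1).

    In one dimension the barycenter is again Gaussian, with the weighted
    average of the means and the weighted average of the standard deviations.
    """
    mean = sum(w * m for w, (m, _) in zip(weights, experts))
    std = sum(w * s for w, (_, s) in zip(weights, experts))
    return mean, std


if __name__ == "__main__":
    experts = [(0.0, 1.0), (2.0, 3.0)]
    print(product_of_experts(experts))                   # pulled toward the sharper expert
    print(wasserstein2_barycenter(experts, [0.5, 0.5]))  # interpolates mean and spread
```

With these two experts the product concentrates near the low-variance expert and is narrower than either input, while the Wasserstein barycenter sits midway in both location and spread.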

Title: A Multiparty Homomorphic Encryption Approach to Confidential Federated Kaplan Meier Survival Analysis

Authors: Narasimha Raghavan Veeraragavan, Svetlana Boudko, Jan Franz Nygård
Subjects: cs.CR, cs.AI, cs.LG, cs.MS, stat.ML
Abstract URL: https://arxiv.org/abs/2412.20495
Pdf URL: https://arxiv.org/pdf/2412.20495
Copy Paste: [[2412.20495]] A Multiparty Homomorphic Encryption Approach to Confidential Federated Kaplan Meier Survival Analysis(https://arxiv.org/abs/2412.20495)
Keywords: secure, security, privacy, attack, robust, federate
Abstract: The proliferation of healthcare data has expanded opportunities for collaborative research, yet stringent privacy regulations hinder pooling sensitive patient records. We propose a multiparty homomorphic encryption-based framework for privacy-preserving federated Kaplan-Meier survival analysis, offering native floating-point support, a theoretical model, and explicit reconstruction-attack mitigation. Compared to prior work, our framework ensures encrypted federated survival estimates closely match centralized outcomes, supported by formal utility-loss bounds that demonstrate convergence as aggregation and decryption noise diminish. Extensive experiments on the NCCTG Lung Cancer and synthetic Breast Cancer datasets confirm low mean absolute error (MAE) and root mean squared error (RMSE), indicating negligible deviations between encrypted and non-encrypted survival curves. Log-rank and numerical accuracy tests reveal no significant difference between federated encrypted and non-encrypted analyses, preserving statistical validity. A reconstruction-attack evaluation shows smaller federations (2-3 providers) with overlapping data between the institutions are vulnerable, a challenge mitigated by multiparty encryption. Larger federations (5-50 sites) degrade reconstruction accuracy further, with encryption improving confidentiality. Despite an 8-19x computational overhead, threshold-based homomorphic encryption is feasible for moderate-scale deployments, balancing security and runtime. By providing robust privacy guarantees alongside high-fidelity survival estimates, our framework advances the state of the art in secure multi-institutional survival analysis.
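
Digest note (an illustrative sketch, not the authors' code): the quantity being protected here is the standard Kaplan-Meier estimate, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number of subjects still at risk. The plaintext version below shows only that estimator; the function name and input layout are assumptions for this example, and the encryption and federation layers described in the abstract are not modeled.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Plaintext Kaplan-Meier estimator.

    durations[i] is the follow-up time of subject i; events[i] is 1 if the
    event was observed at that time and 0 if the subject was censored.
    Returns (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Four subjects; the one with duration 2 is censored.
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

In a federated setting, each site would contribute its per-time counts d_i and n_i, which is the kind of aggregate the abstract proposes to combine under encryption.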

Title: ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding

Authors: Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.20504
Pdf URL: https://arxiv.org/pdf/2412.20504
Copy Paste: [[2412.20504]] ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding(https://arxiv.org/abs/2412.20504)
Keywords: large language model
Abstract: Video Large Language Models (VideoLLMs) have achieved remarkable progress in video understanding. However, existing VideoLLMs often inherit the limitations of their backbone LLMs in handling long sequences, leading to challenges for long video understanding. Common solutions either simply uniformly sample videos' frames or compress visual tokens, which focus primarily on low-level temporal visual redundancy, overlooking high-level knowledge redundancy. This limits the achievable compression rate with minimal loss. To this end, we introduce a training-free method, ReTaKe, containing two novel modules, DPSelect and PivotKV, to jointly model and reduce both temporal visual redundancy and knowledge redundancy for long video understanding. Specifically, DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception. PivotKV employs the obtained keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores, which are derived from the learned prior knowledge of LLMs. Experiments on the benchmarks VideoMME, MLVU, and LVBench show that ReTaKe can support 4x longer video sequences with minimal performance loss (<1%) and outperform all similar-size VideoLLMs by 3%-5%, even surpassing or on par with much larger ones. Our code is available at this https URL

Title: DPBridge: Latent Diffusion Bridge for Dense Prediction

Authors: Haorui Ji, Taojun Lin, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20506
Pdf URL: https://arxiv.org/pdf/2412.20506
Copy Paste: [[2412.20506]] DPBridge: Latent Diffusion Bridge for Dense Prediction(https://arxiv.org/abs/2412.20506)
Keywords: robust, diffusion, generative
Abstract: Diffusion models have demonstrated remarkable success in dense prediction problems, which aim to model the per-pixel relationship between RGB images and dense signal maps, thanks to their ability to effectively capture complex data distributions. However, initiating the reverse sampling trajectory from an uninformative noise prior introduces limitations such as degraded performance and slow inference speed. In this work, we propose DPBridge, a generative framework that formulates dense prediction tasks as image-conditioned generation problems and establishes a direct mapping between an input image and its corresponding dense map based on a fully tractable diffusion bridge process. This approach addresses the aforementioned limitations in conventional diffusion-based solutions. In addition, we introduce finetuning strategies to adapt our model from a pretrained image diffusion backbone, leveraging its rich visual prior knowledge to facilitate both efficient training and robust generalization ability. Experimental results show that our DPBridge can achieve competitive performance compared to both feed-forward and diffusion-based approaches across various benchmarks, highlighting its effectiveness and adaptability.

Title: Dive into Time-Series Anomaly Detection: A Decade Review

Authors: Paul Boniol, Qinghua Liu, Mingyi Huang, Themis Palpanas, John Paparrizos
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2412.20512
Pdf URL: https://arxiv.org/pdf/2412.20512
Copy Paste: [[2412.20512]] Dive into Time-Series Anomaly Detection: A Decade Review(https://arxiv.org/abs/2412.20512)
Keywords: security
Abstract: Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics. In this regard, time-series anomaly detection has been an important activity, entailing various applications in fields such as cyber security, financial markets, law enforcement, and health care. While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years calls for a structured, general characterization of the research methods for time-series anomaly detection. This survey groups and summarizes existing anomaly detection solutions under a process-centric taxonomy in the time series context. In addition to giving an original categorization of anomaly detection methods, we also perform a meta-analysis of the literature and outline general trends in time-series anomaly detection research.

Title: Goal-Conditioned Data Augmentation for Offline Reinforcement Learning

Authors: Xingshuai Huang, Di Wu, Benoit Boulet
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2412.20519
Pdf URL: https://arxiv.org/pdf/2412.20519
Copy Paste: [[2412.20519]] Goal-Conditioned Data Augmentation for Offline Reinforcement Learning(https://arxiv.org/abs/2412.20519)
Keywords: diffusion, generative
Abstract: Offline reinforcement learning (RL) enables policy learning from pre-collected offline datasets, relaxing the need to interact directly with the environment. However, limited by the quality of offline datasets, it generally fails to learn well-qualified policies in suboptimal datasets. To address datasets with insufficient optimal demonstrations, we introduce Goal-cOnditioned Data Augmentation (GODA), a novel goal-conditioned diffusion-based method for augmenting samples with higher quality. Leveraging recent advancements in generative modeling, GODA incorporates a novel return-oriented goal condition with various selection mechanisms. Specifically, we introduce a controllable scaling technique to provide enhanced return-based guidance during data sampling. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals, thereby maximizing the utility of limited optimal demonstrations. Furthermore, we propose a novel adaptive gated conditioning method for processing noised inputs and conditions, enhancing the capture of goal-oriented guidance. We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA's effectiveness in enhancing data quality and superior performance compared to state-of-the-art data augmentation methods across various offline RL algorithms.

Title: Attacks on the neural network and defense methods

Authors: A. Korenev, G. Belokrylov, B. Lodonova, A. Novokhrestov
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20529
Pdf URL: https://arxiv.org/pdf/2412.20529
Copy Paste: [[2412.20529]] Attacks on the neural network and defense methods(https://arxiv.org/abs/2412.20529)
Keywords: protect, defense, attack
Abstract: This article discusses attacks on a neural network trained on audio data, as well as possible methods of protection against these attacks. FGSM, PGD and CW attacks, as well as data poisoning, are considered. For protection, the Art-IBM and advertorch libraries are considered. The accuracy metrics obtained under each attack are presented.

Title: KVC-onGoing: Keystroke Verification Challenge

Authors: Giuseppe Stragapede, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales, Ivan DeAndres-Tame, Naser Damer, Julian Fierrez, Javier Ortega-Garcia, Alejandro Acien, Nahuel Gonzalez, Andrei Shadrikov, Dmitrii Gordin, Leon Schmitt, Daniel Wimmer, Christoph Großmann, Joerdis Krieger, Florian Heinz, Ron Krestel, Christoffer Mayer, Simon Haberl, Helena Gschrey, Yosuke Yamagishi, Sanjay Saha, Sanka Rasnayaka, Sandareka Wickramanayake, Terence Sim, Weronika Gutfeter, Adam Baran, Mateusz Krzysztoń, Przemysław Jaskóła
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20530
Pdf URL: https://arxiv.org/pdf/2412.20530
Copy Paste: [[2412.20530]] KVC-onGoing: Keystroke Verification Challenge(https://arxiv.org/abs/2412.20530)
Keywords: fair
Abstract: This article presents the Keystroke Verification Challenge - onGoing (KVC-onGoing), on which researchers can easily benchmark their systems in a common platform using large-scale public databases, the Aalto University Keystroke databases, and a standard experimental protocol. The keystroke data consist of tweet-long sequences of variable transcript text from over 185,000 subjects, acquired through desktop and mobile keyboards simulating real-life conditions. The results on the evaluation set of KVC-onGoing have proved the high discriminative power of keystroke dynamics, reaching values as low as 3.33% of Equal Error Rate (EER) and 11.96% of False Non-Match Rate (FNMR) @1% False Match Rate (FMR) in the desktop scenario, and 3.61% of EER and 17.44% of FNMR @1% FMR in the mobile scenario, significantly improving previous state-of-the-art results. Concerning demographic fairness, the analyzed scores reflect the subjects' age and gender to various extents, not negligible in a few cases. The framework runs on CodaLab.

Title: SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech Detection in Memes

Authors: Palash Nandi, Shivam Sharma, Tanmoy Chakraborty
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2412.20541
Pdf URL: https://arxiv.org/pdf/2412.20541
Copy Paste: [[2412.20541]] SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech Detection in Memes(https://arxiv.org/abs/2412.20541)
Keywords: robust
Abstract: Memes act as cryptic tools for sharing sensitive ideas, often requiring contextual knowledge to interpret. This makes moderating multimodal memes challenging, as existing works either lack high-quality datasets on nuanced hate categories or rely on low-quality social media visuals. Here, we curate two novel multimodal hate speech datasets, MHS and MHS-Con, that capture fine-grained hateful abstractions in regular and confounding scenarios, respectively. We benchmark these datasets against several competing baselines. Furthermore, we introduce SAFE-MEME (Structured reAsoning FramEwork), a novel multimodal Chain-of-Thought-based framework employing Q&A-style reasoning (SAFE-MEME-QA) and hierarchical categorization (SAFE-MEME-H) to enable robust hate speech detection in memes. SAFE-MEME-QA outperforms existing baselines, achieving an average improvement of approximately 5% and 4% on MHS and MHS-Con, respectively. In comparison, SAFE-MEME-H achieves an average improvement of 6% on MHS while outperforming only multimodal baselines on MHS-Con. We show that fine-tuning a single-layer adapter within SAFE-MEME-H outperforms fully fine-tuned models in regular fine-grained hateful meme detection. However, the full fine-tuning approach with a Q&A setup is more effective for handling confounding cases. We also systematically examine the error cases, offering valuable insights into the robustness and limitations of the proposed structured reasoning framework for analyzing hateful memes.

Title: Counterfactual Samples Constructing and Training for Commonsense Statements Estimation

Authors: Chong Liu, Zaiwen Feng, Lin Liu, Zhenyun Deng, Jiuyong Li, Ruifang Zhai, Debo Cheng, Li Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20563
Pdf URL: https://arxiv.org/pdf/2412.20563
Copy Paste: [[2412.20563]] Counterfactual Samples Constructing and Training for Commonsense Statements Estimation(https://arxiv.org/abs/2412.20563)
Keywords: large language model
Abstract: Plausibility Estimation (PE) plays a crucial role in enabling language models to objectively comprehend the real world. Large language models (LLMs) demonstrate remarkable capabilities in PE tasks but sometimes produce trivial commonsense errors due to the complexity of commonsense knowledge. They lack two key traits of an ideal PE model: a) Language-explainable: relying on critical word segments for decisions, and b) Commonsense-sensitive: detecting subtle linguistic variations in commonsense. To address these issues, we propose a novel model-agnostic method, referred to as Commonsense Counterfactual Samples Generating (CCSG). By training PE models with CCSG, we encourage them to focus on critical words, thereby enhancing both their language-explainable and commonsense-sensitive capabilities. Specifically, CCSG generates counterfactual samples by strategically replacing key words and introducing low-level dropout within sentences. These counterfactual samples are then incorporated into a sentence-level contrastive training framework to further enhance the model's learning process. Experimental results across nine diverse datasets demonstrate the effectiveness of CCSG in addressing commonsense reasoning challenges, with our CCSG method showing a 3.07% improvement over the SOTA methods.

Title: Towards Neural No-Resource Language Translation: A Comparative Evaluation of Approaches

Authors: Madhavendra Thakur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20584
Pdf URL: https://arxiv.org/pdf/2412.20584
Copy Paste: [[2412.20584]] Towards Neural No-Resource Language Translation: A Comparative Evaluation of Approaches(https://arxiv.org/abs/2412.20584)
Keywords: large language model
Abstract: No-resource languages - those with minimal or no digital representation - pose unique challenges for machine translation (MT). Unlike low-resource languages, which rely on limited but existent corpora, no-resource languages often have fewer than 100 sentences available for training. This work explores the problem of no-resource translation through three distinct workflows: fine-tuning of translation-specific models, in-context learning with large language models (LLMs) using chain-of-reasoning prompting, and direct prompting without reasoning. Using Owens Valley Paiute as a case study, we demonstrate that no-resource translation demands fundamentally different approaches from low-resource scenarios, as traditional approaches to machine translation, such as those that work for low-resource languages, fail. Empirical results reveal that, although traditional approaches fail, the in-context learning capabilities of general-purpose large language models enable no-resource language translation that outperforms low-resource translation approaches and rivals human translations (BLEU 0.45-0.6); specifically, chain-of-reasoning prompting outperforms other methods for larger corpora, while direct prompting exhibits advantages in smaller datasets. As these approaches are language-agnostic, they have potential to be generalized to translation tasks from a wide variety of no-resource languages without expert input. These findings establish no-resource translation as a distinct paradigm requiring innovative solutions, providing practical and theoretical insights for language preservation.

Title: Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection

Authors: Dmitri Roussinov, Serge Sharoff, Nadezhda Puchnina
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20595
Pdf URL: https://arxiv.org/pdf/2412.20595
Copy Paste: [[2412.20595]] Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection(https://arxiv.org/abs/2412.20595)
Keywords: large language model
Abstract: This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: 1) genre classification and 2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.

Title: Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)

Authors: Tomer Garber, Tom Tirer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20596
Pdf URL: https://arxiv.org/pdf/2412.20596
Copy Paste: [[2412.20596]] Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)(https://arxiv.org/abs/2412.20596)
Keywords: diffusion, generative
Abstract: In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently require many Neural Function Evaluations (NFEs) to perform well, which may be attributed to the many NFEs needed in the original generative functionality of the DMs. Recently, faster variants of DMs have been explored for image generation. These include Consistency Models (CMs), which can generate samples via a couple of NFEs. However, existing works that use guided CMs for restoration still require tens of NFEs or per-task fine-tuning of the model, which leads to a performance drop if the assumptions made during fine-tuning are not accurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and operates well with as few as 4 NFEs. It is based on a wise combination of several ingredients: better initialization, back-projection guidance, and above all a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution, deblurring and inpainting. Interestingly, we show that the usefulness of our noise injection technique goes beyond CMs: it can also mitigate the performance degradation of existing guided DM methods when reducing their NFE count.

Title: MATEY: multiscale adaptive foundation models for spatiotemporal physical systems

Authors: Pei Zhang, M. Paul Laiu, Matthew Norman, Doug Stefanski, John Gounley
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2412.20601
Pdf URL: https://arxiv.org/pdf/2412.20601
Copy Paste: [[2412.20601]] MATEY: multiscale adaptive foundation models for spatiotemporal physical systems(https://arxiv.org/abs/2412.20601)
Keywords: transformer
Abstract: Accurate representation of the multiscale features in spatiotemporal physical systems using vision transformer (ViT) architectures requires extremely long, computationally prohibitive token sequences. To address this issue, we propose two adaptive tokenization schemes that dynamically adjust patch sizes based on local features: one ensures convergent behavior to uniform patch refinement, while the other offers better computational efficiency. Moreover, we present a set of spatiotemporal attention schemes, where the temporal or axial spatial dimensions are decoupled, and evaluate their computational and data efficiencies. We assess the performance of the proposed multiscale adaptive model, MATEY, in a sequence of experiments. The results show that adaptive tokenization schemes achieve improved accuracy without significantly increasing the length of the token sequence. Compared to a full spatiotemporal attention scheme or a scheme that decouples only the temporal dimension, we find that fully decoupled axial attention is less efficient and expressive, requiring more training time and model weights to achieve the same accuracy. Finally, we demonstrate in two fine-tuning tasks featuring different physics that models pretrained on PDEBench data outperform the ones trained from scratch, especially in the low data regime with frozen attention.

Title: NLP-based Regulatory Compliance -- Using GPT 4.0 to Decode Regulatory Documents

Authors: Bimal Kumar, Dmitri Roussinov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20602
Pdf URL: https://arxiv.org/pdf/2412.20602
Copy Paste: [[2412.20602]] NLP-based Regulatory Compliance -- Using GPT 4.0 to Decode Regulatory Documents(https://arxiv.org/abs/2412.20602)
Keywords: large language model
Abstract: Large Language Models (LLMs) such as GPT-4.0 have shown significant promise in addressing the semantic complexities of regulatory documents, particularly in detecting inconsistencies and contradictions. This study evaluates GPT-4.0's ability to identify conflicts within regulatory requirements by analyzing a curated corpus with artificially injected ambiguities and contradictions, designed in collaboration with architects and compliance engineers. Using metrics such as precision, recall, and F1 score, the experiment demonstrates GPT-4.0's effectiveness in detecting inconsistencies, with findings validated by human experts. The results highlight the potential of LLMs to enhance regulatory compliance processes, though further testing with larger datasets and domain-specific fine-tuning is needed to maximize accuracy and practical applicability. Future work will explore automated conflict resolution and real-world implementation through pilot projects with industry partners.

Title: Privacy-Preserving Identity and Access Management in Multiple Cloud Environments: Models, Issues, and Solutions

Authors: Alfredo Cuzzocrea, Islam Belmerabet
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.20603
Pdf URL: https://arxiv.org/pdf/2412.20603
Copy Paste: [[2412.20603]] Privacy-Preserving Identity and Access Management in Multiple Cloud Environments: Models, Issues, and Solutions(https://arxiv.org/abs/2412.20603)
Keywords: privacy
Abstract: This paper focuses on privacy-preserving identity and access management in multiple Cloud environments, which is an annoying problem in the modern big data era. Within this conceptual context, the paper describes contemporaneous models and issues, and lays the basis for future solid solutions. Finally, we provide a summary table where we embed an innovative taxonomy of state-of-the-art research proposals in the reference scientific field.

Title: Do Current Video LLMs Have Strong OCR Abilities? A Preliminary Study

Authors: Yulin Fei, Yuhui Gao, Xingyuan Xian, Xiaojin Zhang, Tao Wu, Wei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20613
Pdf URL: https://arxiv.org/pdf/2412.20613
Copy Paste: [[2412.20613]] Do Current Video LLMs Have Strong OCR Abilities? A Preliminary Study(https://arxiv.org/abs/2412.20613)
Keywords: large language model
Abstract: With the rise of multimodal large language models, accurately extracting and understanding textual information from video content, referred to as video-based optical character recognition (Video OCR), has become a crucial capability. This paper introduces a novel benchmark designed to evaluate the Video OCR performance of multimodal models. Comprising 1,028 videos and 2,961 question-answer pairs, this benchmark poses several key challenges through 6 distinct subtasks: (1) Recognition of text content itself and its basic visual attributes, (2) Semantic and Spatial Comprehension of OCR objects in videos, (3) Dynamic Motion detection and Temporal Localization. We developed this benchmark using a semi-automated approach that integrates the OCR ability of image LLMs with manual refinement, balancing efficiency, cost, and data quality. Our resource aims to help advance research in video LLMs and underscores the need for improving OCR ability for video LLMs. The benchmark will be released on this https URL.

Title: FreqMixFormerV2: Lightweight Frequency-aware Mixed Transformer for Human Skeleton Action Recognition

Authors: Wenhan Wu, Pengfei Wang, Chen Chen, Aidong Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20621
Pdf URL: https://arxiv.org/pdf/2412.20621
Copy Paste: [[2412.20621]] FreqMixFormerV2: Lightweight Frequency-aware Mixed Transformer for Human Skeleton Action Recognition(https://arxiv.org/abs/2412.20621)
Keywords: robust, transformer
Abstract: Transformer-based human skeleton action recognition has been developed for years. However, the complexity and high parameter count demands of these models hinder their practical applications, especially in resource-constrained environments. In this work, we propose FreqMixFormerV2, which is built upon the Frequency-aware Mixed Transformer (FreqMixFormer) for identifying subtle and discriminative actions with pioneering frequency-domain analysis. We design a lightweight architecture that maintains robust performance while significantly reducing the model complexity. This is achieved through a redesigned frequency operator that optimizes high-frequency and low-frequency parameter adjustments, and a simplified frequency-aware attention module. These improvements result in a substantial reduction in model parameters, enabling efficient deployment with only a minimal sacrifice in accuracy. Comprehensive evaluations on standard datasets (NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets) demonstrate that the proposed model achieves a superior balance between efficiency and accuracy, outperforming state-of-the-art methods with only 60% of the parameters.

Title: HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models

Authors: Ashish Seth, Dinesh Manocha, Chirag Agarwal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20622
Pdf URL: https://arxiv.org/pdf/2412.20622
Copy Paste: [[2412.20622]] HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models(https://arxiv.org/abs/2412.20622)
Keywords: attack
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in performing complex multimodal tasks. However, they are still plagued by object hallucination: the misidentification or misclassification of objects present in images. To this end, we propose HALLUCINOGEN, a novel visual question answering (VQA) object hallucination attack benchmark that utilizes diverse contextual reasoning prompts to evaluate object hallucination in state-of-the-art LVLMs. We design a series of contextual reasoning hallucination prompts to evaluate LVLMs' ability to accurately identify objects in a target image while asking them to perform diverse visual-language tasks such as identifying, locating or performing visual reasoning around specific objects. Further, we extend our benchmark to high-stakes medical applications and introduce MED-HALLUCINOGEN, hallucination attacks tailored to the biomedical domain, and evaluate the hallucination performance of LVLMs on medical images, a critical area where precision is crucial. Finally, we conduct extensive evaluations of eight LVLMs and two hallucination mitigation strategies across multiple datasets to show that current generic and medical LVLMs remain susceptible to hallucination attacks.

Title: NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics

Authors: Jiawei Zhou, Woojeong Kim, Zhiying Xu, Alexander M. Rush, Minlan Yu
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.20635
Pdf URL: https://arxiv.org/pdf/2412.20635
Copy Paste: [[2412.20635]] NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics(https://arxiv.org/abs/2412.20635)
Keywords: attack, generative
Abstract: Understanding the traffic dynamics in networks is a core capability for automated systems to monitor and analyze networking behaviors, reducing expensive human efforts and economic risks through tasks such as traffic classification, congestion prediction, and attack detection. However, it is still challenging to accurately model network traffic with machine learning approaches in an efficient and broadly applicable manner. Task-specific models trained from scratch are used for different networking applications, which limits the efficiency of model development and generalization of model deployment. Furthermore, while networking data is abundant, high-quality task-specific labels are often insufficient for training individual models. Large-scale self-supervised learning on unlabeled data provides a natural pathway for tackling these challenges. We propose to pre-train a general-purpose machine learning model to capture traffic dynamics with only traffic data from NetFlow records, with the goal of fine-tuning for different downstream tasks with a small amount of labels. Our presented NetFlowGen framework goes beyond a proof-of-concept for network traffic pre-training and addresses specific challenges such as unifying network feature representations, learning from large unlabeled traffic data volume, and testing on real downstream tasks in DDoS attack detection. Experiments demonstrate promising results of our pre-training framework on capturing traffic dynamics and adapting to different networking tasks.

Title: Knowledge Editing for Large Language Model with Knowledge Neuronal Ensemble

Authors: Yongchang Li, Yujin Zhu, Tao Yan, Shijian Fan, Gang Wu, Liang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20637
Pdf URL: https://arxiv.org/pdf/2412.20637
Copy Paste: [[2412.20637]] Knowledge Editing for Large Language Model with Knowledge Neuronal Ensemble(https://arxiv.org/abs/2412.20637)
Keywords: large language model
Abstract: As real-world knowledge is constantly evolving, ensuring the timeliness and accuracy of a model's knowledge is crucial. This has made knowledge editing in large language models increasingly important. However, existing knowledge editing methods face several challenges, including parameter localization coupling, imprecise localization, and a lack of dynamic interaction across layers. In this paper, we propose a novel knowledge editing method called Knowledge Neuronal Ensemble (KNE). A knowledge neuronal ensemble represents a group of neurons encoding specific knowledge, thus mitigating the issue of frequent parameter modification caused by coupling in parameter localization. The KNE method enhances the precision and accuracy of parameter localization by computing gradient attribution scores for each parameter at each layer. During the editing process, only the gradients and losses associated with the knowledge neuronal ensemble are computed, with error backpropagation performed accordingly, ensuring dynamic interaction and collaborative updates among parameters. Experimental results on three widely used knowledge editing datasets show that the KNE method significantly improves the accuracy of knowledge editing and achieves, or even exceeds, the performance of the best baseline methods in portability and locality metrics.

Title: SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy

Authors: Md Mahadi Hasan Nahid, Sadid Bin Hasan
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.20641
Pdf URL: https://arxiv.org/pdf/2412.20641
Copy Paste: [[2412.20641]] SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy(https://arxiv.org/abs/2412.20641)
Keywords: privacy, protect, attack, membership infer, large language model
Abstract: Machine learning (ML) models frequently rely on training data that may include sensitive or personal information, raising substantial privacy concerns. Legislative frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the development of strategies that preserve privacy while maintaining the utility of data. In this paper, we investigate the capability of Large Language Models (LLMs) to generate synthetic datasets integrated with Differential Privacy (DP) mechanisms, thereby enabling data-driven research and model training without direct exposure of sensitive information. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data. To substantiate privacy guarantees, we assess the resilience of the generated synthetic data to membership inference attacks and related threats. The experimental results demonstrate that integrating DP within LLM-driven synthetic data generation offers a viable balance between privacy protection and data utility. This study provides a foundational methodology and insight into the privacy-preserving capabilities of LLMs, paving the way for compliant and effective ML research and applications.
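
The abstract names Laplace and Gaussian noise injection but does not say where in the generation pipeline the noise is applied. As a generic illustration only, and not the authors' method, the sketch below shows the two textbook mechanisms for releasing a single numeric value. The function names are hypothetical, and the Gaussian calibration is the classic one, usually stated for 0 < epsilon < 1.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Epsilon-DP release of a numeric value with the given L1 sensitivity."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classic noise scale for the (epsilon, delta)-DP Gaussian mechanism."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """(epsilon, delta)-DP release of a value with the given L2 sensitivity."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))


rng = random.Random(0)  # fixed seed so the example is reproducible
noisy_count = laplace_mechanism(100.0, sensitivity=1.0, epsilon=0.5, rng=rng)
noisy_mean = gaussian_mechanism(0.42, sensitivity=0.01, epsilon=0.5, delta=1e-5, rng=rng)
```

A smaller epsilon means a larger noise scale in both mechanisms, which is the privacy-utility trade-off the abstract refers to.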

Title: Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis

Authors: Yousef Yeganeh, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, Azade Farshad, Ehsan Adeli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20651
Pdf URL: https://arxiv.org/pdf/2412.20651
Copy Paste: [[2412.20651]] Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis(https://arxiv.org/abs/2412.20651)
Keywords: privacy, diffusion
Abstract: Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues, which contradicts one of the main applications of such models to produce synthetic samples where real data is scarce. Also, finetuning on pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models that can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift or employed in inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be publicly released upon its acceptance.

Title: Overcoming Class Imbalance: Unified GNN Learning with Structural and Semantic Connectivity Representations

Authors: Abdullah Alchihabi, Hao Yan, Yuhong Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20656
Pdf URL: https://arxiv.org/pdf/2412.20656
Copy Paste: [[2412.20656]] Overcoming Class Imbalance: Unified GNN Learning with Structural and Semantic Connectivity Representations(https://arxiv.org/abs/2412.20656)
Keywords: diffusion
Abstract: Class imbalance is pervasive in real-world graph datasets, where the majority of annotated nodes belong to a small set of classes (majority classes), leaving many other classes (minority classes) with only a handful of labeled nodes. Graph Neural Networks (GNNs) suffer from significant performance degradation in the presence of class imbalance, exhibiting bias towards majority classes and struggling to generalize effectively on minority classes. This limitation stems, in part, from the message passing process, leading GNNs to overfit to the limited neighborhood of annotated nodes from minority classes and impeding the propagation of discriminative information throughout the entire graph. In this paper, we introduce a novel Unified Graph Neural Network Learning (Uni-GNN) framework to tackle class-imbalanced node classification. The proposed framework seamlessly integrates both structural and semantic connectivity representations through semantic and structural node encoders. By combining these connectivity types, Uni-GNN extends the propagation of node embeddings beyond immediate neighbors, encompassing non-adjacent structural nodes and semantically similar nodes, enabling efficient diffusion of discriminative information throughout the graph. Moreover, to harness the potential of unlabeled nodes within the graph, we employ a balanced pseudo-label generation mechanism that augments the pool of available labeled nodes from minority classes in the training set. Experimental results underscore the superior performance of our proposed Uni-GNN framework compared to state-of-the-art class-imbalanced graph learning baselines across multiple benchmark datasets.

Title: Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model

Authors: Yonghao Zhang, Qiang He, Yanguang Wan, Yinda Zhang, Xiaoming Deng, Cuixia Ma, Hongan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20657
Pdf URL: https://arxiv.org/pdf/2412.20657
Copy Paste: [[2412.20657]] Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model(https://arxiv.org/abs/2412.20657)
Keywords: diffusion
Abstract: Generating high-quality whole-body human object interaction motion sequences is becoming increasingly important in various fields such as animation, VR/AR, and robotics. The main challenge of this task lies in determining the level of involvement of each hand given the complex shapes of objects in different sizes and their different motion trajectories, while ensuring strong grasping realism and guaranteeing the coordination of movement in all body parts. Contrasting with existing work, which either generates human interaction motion sequences without detailed hand grasping poses or only models a static grasping pose, we propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences within a single diffusion model. To guide our network in perceiving the object's spatial position and learning more natural grasping poses, we introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences.

Title: Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner

Authors: Yitong Zhou, Mingyue Cheng, Qingyang Mao, Qi Liu, Feiyang Xu, Xin Li, Enhong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20662
Pdf URL: https://arxiv.org/pdf/2412.20662
Copy Paste: [[2412.20662]] Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner(https://arxiv.org/abs/2412.20662)
Keywords: large language model
Abstract: Pre-trained foundation models have recently significantly progressed in structured table understanding and reasoning. However, despite advancements in areas such as table semantic understanding and table question answering, recognizing the structure and content of unstructured tables using Vision Large Language Models (VLLMs) remains under-explored. In this work, we address this research gap by employing VLLMs in a training-free reasoning paradigm. First, we design a benchmark with various hierarchical dimensions relevant to table recognition. Subsequently, we conduct in-depth evaluations using pre-trained VLLMs, finding that low-quality image input is a significant bottleneck in the recognition process. Drawing inspiration from these findings, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which is characterized by integrating multiple lightweight models for low-level visual processing operations aimed at mitigating issues with low-quality input images. Specifically, we utilize a neighbor retrieval mechanism to guide the generation of multiple tool invocation plans, transferring tool selection experiences from similar neighbors to the given input, thereby facilitating suitable tool selection. Additionally, we introduce a reflection module to supervise the tool invocation process. Extensive experiments on public table recognition datasets demonstrate that our approach significantly enhances the recognition capabilities of the vanilla VLLMs. We believe that the designed benchmark and the proposed NGTR framework could provide an alternative solution in table recognition.

Title: Prototypical Distillation and Debiased Tuning for Black-box Unsupervised Domain Adaptation

Authors: Jian Liang, Lijun Sheng, Hongmin Liu, Ran He
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.20670
Pdf URL: https://arxiv.org/pdf/2412.20670
Copy Paste: [[2412.20670]] Prototypical Distillation and Debiased Tuning for Black-box Unsupervised Domain Adaptation(https://arxiv.org/abs/2412.20670)
Keywords: attack
Abstract: Unsupervised domain adaptation aims to transfer knowledge from a related, label-rich source domain to an unlabeled target domain, thereby circumventing the high costs associated with manual annotation. Recently, there has been growing interest in source-free domain adaptation, a paradigm in which only a pre-trained model, rather than the labeled source data, is provided to the target domain. Given the potential risk of source data leakage via model inversion attacks, this paper introduces a novel setting called black-box domain adaptation, where the source model is accessible only through an API that provides the predicted label along with the corresponding confidence value for each query. We develop a two-step framework named Prototypical Distillation and Debiased tuning (ProDDing). In the first step, ProDDing leverages both the raw predictions from the source model and prototypes derived from the target domain as teachers to distill a customized target model. In the second step, ProDDing keeps fine-tuning the distilled model by penalizing logits that are biased toward certain classes. Empirical results across multiple benchmarks demonstrate that ProDDing outperforms existing black-box domain adaptation methods. Moreover, in the case of hard-label black-box domain adaptation, where only predicted labels are available, ProDDing achieves significant improvements over these methods. Code will be available at this https URL.

Title: Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

Authors: Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20677
Pdf URL: https://arxiv.org/pdf/2412.20677
Copy Paste: [[2412.20677]] Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA(https://arxiv.org/abs/2412.20677)
Keywords: large language model
Abstract: Large language models have been shown to perform well on a variety of natural language processing problems. However, as the model size and the input sequence length increase, the rapid growth of the KV cache significantly slows down inference. Therefore, the GQA model, as an alternative to the MHA model, has been widely introduced into LLMs. In this work, we propose a low-cost method for pruning MHA models into GQA models with any compression ratio of key-value heads. Our method is based on $\mathit{L_0}$ masks to gradually remove redundant parameters. In addition, before pruning training, we apply orthogonal transformations to attention heads, which leave the model unchanged, to increase the similarity between attention heads and further improve model performance. Our method is compatible with rotary position embedding (RoPE), which means the model after training can be fully adapted to the mainstream standard GQA framework. Experiments demonstrate that our strategy can compress up to 87.5% of the key-value heads of the LLaMA2-7B model without too much performance degradation, using only supervised fine-tuning.
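
For readers unfamiliar with the MHA-to-GQA setting, the following is a minimal sketch of what compressing key-value heads means. It assumes 32 attention heads and merges consecutive heads by plain mean-pooling, which is the simple baseline commonly used in the GQA literature, not the $\mathit{L_0}$-mask pruning or the orthogonal alignment described in the abstract. The function names and the toy head vectors are hypothetical.

```python
def kv_heads_after_compression(num_heads: int, compression_ratio: float) -> int:
    """Number of key-value heads left after removing the given fraction."""
    kept = round(num_heads * (1.0 - compression_ratio))
    if kept < 1 or num_heads % kept != 0:
        raise ValueError("compression ratio must leave a divisor of num_heads")
    return kept


def mean_pool_heads(heads: list[list[float]], num_groups: int) -> list[list[float]]:
    """Merge consecutive heads into num_groups shared heads by averaging.

    This is the mean-pooling baseline, used here only for illustration.
    """
    group_size = len(heads) // num_groups
    merged = []
    for g in range(num_groups):
        group = heads[g * group_size:(g + 1) * group_size]
        # Average each coordinate across the heads in this group.
        merged.append([sum(col) / group_size for col in zip(*group)])
    return merged


# Assumed configuration: 32 heads, 87.5% of key-value heads removed.
kv_heads = kv_heads_after_compression(32, 0.875)

# Toy stand-ins for per-head key (or value) projections, flattened to 2 numbers.
toy_heads = [[float(i), float(2 * i)] for i in range(32)]
shared = mean_pool_heads(toy_heads, kv_heads)
```

Under these assumptions, every group of 8 query heads would share a single key-value head after conversion.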

Title: Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks

Authors: Yuhe Ding, Bo Jiang, Aihua Zheng, Qin Xu, Jian Liang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20682
Pdf URL: https://arxiv.org/pdf/2412.20682
Copy Paste: [[2412.20682]] Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks(https://arxiv.org/abs/2412.20682)
Keywords: large language model
Abstract: Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. However, selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting, relying on a supervised large-scale dataset and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of \textbf{unsupervised vision-language model selection}, where only unsupervised downstream datasets are available, with no additional information provided. To solve this problem, we propose a method termed Visual-tExtual Graph Alignment (VEGA), to select VLMs without any annotations by measuring the alignment of the VLM between the two modalities on the downstream task. VEGA is motivated by the pretraining paradigm of VLMs, which aligns features with the same semantics from the visual and textual modalities, thereby mapping both modalities into a shared representation space. Specifically, we first construct two graphs on the vision and textual features, respectively. VEGA is then defined as the overall similarity between the visual and textual graphs at both node and edge levels. Extensive experiments across three different benchmarks, covering a variety of application scenarios and downstream datasets, demonstrate that VEGA consistently provides reliable and accurate estimates of VLMs' performance on unlabeled downstream tasks.

Title: HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images

Authors: Sungik Choi, Sungwoo Park, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20704
Pdf URL: https://arxiv.org/pdf/2412.20704
Copy Paste: [[2412.20704]] HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images(https://arxiv.org/abs/2412.20704)
Keywords: watermark, diffusion, generative
Abstract: Dramatic advances in the quality of the latent diffusion models (LDMs) also led to the malicious use of AI-generated images. While current AI-generated image detection methods assume the availability of real/AI-generated images for training, this is practically limited given the vast expressibility of LDMs. This motivates the training-free detection setup where no related data are available in advance. The existing LDM-generated image detection method assumes that images generated by LDM are easier to reconstruct using an autoencoder than real images. However, we observe that this reconstruction distance is overfitted to background information, leading the current method to underperform in detecting images with simple backgrounds. To address this, we propose a novel method called HFI. Specifically, by viewing the autoencoder of LDM as a downsampling-upsampling kernel, HFI measures the extent of aliasing, a distortion of high-frequency information that appears in the reconstructed image. HFI is training-free, efficient, and consistently outperforms other training-free methods in detecting challenging images generated by various generative models. We also show that HFI can successfully detect the images generated from the specified LDM as a means of implicit watermarking. HFI outperforms the best baseline method while achieving magnitudes of

Title: M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs

Authors: Bei Yan, Jie Zhang, Zhiyuan Chen, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20718
Pdf URL: https://arxiv.org/pdf/2412.20718
Copy Paste: [[2412.20718]] M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs(https://arxiv.org/abs/2412.20718)
Keywords: diffusion, large language model
Abstract: Recently, large foundation models, including large language models (LLMs) and large vision-language models (LVLMs), have become essential tools in critical fields such as law, finance, and healthcare. As these models increasingly integrate into our daily life, it is necessary to conduct moral evaluation to ensure that their outputs align with human values and remain within moral boundaries. Previous works primarily focus on LLMs, proposing moral datasets and benchmarks limited to text modality. However, given the rapid development of LVLMs, there is still a lack of multimodal moral evaluation methods. To bridge this gap, we introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model, SD3.0, to create corresponding scenario images. It conducts moral evaluation across six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response, providing a comprehensive assessment of model performance in multimodal moral understanding and reasoning. Extensive experiments on 10 popular open-source and closed-source LVLMs demonstrate that M$^3$oralBench is a challenging benchmark, exposing notable moral limitations in current models. Our benchmark is publicly available.

Title: Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling

Authors: Min Zhang, Zilin Wang, Liyan Chen, Kunhong Liu, Juncong Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20725
Pdf URL: https://arxiv.org/pdf/2412.20725
Copy Paste: [[2412.20725]] Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling(https://arxiv.org/abs/2412.20725)
Keywords: diffusion
Abstract: Recent advances in AI-driven storytelling have enhanced video generation and story visualization. However, translating dialogue-centric scripts into coherent storyboards remains a significant challenge due to limited script detail, inadequate physical context understanding, and the complexity of integrating cinematic principles. To address these challenges, we propose Dialogue Visualization, a novel task that transforms dialogue scripts into dynamic, multi-view storyboards. We introduce Dialogue Director, a training-free multimodal framework comprising a Script Director, Cinematographer, and Storyboard Maker. This framework leverages large multimodal models and diffusion-based architectures, employing techniques such as Chain-of-Thought reasoning, Retrieval-Augmented Generation, and multi-view synthesis to improve script understanding, physical context comprehension, and cinematic knowledge integration. Experimental results demonstrate that Dialogue Director outperforms state-of-the-art methods in script interpretation, physical world understanding, and cinematic principle application, significantly advancing the quality and controllability of dialogue-based story visualization.

Title: AverageLinear: Enhance Long-Term Time series forcasting with simple averaging

Authors: Gaoxiang Zhao, Li Zhou, Xiaoqiang Wang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.20727
Pdf URL: https://arxiv.org/pdf/2412.20727
Copy Paste: [[2412.20727]] AverageLinear: Enhance Long-Term Time series forcasting with simple averaging(https://arxiv.org/abs/2412.20727)
Keywords: robust, transformer
Abstract: Long-term time series analysis aims to forecast long-term trends by examining changes over past and future periods. The intricacy of time series data poses significant challenges for modeling. Models based on the Transformer architecture, through the application of attention mechanisms to channels and sequences, have demonstrated notable performance advantages. In contrast, methods based on convolutional neural networks or linear models often struggle to effectively handle scenarios with a large number of channels. However, our research reveals that the attention mechanism is not the core component responsible for performance enhancement. We have designed an exceedingly simple linear structure, AverageLinear. By employing straightforward channel embedding and averaging operations, this model can effectively capture correlations between channels while maintaining a lightweight architecture. Experiments on real-world datasets show that AverageLinear matches or even surpasses state-of-the-art Transformer-based structures in performance. This indicates that purely linear structures can also endow models with robust predictive power.

Title: Towards nation-wide analytical healthcare infrastructures: A privacy-preserving augmented knee rehabilitation case study

Authors: Boris Bačić, Claudiu Vasile, Chengwei Feng, Marian G. Ciucă
Subjects: cs.CV, cs.AI, cs.CY, cs.MM
Abstract URL: https://arxiv.org/abs/2412.20733
Pdf URL: https://arxiv.org/pdf/2412.20733
Copy Paste: [[2412.20733]] Towards nation-wide analytical healthcare infrastructures: A privacy-preserving augmented knee rehabilitation case study(https://arxiv.org/abs/2412.20733)
Keywords: privacy
Abstract: The purpose of this paper is to contribute towards near-future privacy-preserving big data analytical healthcare platforms capable of processing streamed or uploaded timeseries data or videos from patients. The experimental work includes a real-life knee rehabilitation video dataset capturing a set of exercises ranging from simple and personalised to more general and challenging movements aimed at returning to sport. To convert video from mobile devices into privacy-preserving diagnostic timeseries data, we employed Google MediaPipe pose estimation. The developed proof-of-concept algorithms can augment knee exercise videos by overlaying the patient with stick figure elements while updating a generated timeseries plot with knee angle estimates streamed in CSV format. For patients and physiotherapists, video with a side-by-side timeseries plot that visually indicates potential issues, such as excessive knee flexion, unstable knee movements, or stick figure overlay errors, is possible by setting a-priori knee-angle parameters. To address adherence to the rehabilitation programme and to quantify exercise sets and repetitions, our adaptive algorithm can correctly identify 91.67%-100% of all exercises from side- and front-view videos. Transparent algorithm design for adaptive visual analysis of various knee exercise patterns contributes towards interpretable AI and will inform near-future privacy-preserving, non-vendor-locking, open-source developments for both end-user computing devices and on-premises non-proprietary cloud platforms that can be deployed within the national healthcare system.
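
As a rough illustration of how a knee angle could be derived from pose keypoints, the sketch below computes the angle at the knee from hip, knee, and ankle positions and flags values outside a preset range. It is not the authors' algorithm: the coordinates, the function names, and the 60-180 degree limits are hypothetical, and no pose-estimation library is called.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by the segments b-a and b-c.

    Each point is an (x, y) pair, e.g. normalized image coordinates.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot gives an unsigned angle in [0, 180].
    return math.degrees(math.atan2(abs(cross), dot))


def out_of_range(angle_deg, min_deg=60.0, max_deg=180.0):
    """True if the angle violates the (hypothetical) a-priori limits."""
    return not (min_deg <= angle_deg <= max_deg)


# Hypothetical keypoints for one frame: a fully extended leg.
hip, knee, ankle = (0.5, 0.2), (0.5, 0.5), (0.5, 0.8)
straight = joint_angle(hip, knee, ankle)

# Same hip and knee, with the ankle swung forward to bend the knee.
bent = joint_angle(hip, knee, (0.8, 0.5))
```

Running this per frame and writing one angle per row would yield the kind of timeseries that can be plotted next to the video.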

Title: UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

Authors: Yujie Li, Wenjia Xu, Guangzuo Li, Zijian Yu, Zhiwei Wei, Jiuniu Wang, Mugen Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20742
Pdf URL: https://arxiv.org/pdf/2412.20742
Copy Paste: [[2412.20742]] UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models(https://arxiv.org/abs/2412.20742)
Keywords: extraction
Abstract: The domain gap between remote sensing imagery and natural images has recently received widespread attention and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce \textbf{UniRS}, the first vision-language model \textbf{uni}fying multi-temporal \textbf{r}emote \textbf{s}ensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.

Title: Advancing Parkinson's Disease Progression Prediction: Comparing Long Short-Term Memory Networks and Kolmogorov-Arnold Networks

Authors: Abhinav Roy, Bhavesh Gyanchandani, Aditya Oza, Abhishek Sharma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20744
Pdf URL: https://arxiv.org/pdf/2412.20744
Copy Paste: [[2412.20744]] Advancing Parkinson's Disease Progression Prediction: Comparing Long Short-Term Memory Networks and Kolmogorov-Arnold Networks(https://arxiv.org/abs/2412.20744)
Keywords: generative
Abstract: Parkinson's Disease (PD) is a degenerative neurological disorder that impairs motor and non-motor functions, significantly reducing quality of life and increasing mortality risk. Early and accurate detection of PD progression is vital for effective management and improved patient outcomes. Current diagnostic methods, however, are often costly, time-consuming, and require specialized equipment and expertise. This work proposes an innovative approach to predicting PD progression using regression methods, Long Short-Term Memory (LSTM) networks, and Kolmogorov Arnold Networks (KAN). KAN, utilizing spline-parametrized univariate functions, allows for dynamic learning of activation patterns, unlike traditional linear models. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive tool for evaluating PD symptoms and is commonly used to measure disease progression. Additionally, protein or peptide abnormalities are linked to PD onset and progression. Identifying these associations can aid in predicting disease progression and understanding molecular changes. Comparing multiple models, including LSTM and KAN, this study aims to identify the method that delivers the highest metrics. The analysis reveals that KAN, with its dynamic learning capabilities, outperforms other approaches in predicting PD progression. This research highlights the potential of AI and machine learning in healthcare, paving the way for advanced computational models to enhance clinical predictions and improve patient care and treatment strategies in PD management.

Title: Solar Filaments Detection using Active Contours Without Edges

Authors: Sanmoy Bandyopadhyay, Vaibhav Pant
Subjects: cs.CV, astro-ph.IM, astro-ph.SR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20749
Pdf URL: https://arxiv.org/pdf/2412.20749
Copy Paste: [[2412.20749]] Solar Filaments Detection using Active Contours Without Edges(https://arxiv.org/abs/2412.20749)
Keywords: segmentation
Abstract: In this article, an active contours without edges (ACWE)-based algorithm is proposed for the detection of solar filaments in H-alpha full-disk solar images. The overall algorithm consists of three main image processing steps: image pre-processing, image segmentation, and image post-processing. In this work, contours are initialized on the solar image and allowed to deform based on the energy function. As soon as the contour reaches the boundary of the desired object, the energy function is reduced, and the contour stops evolving. The proposed algorithm has been applied to a few benchmark datasets and compared with the classical technique of object detection. The analysis of the results indicates that the proposed algorithm outperforms the existing classical object detection algorithm.

Title: Attributing Culture-Conditioned Generations to Pretraining Corpora

Authors: Huihan Li, Arnav Goel, Keyu He, Xiang Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20760
Pdf URL: https://arxiv.org/pdf/2412.20760
Copy Paste: [[2412.20760]] Attributing Culture-Conditioned Generations to Pretraining Corpora(https://arxiv.org/abs/2412.20760)
Keywords: generative, large language model
Abstract: In open-ended generative tasks like narrative writing or dialogue, large language models often exhibit cultural biases, showing limited knowledge and generating templated outputs for less prevalent cultures. Recent works show that these biases may stem from uneven cultural representation in pretraining corpora. This work investigates how pretraining leads to biased culture-conditioned generations by analyzing how models associate entities with cultures based on pretraining data patterns. We propose the MEMOed framework (MEMOrization from pretraining document) to determine whether a generation for a culture arises from memorization. Using MEMOed on culture-conditioned generations about food and clothing for 110 cultures, we find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none. Additionally, the model favors generating entities with extraordinarily high frequency regardless of the conditioned culture, reflecting biases toward frequent pretraining terms irrespective of relevance. We hope that the MEMOed framework and our insights will inspire more work on attributing model performance to pretraining data.

Title: Sample Correlation for Fingerprinting Deep Face Recognition

Authors: Jiyang Guan, Jian Liang, Yanbo Wang, Ran He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20768
Pdf URL: https://arxiv.org/pdf/2412.20768
Copy Paste: [[2412.20768]] Sample Correlation for Fingerprinting Deep Face Recognition(https://arxiv.org/abs/2412.20768)
Keywords: defense, attack, steal
Abstract: Face recognition has witnessed remarkable advancements in recent years, thanks to the development of deep learning techniques. However, an off-the-shelf face recognition model as a commercial service could be stolen by model stealing attacks, posing great threats to the rights of the model owner. Model fingerprinting, as a model stealing detection method, aims to verify whether a suspect model is stolen from the victim model, and is gaining more and more attention. Previous methods always utilize transferable adversarial examples as the model fingerprint, but this approach is known to be sensitive to adversarial defense and transfer learning. To address this issue, we consider the pairwise relationship between samples instead and propose a novel yet simple model stealing detection method based on SAmple Correlation (SAC). Specifically, we present SAC-JC, which selects JPEG compressed samples as model inputs and calculates the correlation matrix among their model outputs. Experimental results validate that SAC successfully defends against various model stealing attacks in deep face recognition, encompassing face verification and face emotion recognition, exhibiting the highest performance in terms of AUC, p-value and F1 score. Furthermore, we extend our evaluation of SAC-JC to object recognition datasets including Tiny-ImageNet and CIFAR10, which also demonstrates the superior performance of SAC-JC over previous methods. The code will be available at \url{this https URL}.

Title: Accelerating Energy-Efficient Federated Learning in Cell-Free Networks with Adaptive Quantization

Authors: Afsaneh Mahmoudi, Ming Xiao, Emil Björnson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20785
Pdf URL: https://arxiv.org/pdf/2412.20785
Copy Paste: [[2412.20785]] Accelerating Energy-Efficient Federated Learning in Cell-Free Networks with Adaptive Quantization(https://arxiv.org/abs/2412.20785)
Keywords: federate
Abstract: Federated Learning (FL) enables clients to share learning parameters instead of local data, reducing communication overhead. Traditional wireless networks face latency challenges with FL. In contrast, Cell-Free Massive MIMO (CFmMIMO) can serve multiple clients on shared resources, boosting spectral efficiency and reducing latency for large-scale FL. However, clients' communication resource limitations can hinder the completion of the FL training. To address this challenge, we propose an energy-efficient, low-latency FL framework featuring optimized uplink power allocation for seamless client-server collaboration. Our framework employs an adaptive quantization scheme, dynamically adjusting bit allocation for local gradient updates to reduce communication costs. We formulate a joint optimization problem covering FL model updates, local iterations, and power allocation, solved using sequential quadratic programming (SQP) to balance energy and latency. Additionally, clients use the AdaDelta method for local FL model updates, enhancing local model convergence compared to standard SGD, and we provide a comprehensive analysis of FL convergence with AdaDelta local updates. Numerical results show that, within the same energy and latency budgets, our power allocation scheme outperforms the Dinkelbach and max-sum rate methods by increasing the test accuracy by up to 7% and 19%, respectively. Moreover, for the three power allocation methods, our proposed quantization scheme outperforms AQUILA and LAQ by increasing test accuracy by up to 36% and 35%, respectively.

Title: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

Authors: Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20787
Pdf URL: https://arxiv.org/pdf/2412.20787
Copy Paste: [[2412.20787]] SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity(https://arxiv.org/abs/2412.20787)
Keywords: security, large language model
Abstract: Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and HumanEval assess general LLM performance but lack focus on specific expert domains such as cybersecurity. Previous attempts to create cybersecurity datasets have faced limitations, including insufficient data volume and a reliance on multiple-choice questions (MCQs). To address these gaps, we propose SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in the cybersecurity domain. SecBench includes questions in various formats (MCQs and short-answer questions (SAQs)), at different capability levels (Knowledge Retention and Logical Reasoning), in multiple languages (Chinese and English), and across various sub-domains. The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. In particular, we used powerful yet cost-effective LLMs to (1) label the data and (2) construct a grading agent for automatic evaluation of SAQs. Benchmarking results on 13 SOTA LLMs demonstrate the usability of SecBench, which is arguably the largest and most comprehensive benchmark dataset for LLMs in cybersecurity. More information about SecBench can be found at our website, and the dataset can be accessed via the artifact link.

Title: A Tale of Two Imperatives: Privacy and Explainability

Authors: Supriya Manna, Niladri Sett
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.20798
Pdf URL: https://arxiv.org/pdf/2412.20798
Copy Paste: [[2412.20798]] A Tale of Two Imperatives: Privacy and Explainability(https://arxiv.org/abs/2412.20798)
Keywords: privacy, explainability
Abstract: Deep learning's preponderance across scientific domains has reshaped high-stakes decision-making, making it essential to follow rigorous operational frameworks that include both Right-to-Privacy (RTP) and Right-to-Explanation (RTE). This paper examines the complexities of combining these two requirements. For RTP, we focus on differential privacy (DP), which is considered the current gold standard for privacy-preserving machine learning due to its strong quantitative guarantee of privacy. For RTE, we focus on post-hoc explainers: they are the go-to option for model auditing as they operate independently of model training. We formally investigate DP models and various commonly used post-hoc explainers: how to evaluate these explainers subject to RTP, and how DP models and these explainers intrinsically interact. Furthermore, our work sheds light on how RTP and RTE can be effectively combined in high-stakes applications. Our study concludes by outlining an industrial software pipeline, with the example of a widely used use case, that respects both RTP and RTE requirements.

Title: VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

Authors: Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20800
Pdf URL: https://arxiv.org/pdf/2412.20800
Copy Paste: [[2412.20800]] VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control(https://arxiv.org/abs/2412.20800)
Keywords: diffusion
Abstract: While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is this https URL.

Title: Frequency-aware Event Cloud Network

Authors: Hongwei Ren, Fei Ma, Xiaopeng Lin, Yuetong Fang, Hongxiang Huang, Yulong Huang, Yue Zhou, Haotian Fu, Ziyi Yang, Fei Richard Yu, Bojun Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20803
Pdf URL: https://arxiv.org/pdf/2412.20803
Copy Paste: [[2412.20803]] Frequency-aware Event Cloud Network(https://arxiv.org/abs/2412.20803)
Keywords: extraction
Abstract: Event cameras are biologically inspired sensors that emit events asynchronously with remarkable temporal resolution, garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach satisfactory performance but introduce time-consuming transformations and bulky models and sacrifice fine-grained temporal information. Alternatively, the Point Cloud representation shows promise in addressing these weaknesses, but it ignores polarity information, and its models have limited proficiency in abstracting long-term events' features. In this paper, we propose a frequency-aware network named FECNet that leverages Event Cloud representations. FECNet fully utilizes the 2S-1T-1P Event Cloud by innovating the event-based Group and Sampling module. To accommodate the long event sequences from Event Cloud, FECNet performs feature extraction in the frequency domain via the Fourier transform. This approach substantially curbs the explosion of Multiply Accumulate Operations (MACs) while effectively abstracting spatial-temporal features. We conducted extensive experiments on event-based object classification, action recognition, and human pose estimation tasks, and the results substantiate the effectiveness and efficiency of FECNet.

Title: Two Heads Are Better Than One: Averaging along Fine-Tuning to Improve Targeted Transferability

Authors: Hui Zeng, Sanshuai Cui, Biwei Chen, Anjie Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20807
Pdf URL: https://arxiv.org/pdf/2412.20807
Copy Paste: [[2412.20807]] Two Heads Are Better Than One: Averaging along Fine-Tuning to Improve Targeted Transferability(https://arxiv.org/abs/2412.20807)
Keywords: attack
Abstract: Although targeted attacks require much longer optimization time than untargeted attacks, their transferability is still far from satisfactory. Recent studies reveal that fine-tuning an existing adversarial example (AE) in feature space can efficiently boost its targeted transferability. However, existing fine-tuning schemes only utilize the endpoint and ignore the valuable information in the fine-tuning trajectory. Noting that the vanilla fine-tuning trajectory tends to oscillate around the periphery of a flat region of the loss surface, we propose averaging over the fine-tuning trajectory to pull the crafted AE towards a more centered region. We compare the proposed method with existing fine-tuning schemes by integrating them with state-of-the-art targeted attacks in various attacking scenarios. Experimental results uphold the superiority of the proposed method in boosting targeted transferability. The code is available at this http URL.
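
The intuition behind averaging along a trajectory can be shown with a toy example. The sketch below is not the paper's attack or its feature-space fine-tuning: it assumes a made-up 2-D trajectory that oscillates around a known center and compares the last iterate with the mean of all iterates. All names and numbers are hypothetical.

```python
import math


def average_trajectory(points):
    """Coordinate-wise mean of a list of equally sized points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Assumed center of the flat region, and iterates that bounce around it.
center = [1.0, -2.0]
trajectory = []
for t in range(10):
    sign = 1.0 if t % 2 == 0 else -1.0
    trajectory.append([center[0] + sign * 0.5, center[1] + sign * 0.25])

endpoint = trajectory[-1]                  # what endpoint-only schemes keep
averaged = average_trajectory(trajectory)  # what trajectory averaging keeps
```

In this toy setting the averaged point coincides with the center while the endpoint stays on the periphery, which is the effect the abstract appeals to.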

Title: Length-Aware DETR for Robust Moment Retrieval

Authors: Seojeong Park, Jiho Choi, Kyungjune Baek, Hyunjung Shim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20816
Pdf URL: https://arxiv.org/pdf/2412.20816
Copy Paste: [[2412.20816]] Length-Aware DETR for Robust Moment Retrieval(https://arxiv.org/abs/2412.20816)
Keywords: robust
Abstract: Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is significantly growing. Recent DETR-based models have made notable advances in performance but still struggle with accurately localizing short moments. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix employs two augmentation strategies: ForegroundMix and BackgroundMix, each enhancing the feature representations of the foreground and background, respectively. Additionally, our analysis of prediction bias revealed that short moments particularly struggle with accurately predicting their center positions of moments. To address this, we propose a Length-Aware Decoder, which conditions length through a novel bipartite matching process. Our extensive studies demonstrate the efficacy of our length-aware approach, especially in localizing short moments, leading to improved overall performance. Our method surpasses state-of-the-art DETR-based methods on benchmark datasets, achieving the highest R1 and mAP on QVHighlights and the highest R1@0.7 on TACoS and Charades-STA (such as a 2.46% gain in R1@0.7 and a 2.57% gain in mAP average for QVHighlights). The code is available at this https URL.

Title: Disentangling Preference Representation and Text Generation for Efficient Individual Preference Alignment

Authors: Jianfei Zhang, Jun Bai, Bei Li, Yanmeng Wang, Rumei Li, Chenghua Lin, Wenge Rong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20834
Pdf URL: https://arxiv.org/pdf/2412.20834
Copy Paste: [[2412.20834]] Disentangling Preference Representation and Text Generation for Efficient Individual Preference Alignment(https://arxiv.org/abs/2412.20834)
Keywords: large language model
Abstract: Aligning Large Language Models (LLMs) with general human preferences has proven crucial in improving the interaction quality between LLMs and humans. However, human values are inherently diverse among different individuals, making it insufficient to align LLMs solely with general preferences. To address this, personalizing LLMs according to individual feedback emerges as a promising solution. Nonetheless, this approach presents challenges in terms of the efficiency of alignment algorithms. In this work, we introduce a flexible paradigm for individual preference alignment. Our method fundamentally improves efficiency by disentangling preference representation from text generation in LLMs. We validate our approach across multiple text generation tasks and demonstrate that it achieves alignment quality on par with or better than PEFT-based methods, while reducing the additional training time for each new individual preference by 80% to 90% compared with them.

Title: Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation

Authors: Shubh Singhal, Raül Pérez-Gonzalo, Andreas Espersen, Antonio Agudo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20838
Pdf URL: https://arxiv.org/pdf/2412.20838
Copy Paste: [[2412.20838]] Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation(https://arxiv.org/abs/2412.20838)
Keywords: segmentation
Abstract: Accurate segmentation of wind turbine blade (WTB) images is critical for effective assessments, as it directly influences the performance of automated damage detection systems. Despite advancements in large universal vision models, these models often underperform in domain-specific tasks like WTB segmentation. To address this, we extend Intrinsic LoRA for image segmentation, and propose a novel dual-space augmentation strategy that integrates both image-level and latent-space augmentations. The image-space augmentation is achieved through linear interpolation between image pairs, while the latent-space augmentation is accomplished by introducing a noise-based latent probabilistic model. Our approach significantly boosts segmentation accuracy, surpassing current state-of-the-art methods in WTB image segmentation.

Title: Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory

Authors: Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20846
Pdf URL: https://arxiv.org/pdf/2412.20846
Copy Paste: [[2412.20846]] Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory(https://arxiv.org/abs/2412.20846)
Keywords: large language model
Abstract: Large language models (LLMs) have shown promise as potential knowledge bases, yet they often struggle with question-answering tasks and are prone to hallucinations. While previous research attributes these issues to knowledge gaps in the model's parameters, our investigation reveals a different phenomenon: LLMs often retain correct knowledge even when generating incorrect answers. Through analysis of model's internal representations, we find that correct answers frequently appear among high-probability tokens despite not being selected as final outputs. Based on this observation, we introduce Hits@k, a new metric to assess knowledge retention independent of expression accuracy. Our extensive experiments demonstrate that LLMs store significantly more knowledge than their QA performance suggests. Building on these findings, we develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge. Experiments on both open-domain and specific-domain datasets show consistent improvements, with accuracy gains of up to 11.8% on DBPedia and 6.3% on IMDB, without requiring model retraining.
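
A minimal sketch of a Hits@k-style measurement follows. It assumes single-token answers and access to the model's next-token probabilities. The function name hits_at_k and the sample distributions are hypothetical, and the paper's exact definition may differ.

```python
def hits_at_k(token_probs, gold_tokens, k):
    """Fraction of questions whose gold answer is among the k most probable tokens."""
    hits = 0
    for probs, gold in zip(token_probs, gold_tokens):
        # Rank candidate tokens by probability, highest first, and keep k of them.
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += gold in top_k
    return hits / len(gold_tokens)


# Hypothetical next-token distributions for two questions.
token_probs = [
    {"Paris": 0.35, "Lyon": 0.40, "Rome": 0.25},  # gold answer ranked second
    {"1969": 0.70, "1972": 0.20, "1959": 0.10},   # gold answer ranked first
]
gold_tokens = ["Paris", "1969"]

print(hits_at_k(token_probs, gold_tokens, k=1))
print(hits_at_k(token_probs, gold_tokens, k=2))
```

With k=1 the score equals ordinary greedy-decoding accuracy. Raising k counts answers the model ranked highly but did not emit, which is the gap the abstract describes.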

Title: Enhancing Annotated Bibliography Generation with LLM Ensembles

Authors: Sergio Bermejo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20864
Pdf URL: https://arxiv.org/pdf/2412.20864
Copy Paste: [[2412.20864]] Enhancing Annotated Bibliography Generation with LLM Ensembles(https://arxiv.org/abs/2412.20864)
Keywords: large language model
Abstract: This work proposes a novel approach to enhancing annotated bibliography generation through Large Language Model (LLM) ensembles. In particular, multiple LLMs in different roles -- controllable text generation, evaluation, and summarization -- are introduced and validated using a systematic methodology to enhance model performance in scholarly tasks. Output diversity among the ensemble that generates text is obtained using different LLM parameters, followed by an LLM acting as a judge to assess relevance, accuracy, and coherence. Responses selected by several combining strategies are then merged and refined through summarization and redundancy removal techniques. The preliminary experimental validation demonstrates that the combined outputs from the LLM ensemble improve coherence and relevance compared to individual responses, leading to a 38% improvement in annotation quality and a 51% reduction in content redundancy, thus highlighting the potential for automating complex scholarly tasks while maintaining high-quality standards.

Title: SoftPatch+: Fully Unsupervised Anomaly Classification and Segmentation

Authors: Chengjie Wang, Xi Jiang, Bin-Bin Gao, Zhenye Gan, Yong Liu, Feng Zheng, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20870
Pdf URL: https://arxiv.org/pdf/2412.20870
Copy Paste: [[2412.20870]] SoftPatch+: Fully Unsupervised Anomaly Classification and Segmentation(https://arxiv.org/abs/2412.20870)
Keywords: robust, segmentation
Abstract: Although mainstream unsupervised anomaly detection (AD) algorithms (including image-level classification and pixel-level segmentation) perform well on academic datasets, their performance is limited in practical applications due to the idealized experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This paper is the first to consider fully unsupervised industrial anomaly detection (i.e., unsupervised AD with noisy data). To solve this problem, we propose memory-based unsupervised AD methods, SoftPatch and SoftPatch+, which efficiently denoise the data at the patch level. Noise discriminators are utilized to generate outlier scores for patch-level noise elimination before coreset construction. The scores are then stored in the memory bank to soften the anomaly detection boundary. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in the coreset, and SoftPatch+ has more robust performance, which is particularly useful in real-world industrial inspection scenarios with high levels of noise (from 10% to 40%). Comprehensive experiments conducted in diverse noise scenarios demonstrate that both SoftPatch and SoftPatch+ outperform the state-of-the-art AD methods on the MVTecAD, ViSA, and BTAD benchmarks. Furthermore, the performance of SoftPatch and SoftPatch+ is comparable to that of the noise-free methods in the conventional unsupervised AD setting. The code of the proposed methods can be found at this https URL.

Title: Attention Is All You Need For Mixture-of-Depths Routing

Authors: Advait Gadhikar, Souptik Kumar Majumdar, Niclas Popp, Piyapat Saranrittichai, Martin Rapp, Lukas Schott
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20875
Pdf URL: https://arxiv.org/pdf/2412.20875
Copy Paste: [[2412.20875]] Attention Is All You Need For Mixture-of-Depths Routing(https://arxiv.org/abs/2412.20875)
Keywords: transformer
Abstract: Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanism A-MoD that leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing, A-MoD allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pretrained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to 2% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines. Furthermore, A-MoD improves the MoD training convergence, leading to up to 2x faster transfer learning.
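
The routing idea can be sketched without any trainable parameters. The scoring rule below (average attention a token receives across queries) and the name route_tokens are assumptions for illustration. The abstract only states that the preceding layer's attention map drives the routing decision, not how it is aggregated.

```python
def route_tokens(attention, capacity):
    """Return indices of the `capacity` tokens that receive the most attention.

    attention[q][k] is the weight that query token q places on key token k in
    the preceding layer (a single head, or an average over heads).
    """
    num_tokens = len(attention)
    # Assumed importance score: mean attention each token receives.
    scores = [sum(row[k] for row in attention) / num_tokens for k in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:capacity])  # restore original token order


# Toy 3-token attention map; each row sums to 1.
attention = [
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
    [0.4, 0.1, 0.5],
]
selected = route_tokens(attention, capacity=2)
print(selected)
```

Tokens in selected would be processed by the current layer. The remaining tokens would bypass it through the residual connection.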

Title: LiDAR-Camera Fusion for Video Panoptic Segmentation without Video Training

Authors: Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi, Mohammad Rahmati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20881
Pdf URL: https://arxiv.org/pdf/2412.20881
Copy Paste: [[2412.20881]] LiDAR-Camera Fusion for Video Panoptic Segmentation without Video Training(https://arxiv.org/abs/2412.20881)
Keywords: segmentation
Abstract: Panoptic segmentation, which combines instance and semantic segmentation, has gained a lot of attention in autonomous vehicles, due to its comprehensive representation of the scene. This task can be applied to cameras and LiDAR sensors, but there has been limited focus on combining both sensors to enhance image panoptic segmentation (PS). Although previous research has acknowledged the benefit of 3D data on camera-based scene perception, no specific study has explored the influence of 3D data on image and video panoptic segmentation (VPS). This work introduces a feature fusion module that enhances PS and VPS by fusing LiDAR and image data for autonomous vehicles. We also illustrate that, in addition to this fusion, our proposed model, which utilizes two simple modifications, can deliver even higher-quality VPS without being trained on video data. The results demonstrate a substantial improvement in both the image and video panoptic segmentation evaluation metrics by up to 5 points.

Title: DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models

Authors: Xiaolin Hu, Xiang Cheng, Peiyu Liu, Wei Liu, Jian Luan, Bin Wang, Yong Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20891
Pdf URL: https://arxiv.org/pdf/2412.20891
Copy Paste: [[2412.20891]] DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models(https://arxiv.org/abs/2412.20891)
Keywords: large language model
Abstract: Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. However, low-rank approximation in two-dimensional space fails to capture high-dimensional structures within the target matrix. Recently, tensor decomposition methods have been explored for fine-tuning LLMs, leveraging their ability to extract structured information. Yet, these approaches primarily rely on random initialization, and the impact of initialization on tensor adaptation remains underexplored. In this paper, we reveal that random initialization significantly diverges from the validation loss achieved by full fine-tuning. To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights for effective initialization in fine-tuning LLMs. Additionally, we introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization. Experiments on commonsense and arithmetic reasoning tasks show that DoTA outperforms random initialization methods with fewer parameters. QDoTA further reduces memory consumption and achieves comparable performance to DoTA on commonsense reasoning tasks. We will release our code to support future research.

Title: Towards Compatible Fine-tuning for Vision-Language Model Updates

Authors: Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20895
Pdf URL: https://arxiv.org/pdf/2412.20895
Copy Paste: [[2412.20895]] Towards Compatible Fine-tuning for Vision-Language Model Updates(https://arxiv.org/abs/2412.20895)
Keywords: robust
Abstract: So far, efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying foundation model is updated, are these plug-and-play modules still effective? In this paper, we first conduct a detailed analysis of various fine-tuning methods on the CLIP in terms of their compatibility with model updates. The study reveals that many high-performing fine-tuning methods fail to be compatible with the upgraded models. To address this, we propose a novel approach, Class-conditioned Context Optimization (ContCoOp), which integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder. Consequently, the prompts can dynamically adapt to the changes in embedding space (due to model updates), ensuring continued effectiveness. Extensive experiments over 15 datasets show that our ContCoOp achieves the highest compatibility over the baseline methods, and exhibits robust out-of-distribution generalization.

Title: DDIM sampling for Generative AIBIM, a faster intelligent structural design framework

Authors: Zhili He, Yu-Hsing Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20899
Pdf URL: https://arxiv.org/pdf/2412.20899
Copy Paste: [[2412.20899]] DDIM sampling for Generative AIBIM, a faster intelligent structural design framework(https://arxiv.org/abs/2412.20899)
Keywords: diffusion, generative
Abstract: Generative AIBIM, a successful structural design pipeline, has proven its ability to intelligently generate high-quality, diverse, and creative shear wall designs that are tailored to specific physical conditions. However, the current module of Generative AIBIM that generates designs, known as the physics-based conditional diffusion model (PCDM), necessitates 1000 iterations for each generation due to its reliance on the denoising diffusion probabilistic model (DDPM) sampling process. This leads to a time-consuming and computationally demanding generation process. To address this issue, this study introduces the denoising diffusion implicit model (DDIM), an accelerated generation method that replaces the DDPM sampling process in PCDM. While the original DDIM was designed for DDPM and the optimization process of PCDM differs from that of DDPM, this paper designs "DDIM sampling for PCDM," which modifies the original DDIM formulations to adapt to the optimization process of PCDM. Experimental results demonstrate that DDIM sampling for PCDM can accelerate the generation process of the original PCDM by a factor of 100 while maintaining the same visual quality in the generated results. This study effectively showcases the effectiveness of DDIM sampling for PCDM in expediting intelligent structural design. Furthermore, this paper reorganizes the contents of DDIM, focusing on the practical usage of DDIM. This change is particularly meaningful for researchers who may not possess a strong background in machine learning theory but are interested in utilizing the tool effectively.
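
For orientation, here is a sketch of standard deterministic DDIM sampling (eta = 0) on a scalar sample. It is the textbook formulation, not the paper's modified "DDIM sampling for PCDM". The function names are hypothetical, and the choice of 10 sampling steps out of 1000 is an illustrative assumption.

```python
import math


def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced, descending subsequence of the training timesteps."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_sample_steps]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from timestep t to the previous kept timestep."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the previous timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


steps = ddim_timesteps(1000, 10)
print(steps)
```

The sampler visits only the timesteps in steps, so the denoising network is evaluated 10 times instead of 1000. Because no fresh noise is injected at each step, such large jumps remain well defined.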

Title: ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Authors: Ting Zhang, Zhiqiang Yuan, Yeshuang Zhu, Jinchao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20901
Pdf URL: https://arxiv.org/pdf/2412.20901
Copy Paste: [[2412.20901]] ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation(https://arxiv.org/abs/2412.20901)
Keywords: diffusion
Abstract: High-quality animated stickers usually contain transparent channels, which are often ignored by current video generation models. To generate fine-grained animated transparency channels, existing methods can be roughly divided into video matting algorithms and diffusion-based algorithms. The methods based on video matting have poor performance in dealing with semi-open areas in stickers, while diffusion-based methods are often used to model a single image, which will lead to local flicker when modeling animated stickers. In this paper, we firstly propose an ILDiff method to generate animated transparent channels through implicit layout distillation, which solves the problems of semi-open area collapse and no consideration of temporal information in existing methods. Secondly, we create the Transparent Animated Sticker Dataset (TASD), which contains 0.32M high-quality samples with transparent channel, to provide data support for related fields. Extensive experiments demonstrate that ILDiff can produce finer and smoother transparent channels compared to other methods such as Matting Anything and Layer Diffusion. Our code and dataset will be released at link this https URL.

Title: WalkVLM:Aid Visually Impaired People Walking by Vision Language Model

Authors: Zhiqiang Yuan, Ting Zhang, Jiapei Zhang, Jie Zhou, Jinchao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20903
Pdf URL: https://arxiv.org/pdf/2412.20903
Copy Paste: [[2412.20903]] WalkVLM:Aid Visually Impaired People Walking by Vision Language Model(https://arxiv.org/abs/2412.20903)
Keywords: fair
Abstract: Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), employing VLMs to improve this field has emerged as a popular research topic. However, most existing methods are studied on self-built question-answering datasets, lacking a unified training and testing benchmark for walk guidance. Moreover, the blind walking task requires real-time streaming video parsing and concise yet informative reminders, which poses a great challenge for VLMs that suffer from redundant responses and low inference efficiency. In this paper, we first release a diverse, extensive, and unbiased walking awareness dataset, containing 12k video-manual annotation pairs from Europe and Asia, to provide a fair training and testing benchmark for the blind walking task. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for the blind walking task and verified the advantages of WalkVLM in stream video processing for this task compared to other VLMs. Our dataset and code will be released at anonymous link this https URL.

Title: Low-Light Image Enhancement via Generative Perceptual Priors

Authors: Han Zhou, Wei Dong, Xiaohong Liu, Yulun Zhang, Guangtao Zhai, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20916
Pdf URL: https://arxiv.org/pdf/2412.20916
Copy Paste: [[2412.20916]] Low-Light Image Enhancement via Generative Perceptual Priors(https://arxiv.org/abs/2412.20916)
Keywords: diffusion, transformer, generative
Abstract: Although significant progress has been made in enhancing visibility, retrieving texture details, and mitigating noise in Low-Light (LL) images, the challenge persists in applying current Low-Light Image Enhancement (LLIE) methods to real-world scenarios, primarily due to the diverse illumination conditions encountered. Furthermore, the quest for generating enhancements that are visually realistic and attractive remains an underexplored realm. In response to these challenges, we introduce a novel LLIE framework with the guidance of Generative Perceptual Priors (GPP-LLIE) derived from vision-language models (VLMs). Specifically, we first propose a pipeline that guides VLMs to assess multiple visual attributes of the LL image and quantify the assessment to output the global and local perceptual priors. Subsequently, to incorporate these generative perceptual priors to benefit LLIE, we introduce a transformer-based backbone in the diffusion process, and develop a new layer normalization (GPP-LN) and an attention mechanism (LPP-Attn) guided by global and local perceptual priors. Extensive experiments demonstrate that our model outperforms current SOTA methods on paired LL datasets and exhibits superior generalization on real-world data. The code is released at this https URL.

Title: HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization

Authors: Zijie Fang, Yifeng Wang, Peizhang Xie, Zhi Wang, Yongbing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20924
Pdf URL: https://arxiv.org/pdf/2412.20924
Copy Paste: [[2412.20924]] HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization(https://arxiv.org/abs/2412.20924)
Keywords: segmentation
Abstract: Tissue semantic segmentation is one of the key tasks in computational pathology. To avoid the expensive and laborious acquisition of pixel-level annotations, a wide range of studies attempt to adopt the class activation map (CAM), a weakly-supervised learning scheme, to achieve pixel-level tissue segmentation. However, CAM-based methods are prone to suffer from under-activation and over-activation issues, leading to poor segmentation performance. To address this problem, we propose a novel weakly-supervised semantic segmentation framework for histopathological images based on image-mixing synthesis and consistency regularization, dubbed HisynSeg. Specifically, synthesized histopathological images with pixel-level masks are generated for fully-supervised model training, where two synthesis strategies are proposed based on Mosaic transformation and Bézier mask generation. Besides, an image filtering module is developed to guarantee the authenticity of the synthesized images. In order to further avoid the model overfitting to the occasional synthesis artifacts, we additionally propose a novel self-supervised consistency regularization, which enables the real images without segmentation masks to supervise the training of the segmentation model. By integrating the proposed techniques, the HisynSeg framework successfully transforms the weakly-supervised semantic segmentation problem into a fully-supervised one, greatly improving the segmentation accuracy. Experimental results on three datasets prove that the proposed method achieves a state-of-the-art performance. Code is available at this https URL.

Title: Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Authors: Junxiao Xue, Quan Deng, Fei Yu, Yanhao Wang, Jun Wang, Yuehua Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20927
Pdf URL: https://arxiv.org/pdf/2412.20927
Copy Paste: [[2412.20927]] Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering(https://arxiv.org/abs/2412.20927)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still face challenges in accurately identifying and counting objects and determining their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM's capacity to handle tasks requiring precise visual descriptions, especially in scenarios with challenging perspectives, such as aerial views or scenes with dense object arrangements. Finally, we conduct extensive experiments on the VG-150 dataset that focuses on first-person visual understanding and the AUG dataset that involves aerial imagery. The results show that our approach consistently outperforms existing MLLMs in VQA tasks, which stands out in recognizing, localizing, and quantifying objects in different spatial contexts and provides more accurate visual descriptions.

Title: Generalizing in Net-Zero Microgrids: A Study with Federated PPO and TRPO

Authors: Nicolas M Cuadrado Avila, Samuel Horváth, Martin Takáč
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20946
Pdf URL: https://arxiv.org/pdf/2412.20946
Copy Paste: [[2412.20946]] Generalizing in Net-Zero Microgrids: A Study with Federated PPO and TRPO(https://arxiv.org/abs/2412.20946)
Keywords: privacy, federate
Abstract: This work addresses the challenge of optimal energy management in microgrids through a collaborative and privacy-preserving framework. We propose the FedTRPO methodology, which integrates Federated Learning (FL) and Trust Region Policy Optimization (TRPO) to manage distributed energy resources (DERs) efficiently. Using a customized version of the CityLearn environment and synthetically generated data, we simulate designed net-zero energy scenarios for microgrids composed of multiple buildings. Our approach emphasizes reducing energy costs and carbon emissions while ensuring privacy. Experimental results demonstrate that FedTRPO is comparable with state-of-the-art federated RL methodologies without hyperparameter tuning. The proposed framework highlights the feasibility of collaborative learning for achieving optimal control policies in energy systems, advancing the goals of sustainable and efficient smart grids.

Title: GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search

Authors: Matan Ben-Tov, Mahmood Sharif
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.20953
Pdf URL: https://arxiv.org/pdf/2412.20953
Copy Paste: [[2412.20953]] GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search(https://arxiv.org/abs/2412.20953)
Keywords: attack, robust
Abstract: Dense embedding-based text retrieval–retrieval of relevant passages from corpora via deep learning encodings–has emerged as a powerful method attaining state-of-the-art search results and popularizing the use of Retrieval Augmented Generation (RAG). Still, like other search methods, embedding-based retrieval may be susceptible to search-engine optimization (SEO) attacks, where adversaries promote malicious content by introducing adversarial passages to corpora. To faithfully assess and gain insights into the susceptibility of such systems to SEO, this work proposes the GASLITE attack, a mathematically principled gradient-based search method for generating adversarial passages without relying on the corpus content or modifying the model. Notably, GASLITE's passages (1) carry adversary-chosen information while (2) achieving high retrieval ranking for a selected query distribution when inserted to corpora. We use GASLITE to extensively evaluate retrievers' robustness, testing nine advanced models under varied threat models, while focusing on realistic adversaries targeting queries on a specific concept (e.g., a public figure). We found GASLITE consistently outperformed baselines by ≥140% success rate, in all settings. Particularly, adversaries using GASLITE require minimal effort to manipulate search results–by injecting a negligible amount of adversarial passages (≤0.0001% of the corpus), they could make them visible in the top-10 results for 61-100% of unseen concept-specific queries against most evaluated models. Inspecting variance in retrievers' robustness, we identify key factors that may contribute to models' susceptibility to SEO, including specific properties in the embedding space's geometry.

Title: Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction

Authors: Yuan Mi, Pu Ren, Hongteng Xu, Hongsheng Liu, Zidong Wang, Yike Guo, Ji-Rong Wen, Hao Sun, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20962
Pdf URL: https://arxiv.org/pdf/2412.20962
Copy Paste: [[2412.20962]] Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction(https://arxiv.org/abs/2412.20962)
Keywords: interpretability
Abstract: Data-centric methods have shown great potential in understanding and predicting spatiotemporal dynamics, enabling better design and control of the object system. However, pure deep learning models often lack interpretability, fail to obey intrinsic physics, and struggle to cope with various domains. While geometry-based methods, e.g., graph neural networks (GNNs), have been proposed to further tackle these challenges, they still need to find the implicit physical laws from large datasets and rely excessively on rich labeled data. In this paper, we introduce the conservation-informed GNN (CiGNN), an end-to-end explainable learning framework, to learn spatiotemporal dynamics based on limited training data. The network is designed to conform to the general conservation law via symmetry, where conservative and non-conservative information passes over a multiscale space enhanced by a latent temporal marching strategy. The efficacy of our model has been verified in various spatiotemporal systems based on synthetic and real-world datasets, showing superiority over baseline models. Results demonstrate that CiGNN exhibits remarkable accuracy and generalization ability, and is readily applicable to learning for prediction of various spatiotemporal dynamics in a spatial domain with complex geometry.

Title: AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies

Authors: Yibo Wen, Chenwei Xu, Jerry Yao-Chieh Hu, Han Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20984
Pdf URL: https://arxiv.org/pdf/2412.20984
Copy Paste: [[2412.20984]] AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies(https://arxiv.org/abs/2412.20984)
Keywords: diffusion
Abstract: We present a three-stage framework for training deep learning models specializing in antibody sequence-structure co-design. We first pre-train a language model using millions of antibody sequences. Then, we employ the learned representations to guide the training of a diffusion model for joint optimization over both sequence and structure of antibodies. During the final alignment stage, we optimize the model to favor antibodies with low repulsion and high attraction to the antigen binding site, enhancing the rationality and functionality of the designs. To mitigate conflicting energy preferences, we extend AbDPO (Antibody Direct Preference Optimization) to guide the model towards Pareto optimality under multiple energy-based alignment objectives. Furthermore, we adopt an iterative learning paradigm with temperature scaling, enabling the model to benefit from diverse online datasets without requiring additional data. In practice, our proposed methods achieve high stability and efficiency in producing a better Pareto front of antibody designs compared to top samples generated by baselines and previous alignment techniques. Through extensive experiments, we showcase the superior performance of our methods in consistently generating nature-like antibodies with high binding affinity.

Title: RobustBlack: Challenging Black-Box Adversarial Attacks on State-of-the-Art Defenses

Authors: Mohamed Djilani, Salah Ghamizi, Maxime Cordy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20987
Pdf URL: https://arxiv.org/pdf/2412.20987
Copy Paste: [[2412.20987]] RobustBlack: Challenging Black-Box Adversarial Attacks on State-of-the-Art Defenses(https://arxiv.org/abs/2412.20987)
Keywords: defense, attack, robust
Abstract: Although adversarial robustness has been extensively studied in white-box settings, recent advances in black-box attacks (including transfer- and query-based approaches) are primarily benchmarked against weak defenses, leaving a significant gap in the evaluation of their effectiveness against more recent and moderately robust models (e.g., those featured in the RobustBench leaderboard). In this paper, we question this lack of attention from black-box attacks to robust models. We establish a framework to evaluate the effectiveness of recent black-box attacks against both top-performing and standard defense mechanisms, on the ImageNet dataset. Our empirical evaluation reveals the following key findings: (1) the most advanced black-box attacks struggle to succeed even against simple adversarially trained models; (2) robust models that are optimized to withstand strong white-box attacks, such as AutoAttack, also exhibit enhanced resilience against black-box attacks; and (3) robustness alignment between the surrogate models and the target model is a key factor in the success rate of transfer-based attacks.

Title: Efficiently Serving LLM Reasoning Programs with Certaindex

Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.20993
Pdf URL: https://arxiv.org/pdf/2412.20993
Copy Paste: [[2412.20993]] Efficiently Serving LLM Reasoning Programs with Certaindex(https://arxiv.org/abs/2412.20993)
Keywords: large language model
Abstract: The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.

Title: KARPA: A Training-free Method of Adapting Knowledge Graph as References for Large Language Model's Reasoning Path Aggregation

Authors: Siyuan Fang, Kaijing Ma, Tianyu Zheng, Xinrun Du, Ningxuan Lu, Ge Zhang, Qingkun Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20995
Pdf URL: https://arxiv.org/pdf/2412.20995
Copy Paste: [[2412.20995]] KARPA: A Training-free Method of Adapting Knowledge Graph as References for Large Language Model's Reasoning Path Aggregation(https://arxiv.org/abs/2412.20995)
Keywords: large language model
Abstract: Large language models (LLMs) demonstrate exceptional performance across a variety of tasks, yet they are often affected by hallucinations and the timeliness of knowledge. Leveraging knowledge graphs (KGs) as external knowledge sources has emerged as a viable solution, but existing methods for LLM-based knowledge graph question answering (KGQA) are often limited by step-by-step decision-making on KGs, restricting the global planning and reasoning capabilities of LLMs, or they require fine-tuning or pre-training on specific KGs. To address these challenges, we propose Knowledge graph Assisted Reasoning Path Aggregation (KARPA), a novel framework that harnesses the global planning abilities of LLMs for efficient and accurate KG reasoning. KARPA operates in three steps: pre-planning relation paths using the LLM's global planning capabilities, matching semantically relevant paths via an embedding model, and reasoning over these paths to generate answers. Unlike existing KGQA methods, KARPA avoids stepwise traversal, requires no additional training, and is adaptable to various LLM architectures. Extensive experimental results show that KARPA achieves state-of-the-art performance in KGQA tasks, delivering both high efficiency and accuracy. Our code will be available on GitHub.
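
Editor's sketch (not from the paper): the second of the three steps above, matching a pre-planned relation path against candidate knowledge-graph paths by embedding similarity, might look roughly like the following. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the relation names are hypothetical; the abstract does not specify either.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words token counts.

    Assumption for illustration only; a real system would call a
    sentence-embedding model here.
    """
    cleaned = text.lower().replace(".", " ").replace("_", " ")
    return Counter(cleaned.split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_paths(planned_path, candidate_paths, top_k=2):
    """Return the top_k candidate paths most similar to the planned path."""
    query = embed(planned_path)
    ranked = sorted(
        candidate_paths,
        key=lambda path: cosine(query, embed(path)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical relation paths, used only to exercise the function.
planned = "film director -> director birthplace"
candidates = ["person.spouse", "film.genre", "film.director person.birthplace"]
best = match_paths(planned, candidates, top_k=2)
```

The matched paths would then be handed back to the LLM for the final reasoning step; that step is outside the scope of this sketch.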

Title: Plug-and-Play Training Framework for Preference Optimization

Authors: Jingyuan Ma, Rui Li, Zheng Li, Lei Sha, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.20996
Pdf URL: https://arxiv.org/pdf/2412.20996
Copy Paste: [[2412.20996]] Plug-and-Play Training Framework for Preference Optimization(https://arxiv.org/abs/2412.20996)
Keywords: large language model
Abstract: Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) across a wide range of tasks, including dialogue and question-answering. However, current methods fail to account for the varying difficulty levels of training samples during preference optimization, leading to mediocre performance in tasks with high accuracy requirements, particularly in mathematical reasoning. To address this limitation, we propose a novel training framework, which employs multiple sampling to analyze output distributions, assign different weights to samples, and incorporate these weights into the preference optimization process. This plug-and-play approach enables LLMs to prioritize challenging examples during training, improving learning efficiency. Experimental results demonstrate that our framework integrates seamlessly with various preference optimization methods and achieves consistent improvements in mathematical reasoning tasks.
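
Editor's sketch (not from the paper): one way a per-sample weight could be folded into a DPO-style loss. The DPO term below is the standard per-pair objective; the weighting rule (one plus the error rate over several sampled answers) is purely an assumption for illustration, since the abstract does not state how the weights are derived.

```python
import math


def difficulty_weight(sampled_correct):
    """Hypothetical weight: 1 + fraction of sampled answers that were wrong.

    `sampled_correct` is a list of booleans, one per sampled model answer.
    This rule is an assumption, not the paper's weighting scheme.
    """
    if not sampled_correct:
        raise ValueError("need at least one sampled answer")
    error_rate = 1.0 - sum(sampled_correct) / len(sampled_correct)
    return 1.0 + error_rate


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    if margin < -30.0:
        # For very negative margins, -log(sigmoid(m)) is approximately -m;
        # this branch avoids overflow in exp().
        return -margin
    return math.log1p(math.exp(-margin))


def weighted_dpo_loss(pair_logps, sampled_correct, beta=0.1):
    """Scale the DPO loss of one preference pair by its difficulty weight."""
    return difficulty_weight(sampled_correct) * dpo_loss(*pair_logps, beta=beta)


# A prompt the model answered correctly in 2 of 4 samples gets weight 1.5.
loss = weighted_dpo_loss((0.0, 0.0, 0.0, 0.0), [True, True, False, False])
```

Because the weight only rescales each pair's loss, the same wrapper could in principle sit around other preference objectives, which is the sense in which such a scheme is plug-and-play.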

Title: Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria

Authors: Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.21006
Pdf URL: https://arxiv.org/pdf/2412.21006
Copy Paste: [[2412.21006]] Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria(https://arxiv.org/abs/2412.21006)
Keywords: large language model
Abstract: Large Language Models (LLMs) rely on generating extensive intermediate reasoning units (e.g., tokens, sentences) to enhance final answer quality across a wide range of complex tasks. While generating multiple reasoning paths or iteratively refining rationales proves effective for improving performance, these approaches inevitably result in significantly higher inference costs. In this work, we propose a novel sentence-level rationale reduction training framework that leverages likelihood-based criteria, verbosity, to identify and remove redundant reasoning sentences. Unlike previous approaches that utilize token-level reduction, our sentence-level reduction framework maintains model performance while reducing generation length. This preserves the original reasoning abilities of LLMs and achieves an average 17.15% reduction in generation costs across various models and tasks.

Title: Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline

Authors: Nicola Messina, Lucia Vadicamo, Leo Maltese, Claudio Gennaro
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2412.21009
Pdf URL: https://arxiv.org/pdf/2412.21009
Copy Paste: [[2412.21009]] Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline(https://arxiv.org/abs/2412.21009)
Keywords: robust
Abstract: Recent advancements in deep learning have significantly enhanced content-based retrieval methods, notably through models like CLIP that map images and texts into a shared embedding space. However, these methods often struggle with domain-specific entities and long-tail concepts absent from their training data, particularly in identifying specific individuals. In this paper, we explore the task of identity-aware cross-modal retrieval, which aims to retrieve images of persons in specific contexts based on natural language queries. This task is critical in various scenarios, such as for searching and browsing personalized video collections or large audio-visual archives maintained by national broadcasters. We introduce a novel dataset, COCO Person FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched with deepfake-generated faces from VGGFace2. This dataset addresses the lack of large-scale datasets needed for training and evaluating models for this task. Our experiments assess the performance of different CLIP variations repurposed for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which achieves competitive retrieval performance through targeted fine-tuning. Our contributions lay the groundwork for more robust cross-modal retrieval systems capable of recognizing long-tail identities and contextual nuances. Data and code are available at this https URL.

Title: MapQaTor: A System for Efficient Annotation of Map Query Datasets

Authors: Mahir Labib Dihan, Mohammed Eunus Ali, Md Rizwan Parvez
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2412.21015
Pdf URL: https://arxiv.org/pdf/2412.21015
Copy Paste: [[2412.21015]] MapQaTor: A System for Efficient Annotation of Map Query Datasets(https://arxiv.org/abs/2412.21015)
Keywords: large language model
Abstract: Mapping and navigation services like Google Maps, Apple Maps, and OpenStreetMap are essential for accessing various location-based data, yet they often struggle to handle natural language geospatial queries. Recent advancements in Large Language Models (LLMs) show promise in question answering (QA), but creating reliable geospatial QA datasets from map services remains challenging. We introduce MapQaTor, a web application that streamlines the creation of reproducible, traceable map-based QA datasets. With its plug-and-play architecture, MapQaTor enables seamless integration with any maps API, allowing users to gather and visualize data from diverse sources with minimal setup. By caching API responses, the platform ensures consistent ground truth, enhancing the reliability of the data even as real-world information evolves. MapQaTor centralizes data retrieval, annotation, and visualization within a single platform, offering a unique opportunity to evaluate the current state of LLM-based geospatial reasoning while advancing their capabilities for improved geospatial understanding. Evaluation metrics show that MapQaTor speeds up the annotation process by at least 30 times compared to manual methods, underscoring its potential for developing geospatial resources, such as complex map reasoning datasets. The website is live at: this https URL and a demo video is available at: this https URL.

Title: Text Classification: Neural Networks VS Machine Learning Models VS Pre-trained Models

Authors: Christos Petridis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.21022
Pdf URL: https://arxiv.org/pdf/2412.21022
Copy Paste: [[2412.21022]] Text Classification: Neural Networks VS Machine Learning Models VS Pre-trained Models(https://arxiv.org/abs/2412.21022)
Keywords: transformer
Abstract: Text classification is a very common task nowadays and there are many efficient methods and algorithms that we can employ to accomplish it. Transformers have revolutionized the field of deep learning, particularly in Natural Language Processing (NLP), and have rapidly expanded to other domains such as computer vision, time-series analysis and more. The transformer model was first introduced in the context of machine translation and its architecture relies on self-attention mechanisms to capture complex relationships within data sequences. It is able to handle long-range dependencies more effectively than traditional neural networks (such as Recurrent Neural Networks and Multilayer Perceptrons). In this work, we present a comparison between different techniques to perform text classification. We take into consideration seven pre-trained models, three standard neural networks and three machine learning models. For standard neural networks and machine learning models we also compare two embedding techniques: TF-IDF and GloVe, with the latter consistently outperforming the former. Finally, we demonstrate the results from our experiments where pre-trained models such as BERT and DistilBERT always perform better than standard models/algorithms.

Title: Improving Location-based Thermal Emission Side-Channel Analysis Using Iterative Transfer Learning

Authors: Tun-Chieh Lou, Chung-Che Wang, Jyh-Shing Roger Jang, Henian Li, Lang Lin, Norman Chang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.21030
Pdf URL: https://arxiv.org/pdf/2412.21030
Copy Paste: [[2412.21030]] Improving Location-based Thermal Emission Side-Channel Analysis Using Iterative Transfer Learning(https://arxiv.org/abs/2412.21030)
Keywords: attack
Abstract: This paper proposes the use of iterative transfer learning applied to deep learning models for side-channel attacks. Currently, most of the side-channel attack methods train a model for each individual byte, without considering the correlation between bytes. However, since the models' parameters for attacking different bytes may be similar, we can leverage transfer learning, meaning that we first train the model for one of the key bytes, then use the trained model as a pretrained model for the remaining bytes. This technique can be applied iteratively, a process known as iterative transfer learning. Experimental results show that when using thermal or power consumption map images as input, and multilayer perceptron or convolutional neural network as the model, our method improves average performance, especially when the amount of data is insufficient.

Title: GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

Authors: Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.21036
Pdf URL: https://arxiv.org/pdf/2412.21036
Copy Paste: [[2412.21036]] GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models(https://arxiv.org/abs/2412.21036)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have achieved significant advancements in integrating visual and linguistic understanding. While existing benchmarks evaluate these models in context-rich, real-life scenarios, they often overlook fundamental perceptual skills essential for environments deviating from everyday realism. In particular, geometric perception, the ability to interpret spatial relationships and abstract visual patterns, remains underexplored. To address this limitation, we introduce GePBench, a novel benchmark designed to assess the geometric perception capabilities of MLLMs. Results from extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in such tasks. Additionally, we demonstrate that models trained with data sourced from GePBench show notable improvements on a wide range of downstream tasks, underscoring the importance of geometric perception as a foundation for advanced multimodal applications. Our code and datasets will be publicly available.

Title: Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration

Authors: Wanglong Lu, Jikai Wang, Tao Wang, Kaihao Zhang, Xianta Jiang, Hanli Zhao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.21042
Pdf URL: https://arxiv.org/pdf/2412.21042
Copy Paste: [[2412.21042]] Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration(https://arxiv.org/abs/2412.21042)
Keywords: extraction, diffusion, generative
Abstract: Blind face restoration aims to recover high-quality facial images from various unidentified sources of degradation, posing significant challenges due to the minimal information retrievable from the degraded images. Prior knowledge-based methods, leveraging geometric priors and facial features, have led to advancements in face restoration but often fall short of capturing fine details. To address this, we introduce a visual style prompt learning framework that utilizes diffusion probabilistic models to explicitly generate visual prompts within the latent space of pre-trained generative models. These prompts are designed to guide the restoration process. To fully utilize the visual prompts and enhance the extraction of informative and rich patterns, we introduce a style-modulated aggregation transformation layer. Extensive experiments and applications demonstrate the superiority of our method in achieving high-quality blind face restoration. The source code is available at this https URL.

Title: E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Authors: Zhiyu Tan, WenXu Qian, Hesen Chen, Mengping Yang, Lei Chen, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21044
Pdf URL: https://arxiv.org/pdf/2412.21044
Copy Paste: [[2412.21044]] E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models(https://arxiv.org/abs/2412.21044)
Keywords: robust, diffusion, generative
Abstract: Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across various tasks. However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions like perceptual and adversarial losses during training. To address these challenges, we propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output. Our method eliminates the training-sampling gap, mitigates information leakage by treating the training process as a direct mapping from pure noise to the target data distribution, and enables the integration of perceptual and adversarial losses into the objective. Extensive experiments on benchmarks such as COCO30K and HW30K demonstrate that our approach consistently outperforms traditional diffusion models, achieving superior results in terms of FID and CLIP score, even with reduced sampling steps. These findings highlight the potential of end-to-end training to advance diffusion-based generative models toward more robust and efficient solutions.

Title: Learning Epidemiological Dynamics via the Finite Expression Method

Authors: Jianda Du, Senwei Liang, Chunmei Wang
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2412.21049
Pdf URL: https://arxiv.org/pdf/2412.21049
Copy Paste: [[2412.21049]] Learning Epidemiological Dynamics via the Finite Expression Method(https://arxiv.org/abs/2412.21049)
Keywords: interpretability
Abstract: Modeling and forecasting the spread of infectious diseases is essential for effective public health decision-making. Traditional epidemiological models rely on expert-defined frameworks to describe complex dynamics, while neural networks, despite their predictive power, often lack interpretability due to their "black-box" nature. This paper introduces the Finite Expression Method (FEX), a symbolic learning framework that leverages reinforcement learning to derive explicit mathematical expressions for epidemiological dynamics. Through numerical experiments on both synthetic and real-world datasets, FEX demonstrates high accuracy in modeling and predicting disease spread, while uncovering explicit relationships among epidemiological variables. These results highlight FEX as a powerful tool for infectious disease modeling, combining interpretability with strong predictive performance to support practical applications in public health.

Title: Toward Intelligent and Secure Cloud: Large Language Model Empowered Proactive Defense

Authors: Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen
Subjects: cs.CR, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.21051
Pdf URL: https://arxiv.org/pdf/2412.21051
Copy Paste: [[2412.21051]] Toward Intelligent and Secure Cloud: Large Language Model Empowered Proactive Defense(https://arxiv.org/abs/2412.21051)
Keywords: secure, security, defense, attack, generative, large language model
Abstract: The rapid evolution of cloud computing technologies and the increasing number of cloud applications have provided a large number of benefits in daily life. However, the diversity and complexity of different components pose a significant challenge to cloud security, especially when dealing with sophisticated and advanced cyberattacks. Recent advancements in generative foundation models (GFMs), particularly in large language models (LLMs), offer promising solutions for security intelligence. By exploiting their powerful abilities in language understanding, data analysis, task inference, action planning, and code generation, we present LLM-PD, a novel proactive defense architecture that defeats various threats in a proactive manner. LLM-PD can efficiently make decisions through comprehensive data analysis and sequential reasoning, and can dynamically create and deploy actionable defense mechanisms on the target cloud. Furthermore, it can flexibly self-evolve based on experience learned from previous interactions and adapt to new attack scenarios without additional training. The experimental results demonstrate its remarkable ability in terms of defense effectiveness and efficiency, particularly highlighting an outstanding success rate when compared with other existing methods.

Title: Towards Effective Discrimination Testing for Generative AI

Authors: Thomas P. Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.21052
Pdf URL: https://arxiv.org/pdf/2412.21052
Copy Paste: [[2412.21052]] Towards Effective Discrimination Testing for Generative AI(https://arxiv.org/abs/2412.21052)
Keywords: fair, generative
Abstract: Generative AI (GenAI) models present new challenges in regulating against discriminatory behavior. In this paper, we argue that GenAI fairness research still has not met these challenges; instead, a significant gap remains between existing bias assessment methods and regulatory goals. This leads to ineffective regulation that can allow deployment of reportedly fair, yet actually discriminatory, GenAI systems. Towards remedying this problem, we connect the legal and technical literature around GenAI bias evaluation and identify areas of misalignment. Through four case studies, we demonstrate how this misalignment between fairness testing techniques and regulatory goals can result in discriminatory outcomes in real-world deployments, especially in adaptive or complex environments. We offer practical recommendations for improving discrimination testing to better align with regulatory goals and enhance the reliability of fairness assessments in future deployments.

Title: BridgePure: Revealing the Fragility of Black-box Data Protection

Authors: Yihan Wang, Yiwei Lu, Xiao-Shan Gao, Gautam Kamath, Yaoliang Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.21061
Pdf URL: https://arxiv.org/pdf/2412.21061
Copy Paste: [[2412.21061]] BridgePure: Revealing the Fragility of Black-box Data Protection(https://arxiv.org/abs/2412.21061)
Keywords: protect, attack, diffusion
Abstract: Availability attacks, or unlearnable examples, are defensive techniques that allow data owners to modify their datasets in ways that prevent unauthorized machine learning models from learning effectively while maintaining the data's intended functionality. This has led to the release of popular black-box tools for users to upload personal data and receive protected counterparts. In this work, we show such black-box protections can be substantially bypassed if a small set of unprotected in-distribution data is available. Specifically, an adversary can (1) easily acquire (unprotected, protected) pairs by querying the black-box protections with the unprotected dataset; and (2) train a diffusion bridge model to build a mapping. This mapping, termed BridgePure, can effectively remove the protection from any previously unseen data within the same distribution. Under this threat model, our method demonstrates superior purification performance on classification and style mimicry tasks, exposing critical vulnerabilities in black-box data protection.

Title: Varformer: Adapting VAR's Generative Prior for Image Restoration

Authors: Siyang Wang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21063
Pdf URL: https://arxiv.org/pdf/2412.21063
Copy Paste: [[2412.21063]] Varformer: Adapting VAR's Generative Prior for Image Restoration(https://arxiv.org/abs/2412.21063)
Keywords: diffusion, generative
Abstract: Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structures and fine-grained details through the autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during the image reconstruction process utilizing VAR, scale predictions automatically modulate the input, facilitating the alignment of representations at subsequent scales with the distribution of clean images. To harness VAR's adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our delicately designed VarFormer framework. The strategic application of these priors enables our VarFormer to achieve remarkable generalization on unseen tasks while also reducing training computational costs. Extensive experiments underscore that our VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.

Title: Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight Task-Specific Adapters for Automatic Scoring

Authors: Ehsan Latif, Xiaoming Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.21065
Pdf URL: https://arxiv.org/pdf/2412.21065
Copy Paste: [[2412.21065]] Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight Task-Specific Adapters for Automatic Scoring(https://arxiv.org/abs/2412.21065)
Keywords: fair
Abstract: The integration of Artificial Intelligence (AI) in education requires scalable and efficient frameworks that balance performance, adaptability, and cost. This paper addresses these needs by proposing a shared backbone model architecture enhanced with lightweight LoRA adapters for task-specific fine-tuning, targeting the automated scoring of student responses across 27 mutually exclusive tasks. By achieving competitive performance (average QWK of 0.848 compared to 0.888 for fully fine-tuned models) while reducing GPU memory consumption by 60% and inference latency by 40%, the framework demonstrates significant efficiency gains. This approach aligns with the workshops' focus on improving language models for educational tasks, creating responsible innovations for cost-sensitive deployment, and supporting educators by streamlining assessment workflows. The findings underscore the potential of scalable AI to enhance learning outcomes while maintaining fairness and transparency in automated scoring systems.
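
Editor's sketch (not from the paper): QWK above refers to quadratic weighted kappa, the agreement statistic commonly used for automated scoring. A minimal pure-Python version of the standard formula is shown below; the function name and the integer label encoding are choices made for this illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Cohen's kappa with quadratic weights.

    Both inputs are equal-length lists of integer scores in [0, num_labels).
    Returns 1.0 for perfect agreement and about 0.0 for chance-level agreement.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters used one identical label throughout: treat as agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


human = [0, 1, 2, 3]
model = [0, 1, 2, 3]
kappa = quadratic_weighted_kappa(human, model, num_labels=4)
```

The normalizing constant usually written into the weights, (num_labels - 1) squared, cancels between numerator and denominator, so it is omitted here.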

Title: Edicho: Consistent Image Editing in the Wild

Authors: Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21079
Pdf URL: https://arxiv.org/pdf/2412.21079
Copy Paste: [[2412.21079]] Edicho: Consistent Image Editing in the Wild(https://arxiv.org/abs/2412.21079)
Keywords: diffusion
Abstract: As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.

Title: Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Authors: Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Yali Wang, Yu Qiao, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21080
Pdf URL: https://arxiv.org/pdf/2412.21080
Copy Paste: [[2412.21080]] Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model(https://arxiv.org/abs/2412.21080)
Keywords: robust
Abstract: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real-time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for the development of the device in conjunction with a demo web platform to test uploaded videos at this https URL.

Title: On the Generalizability of Machine Learning-based Ransomware Detection in Block Storage

Authors: Nicolas Reategui, Roman Pletka, Dionysios Diamantopoulos
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.21084
Pdf URL: https://arxiv.org/pdf/2412.21084
Copy Paste: [[2412.21084]] On the Generalizability of Machine Learning-based Ransomware Detection in Block Storage(https://arxiv.org/abs/2412.21084)
Keywords: security, attack, robust
Abstract: Ransomware represents a pervasive threat, traditionally countered at the operating system, file-system, or network levels. However, these approaches often introduce significant overhead and remain susceptible to circumvention by attackers. Recent research has started looking into the detection of ransomware by observing block IO operations. However, this approach exhibits significant detection challenges. Recognizing these limitations, our research pivots towards enabling robust ransomware detection in storage systems, keeping in mind the limited computational resources available to them. To perform our studies, we propose a kernel-based framework capable of efficiently extracting and analyzing IO operations to identify ransomware activity. The framework can be adopted to storage systems using computational storage devices to improve security and fully hide detection overheads. Our method employs a refined set of computationally light features optimized for ML models to accurately discern malicious from benign activities. Using this lightweight approach, we study a wide range of generalizability aspects and analyze the performance of these models across a large space of setups and configurations covering a wide range of realistic scenarios. We reveal various trade-offs and provide strong arguments for the generalizability of storage-based detection of ransomware and show that our approach outperforms currently available ML-based ransomware detection in storage. Empirical validation reveals that our decision tree-based models achieve remarkable effectiveness, evidenced by higher median F1 scores of up to 12.8%, lower false negative rates of up to 10.9% and particularly decreased false positive rates of up to 17.1% compared to existing storage-based detection approaches.

Title: Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation

Authors: Yuanbo Yang, Jiahao Shao, Xinyang Li, Yujun Shen, Andreas Geiger, Yiyi Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21117
Pdf URL: https://arxiv.org/pdf/2412.21117
Copy Paste: [[2412.21117]] Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation(https://arxiv.org/abs/2412.21117)
Keywords: diffusion
Abstract: In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: this https URL

Title: ExpShield: Safeguarding Web Text from Unauthorized Crawling and Language Modeling Exploitation

Authors: Ruixuan Liu, Toan Tran, Tianhao Wang, Hongsheng Hu, Shuo Wang, Li Xiong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.21123
Pdf URL: https://arxiv.org/pdf/2412.21123
Copy Paste: [[2412.21123]] ExpShield: Safeguarding Web Text from Unauthorized Crawling and Language Modeling Exploitation(https://arxiv.org/abs/2412.21123)
Keywords: protect, defense, large language model
Abstract: As large language models (LLMs) increasingly depend on web-scraped datasets, concerns over unauthorized use of copyrighted or personal content for training have intensified. Despite regulations such as the General Data Protection Regulation (GDPR), data owners still have limited control over the use of their content in model training. To address this, we propose ExpShield, a proactive self-guard mechanism that empowers content owners to embed invisible perturbations into their text, limiting data misuse in LLM training without affecting readability. This preemptive approach enables data owners to protect sensitive content directly, without relying on a third party to perform defense. Starting from the random perturbation, we demonstrate the rationale for using perturbation to conceal protected content. We further enhance the efficiency by identifying memorization triggers and creating pitfalls to divert the model's memorization in a more focused way. To validate our defense's effectiveness, we propose a novel metric of instance exploitation which captures the individual risk raised by model training. The experimental results validate the effectiveness of our approach as the MIA AUC decreases from 0.95 to 0.55, and instance exploitation approaches zero. This suggests that the individual risk does not increase after training, underscoring the significance of proactive defenses in protecting copyrighted data.

Title: Facilitating large language model Russian adaptation with Learned Embedding Propagation

Authors: Mikhail Tikhomirov, Daniil Chernyshev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.21140
Pdf URL: https://arxiv.org/pdf/2412.21140
Copy Paste: [[2412.21140]] Facilitating large language model Russian adaptation with Learned Embedding Propagation(https://arxiv.org/abs/2412.21140)
Keywords: large language model
Abstract: Rapid advancements in large language model (LLM) technologies have led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, the authors of such models do not disclose the training data necessary for replication of the results, thus making the achievements model-exclusive. Since those open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference computation efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options such as vocabulary extension and subsequent continued pre-training are also inhibited by the lack of access to high-quality instruction-tuning data, since it is the major factor behind the resulting LLM task-solving capabilities. To address the limitations and cut the costs of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data size requirements due to minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implanting the new language knowledge directly into any existing instruct-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.

Title: PyG-SSL: A Graph Self-Supervised Learning Toolkit

Authors: Lecheng Zheng, Baoyu Jing, Zihao Li, Zhichen Zeng, Tianxin Wei, Mengting Ai, Xinrui He, Lihui Liu, Dongqi Fu, Jiaxuan You, Hanghang Tong, Jingrui He
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.21151
Pdf URL: https://arxiv.org/pdf/2412.21151
Copy Paste: [[2412.21151]] PyG-SSL: A Graph Self-Supervised Learning Toolkit(https://arxiv.org/abs/2412.21151)
Keywords: robust
Abstract: Graph Self-Supervised Learning (SSL) has emerged as a pivotal area of research in recent years. By engaging in pretext tasks to learn the intricate topological structures and properties of graphs using unlabeled data, these graph SSL models achieve enhanced performance, improved generalization, and heightened robustness. Despite the remarkable achievements of these graph SSL methods, their current implementation poses significant challenges for beginners and practitioners due to the complex nature of graph structures, while inconsistent evaluation metrics and concerns regarding reproducibility hinder further progress in this field. Recognizing the growing interest within the research community, there is an urgent need for a comprehensive, beginner-friendly, and accessible toolkit consisting of the most representative graph SSL algorithms. To address these challenges, we present a Graph SSL toolkit named PyG-SSL, which is built upon PyTorch and is compatible with various deep learning and scientific computing backends. Within the toolkit, we offer a unified framework encompassing dataset loading, hyper-parameter configuration, model training, and comprehensive performance evaluation for diverse downstream tasks. Moreover, we provide beginner-friendly tutorials and the best hyper-parameters of each graph SSL algorithm on different graph datasets, facilitating the reproduction of results. The GitHub repository of the library is this https URL.

Title: Unified dimensionality reduction techniques in chronic liver disease detection

Authors: Anand Karna, Naina Khan, Rahul Rauniyar, Prashant Giridhar Shambharkar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.21156
Pdf URL: https://arxiv.org/pdf/2412.21156
Copy Paste: [[2412.21156]] Unified dimensionality reduction techniques in chronic liver disease detection(https://arxiv.org/abs/2412.21156)
Keywords: extraction
Abstract: Globally, chronic liver disease continues to be a major health concern that requires precise predictive models for prompt detection and treatment. Using the Indian Liver Patient Dataset (ILPD) from the University of California at Irvine's UCI Machine Learning Repository, a number of machine learning algorithms are investigated in this study. The main focus of our research is this dataset, which includes the medical records of 583 patients, 416 of whom have been diagnosed with liver disease and 167 of whom have not. There are several aspects to this work, including feature extraction and dimensionality reduction methods like Linear Discriminant Analysis (LDA), Factor Analysis (FA), t-distributed Stochastic Neighbour Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The purpose of the study is to investigate how well these approaches work for converting high-dimensional datasets and improving prediction accuracy. To assess the prediction ability of the improved models, a number of classification methods were used, such as Multi-layer Perceptron, Random Forest, K-nearest neighbours, and Logistic Regression. Remarkably, the improved models performed admirably, with Random Forest having the highest accuracy of 98.31% in 10-fold cross-validation and 95.79% in train-test split evaluation. Findings offer important new perspectives on the choice and use of customized feature extraction and dimensionality reduction methods, which improve predictive models for patients with chronic liver disease.

Title: A Large-Scale Study on Video Action Dataset Condensation

Authors: Yang Chen, Sheng Guo, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21197
Pdf URL: https://arxiv.org/pdf/2412.21197
Copy Paste: [[2412.21197]] A Large-Scale Study on Video Action Dataset Condensation(https://arxiv.org/abs/2412.21197)
Keywords: fair
Abstract: Dataset condensation has made significant progress in the image domain. Unlike images, videos possess an additional temporal dimension, which harbors considerable redundant information, making condensation even more crucial. However, video dataset condensation remains an underexplored area. We aim to bridge this gap by providing a large-scale empirical study with systematic design and fair comparison. Specifically, our work delves into three key aspects to provide valuable empirical insights: (1) temporal processing of video data, (2) establishing a comprehensive evaluation protocol for video dataset condensation, and (3) adaptation of condensation methods to the space-time domain and fair comparisons among them. From this study, we derive several intriguing observations: (i) sample diversity appears to be more crucial than temporal diversity for video dataset condensation, (ii) simple sliding-window sampling proves to be effective, and (iii) sample selection currently outperforms dataset distillation in most cases. Furthermore, we conduct experiments on three prominent action recognition datasets (HMDB51, UCF101 and Kinetics-400) and achieve state-of-the-art results on all of them. Our code is available at this https URL.

Title: PERSE: Personalized 3D Generative Avatars from A Single Portrait

Authors: Hyunsoo Cha, Inhee Lee, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21206
Pdf URL: https://arxiv.org/pdf/2412.21206
Copy Paste: [[2412.21206]] PERSE: Personalized 3D Generative Avatars from A Single Portrait(https://arxiv.org/abs/2412.21206)
Keywords: generative
Abstract: We present PERSE, a method for building an animatable personalized generative avatar from a reference portrait. Our avatar model enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute, while preserving the individual's identity. To achieve this, our method begins by synthesizing large-scale synthetic 2D video datasets, where each video contains consistent changes in the facial expression and viewpoint, combined with a variation in a specific facial attribute from the original input. We propose a novel pipeline to produce high-quality, photorealistic 2D videos with facial attribute editing. Leveraging this synthetic attribute dataset, we present a personalized avatar creation method based on 3D Gaussian Splatting, learning a continuous and disentangled latent space for intuitive facial attribute manipulation. To enforce smooth transitions in this latent space, we introduce a latent space regularization technique by using interpolated 2D faces as supervision. Compared to previous approaches, we demonstrate that PERSE generates high-quality avatars with interpolated attributes while preserving the identity of the reference person.