2024-12-20

Title: Improving Generalization Performance of YOLOv8 for Camera Trap Object Detection

Authors: Aroj Subedi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14211
Pdf URL: https://arxiv.org/pdf/2412.14211
Copy Paste: [[2412.14211]] Improving Generalization Performance of YOLOv8 for Camera Trap Object Detection(https://arxiv.org/abs/2412.14211)
Keywords: robust
Abstract: Camera traps have become integral tools in wildlife conservation, providing non-intrusive means to monitor and study wildlife in their natural habitats. The utilization of object detection algorithms to automate species identification from Camera Trap images is of huge importance for research and conservation purposes. However, the generalization issue, where the trained model is unable to apply its learnings to a never-before-seen dataset, is prevalent. This thesis explores the enhancements made to the YOLOv8 object detection algorithm to address the problem of generalization. The study delves into the limitations of the baseline YOLOv8 model, emphasizing its struggles with generalization in real-world environments. To overcome these limitations, enhancements are proposed, including the incorporation of a Global Attention Mechanism (GAM) module, modified multi-scale feature fusion, and Wise Intersection over Union (WIoUv3) as a bounding box regression loss function. A thorough evaluation and ablation experiments reveal the improved model's ability to suppress the background noise, focus on object properties, and exhibit robust generalization in novel environments. The proposed enhancements not only address the challenges inherent in camera trap datasets but also pave the way for broader applicability in real-world conservation scenarios, ultimately aiding in the effective management of wildlife populations and habitats.

Title: Heterogeneous Multi-Agent Reinforcement Learning for Distributed Channel Access in WLANs

Authors: Jiaming Yu, Le Liang, Chongtao Guo, Ziyang Guo, Shi Jin, Geoffrey Ye Li
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.14218
Pdf URL: https://arxiv.org/pdf/2412.14218
Copy Paste: [[2412.14218]] Heterogeneous Multi-Agent Reinforcement Learning for Distributed Channel Access in WLANs(https://arxiv.org/abs/2412.14218)
Keywords: robust, fair
Abstract: This paper investigates the use of multi-agent reinforcement learning (MARL) to address distributed channel access in wireless local area networks. In particular, we consider the challenging yet more practical case where the agents heterogeneously adopt value-based or policy-based reinforcement learning algorithms to train the model. We propose a heterogeneous MARL training framework, named QPMIX, which adopts a centralized training with distributed execution paradigm to enable heterogeneous agents to collaborate. Moreover, we theoretically prove the convergence of the proposed heterogeneous MARL method when using the linear value function approximation. Our method maximizes the network throughput and ensures fairness among stations, thereby enhancing the overall network performance. Simulation results demonstrate that the proposed QPMIX algorithm improves throughput, mean delay, delay jitter, and collision rates compared with conventional carrier-sense multiple access with collision avoidance in the saturated traffic scenario. Furthermore, QPMIX is shown to be robust in unsaturated and delay-sensitive traffic scenarios, and promotes cooperation among heterogeneous agents.

Title: Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing

Authors: Le-Anh Tran, Dong-Chul Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14220
Pdf URL: https://arxiv.org/pdf/2412.14220
Copy Paste: [[2412.14220]] Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing(https://arxiv.org/abs/2412.14220)
Keywords: transformer, generative
Abstract: This paper proposes a lightweight neural network designed for realistic image dehazing, utilizing a Distilled Pooling Transformer Encoder, named DPTE-Net. Recently, while vision transformers (ViTs) have achieved great success in various vision tasks, their self-attention (SA) module's complexity scales quadratically with image resolution, hindering their applicability on resource-constrained devices. To overcome this, the proposed DPTE-Net substitutes traditional SA modules with efficient pooling mechanisms, significantly reducing computational demands while preserving ViTs' learning capabilities. To further enhance semantic feature learning, a distillation-based training process is implemented which transfers rich knowledge from a larger teacher network to DPTE-Net. Additionally, DPTE-Net is trained within a generative adversarial network (GAN) framework, leveraging the strong generalization of GAN in image restoration, and employs a transmission-aware loss function to dynamically adapt to varying haze densities. Experimental results on various benchmark datasets have shown that the proposed DPTE-Net can achieve competitive dehazing performance when compared to state-of-the-art methods while maintaining low computational complexity, making it a promising solution for resource-limited applications. The code of this work is available at this https URL.

Title: FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning

Authors: Jordan Slessor, Dezheng Kong, Xiaofen Tang, Zheng En Than, Linglong Kong
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.14226
Pdf URL: https://arxiv.org/pdf/2412.14226
Copy Paste: [[2412.14226]] FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning(https://arxiv.org/abs/2412.14226)
Keywords: privacy, federate
Abstract: Federated learning (FL) is a machine learning methodology that involves the collaborative training of a global model across multiple decentralized clients in a privacy-preserving way. Several FL methods have been introduced to tackle communication inefficiencies, but they do not address how to sample participating clients in each round effectively and in a privacy-preserving manner. In this paper, we propose \textit{FedSTaS}, a client and data-level sampling method inspired by \textit{FedSTS} and \textit{FedSampling}. In each federated learning round, \textit{FedSTaS} stratifies clients based on their compressed gradients, re-allocates the number of clients to sample using an optimal Neyman allocation, and samples local data from each participating client using a data uniform sampling strategy. Experiments on three datasets show that \textit{FedSTaS} can achieve higher accuracy scores than those of \textit{FedSTS} within a fixed number of training rounds.
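Note: The Neyman allocation mentioned in the abstract can be illustrated with a minimal sketch. This is the textbook stratified-sampling formula, not the FedSTaS implementation: a budget of n samples is split across strata in proportion to N_h * S_h (stratum size times within-stratum standard deviation). The function name, the largest-remainder rounding, and the fallback for zero variance are assumptions made for illustration only.

```python
def neyman_allocation(stratum_sizes, stratum_stds, total_samples):
    """Split total_samples across strata in proportion to N_h * S_h.

    Illustrative only: rounding and the zero-variance fallback are
    assumptions, not details taken from the FedSTaS paper.
    """
    if len(stratum_sizes) != len(stratum_stds):
        raise ValueError("sizes and stds must have the same length")

    # Neyman weights: stratum size times within-stratum standard deviation.
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        # Assumed fallback: proportional allocation when every std is zero.
        weights = list(stratum_sizes)
        total = sum(weights)
        if total == 0:
            raise ValueError("all strata are empty")

    # Real-valued allocation, then largest-remainder rounding so the
    # integer counts sum exactly to total_samples.
    raw = [total_samples * w / total for w in weights]
    alloc = [int(r) for r in raw]
    leftover = total_samples - sum(alloc)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three hypothetical client strata; the most variable one gets the most samples.
    print(neyman_allocation([100, 50, 50], [1.0, 2.0, 4.0], 10))
```

The sketch does not cap an allocation at the stratum size; a practical version would need to redistribute any excess.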

Title: ViTmiX: Vision Transformer Explainability Augmented by Mixed Visualization Methods

Authors: Eduard Hogea, Darian M. Onchis, Ana Coporan, Adina Magda Florea, Codruta Istin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14231
Pdf URL: https://arxiv.org/pdf/2412.14231
Copy Paste: [[2412.14231]] ViTmiX: Vision Transformer Explainability Augmented by Mixed Visualization Methods(https://arxiv.org/abs/2412.14231)
Keywords: robust, interpretability, explainability, transformer, segmentation
Abstract: Recent advancements in Vision Transformers (ViT) have demonstrated exceptional results in various visual recognition tasks, owing to their ability to capture long-range dependencies in images through self-attention mechanisms. However, the complex nature of ViT models requires robust explainability methods to unveil their decision-making processes. Explainable Artificial Intelligence (XAI) plays a crucial role in improving model transparency and trustworthiness by providing insights into model predictions. Current approaches to ViT explainability, based on visualization techniques such as Layer-wise Relevance Propagation (LRP) and gradient-based methods, have shown promising but sometimes limited results. In this study, we explore a hybrid approach that mixes multiple explainability techniques to overcome these limitations and enhance the interpretability of ViT models. Our experiments reveal that this hybrid approach significantly improves the interpretability of ViT models compared to individual methods. We also introduce modifications to existing techniques, such as using geometric mean for mixing, which demonstrates notable results in object segmentation tasks. To quantify the explainability gain, we introduced a novel post-hoc explainability measure by applying the Pigeonhole principle. These findings underscore the importance of refining and optimizing explainability methods for ViT models, paving the way to reliable XAI-based segmentations.
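Note: The abstract mentions using a geometric mean to mix explainability outputs. A minimal sketch of that general idea follows, assuming two non-negative relevance maps of equal shape (for example, one from LRP and one from a gradient-based method). The function name, the nested-list representation, and the final rescaling are assumptions for illustration and are not taken from the paper.

```python
import math


def geometric_mean_mix(map_a, map_b):
    """Element-wise geometric mean of two non-negative 2D relevance maps.

    Illustrative only: a pixel scores high only when both inputs agree,
    because a zero in either map forces the mixed value to zero.
    """
    if len(map_a) != len(map_b) or any(len(ra) != len(rb) for ra, rb in zip(map_a, map_b)):
        raise ValueError("maps must have the same shape")

    mixed = [
        [math.sqrt(a * b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]

    # Assumed post-processing: rescale so the maximum is 1 for display.
    peak = max(max(row) for row in mixed)
    if peak > 0:
        mixed = [[v / peak for v in row] for row in mixed]
    return mixed


if __name__ == "__main__":
    lrp_like = [[0.0, 1.0], [0.25, 0.64]]    # hypothetical relevance values
    grad_like = [[1.0, 1.0], [1.0, 0.25]]    # hypothetical relevance values
    for row in geometric_mean_mix(lrp_like, grad_like):
        print(row)
```

Compared with an arithmetic mean, the geometric mean suppresses regions that only one method highlights, which is one plausible reason to prefer it when combining maps.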

Title: Split Learning in Computer Vision for Semantic Segmentation Delay Minimization

Authors: Nikos G. Evgenidis, Nikos A. Mitsiou, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, George K. Karagiannidis
Subjects: cs.CV, cs.AI, cs.DC, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14272
Pdf URL: https://arxiv.org/pdf/2412.14272
Copy Paste: [[2412.14272]] Split Learning in Computer Vision for Semantic Segmentation Delay Minimization(https://arxiv.org/abs/2412.14272)
Keywords: segmentation
Abstract: In this paper, we propose a novel approach to minimize the inference delay in semantic segmentation using split learning (SL), tailored to the needs of real-time computer vision (CV) applications for resource-constrained devices. Semantic segmentation is essential for applications such as autonomous vehicles and smart city infrastructure, but faces significant latency challenges due to high computational and communication loads. Traditional centralized processing methods are inefficient for such scenarios, often resulting in unacceptable inference delays. SL offers a promising alternative by partitioning deep neural networks (DNNs) between edge devices and a central server, enabling localized data processing and reducing the amount of data required for transmission. Our contribution includes the joint optimization of bandwidth allocation, cut layer selection of the edge devices' DNN, and the central server's processing resource allocation. We investigate both parallel and serial data processing scenarios and propose low-complexity heuristic solutions that maintain near-optimal performance while reducing computational requirements. Numerical results show that our approach effectively reduces inference delay, demonstrating the potential of SL for improving real-time CV applications in dynamic, resource-constrained environments.

Title: Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data

Authors: Shaina Raza, Drai Paulen-Patterson, Chen Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14276
Pdf URL: https://arxiv.org/pdf/2412.14276
Copy Paste: [[2412.14276]] Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data(https://arxiv.org/abs/2412.14276)
Keywords: robust, generative, large language model
Abstract: Fake news poses a significant threat to public opinion and social stability in modern society. This study presents a comparative evaluation of BERT-like encoder-only models and autoregressive decoder-only large language models (LLMs) for fake news detection. We introduce a dataset of news articles labeled with GPT-4 assistance (an AI-labeling method) and verified by human experts to ensure reliability. Both BERT-like encoder-only models and LLMs were fine-tuned on this dataset. Additionally, we developed an instruction-tuned LLM approach with majority voting during inference for label generation. Our analysis reveals that BERT-like models generally outperform LLMs in classification tasks, while LLMs demonstrate superior robustness against text perturbations. Compared to weak labels (distant supervision) data, the results show that AI labels with human supervision achieve better classification results. This study highlights the effectiveness of combining AI-based annotation with human oversight and demonstrates the performance of different families of machine learning models for fake news detection.
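Note: Majority voting at inference time, as mentioned in the abstract, can be sketched as follows. The abstract does not say how many generations are sampled or how ties are broken, so the function name, the example labels, and the first-seen tie-break are assumptions for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among several sampled model outputs.

    Illustrative only: ties go to the label that appeared first, which is
    an assumption rather than a detail from the paper.
    """
    if not labels:
        raise ValueError("need at least one label to vote on")
    counts = Counter(labels)
    # most_common keeps first-seen order among equally frequent labels.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical labels from three sampled generations for one article.
    print(majority_vote(["fake", "real", "fake"]))
```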

Title: PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation

Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, Di Niu
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.14283
Pdf URL: https://arxiv.org/pdf/2412.14283
Copy Paste: [[2412.14283]] PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation(https://arxiv.org/abs/2412.14283)
Keywords: diffusion
Abstract: Recent research explores the potential of Diffusion Models (DMs) for consistent object editing, which aims to modify object position, size, and composition, etc., while preserving the consistency of objects and background without changing their texture and attributes. Current inference-time methods often rely on DDIM inversion, which inherently compromises efficiency and the achievable consistency of edited images. Recent methods also utilize energy guidance which iteratively updates the predicted noise and can drive the latents away from the original image, resulting in distortions. In this paper, we propose PixelMan, an inversion-free and training-free method for achieving consistent object editing via Pixel Manipulation and generation, where we directly create a duplicate copy of the source object at the target location in the pixel space, and introduce an efficient sampling approach to iteratively harmonize the manipulated object into the target location and inpaint its original location, while ensuring image consistency by anchoring the edited image to be generated to the pixel-manipulated image as well as by introducing various consistency-preserving optimization techniques during inference. Experimental evaluations based on benchmark datasets as well as extensive visual comparisons show that in as few as 16 inference steps, PixelMan outperforms a range of state-of-the-art training-based and training-free methods (usually requiring 50 steps) on multiple consistent object editing tasks.

Title: TRecViT: A Recurrent Video Transformer

Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14294
Pdf URL: https://arxiv.org/pdf/2412.14294
Copy Paste: [[2412.14294]] TRecViT: A Recurrent Video Transformer(https://arxiv.org/abs/2412.14294)
Keywords: transformer
Abstract: We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, $12\times$ smaller memory footprint, and $5\times$ lower FLOPs count. Code and checkpoints will be made available online at this https URL.

Title: Distributionally Robust Policy Learning under Concept Drifts

Authors: Jingyuan Wang, Zhimei Ren, Ruohan Zhan, Zhengyuan Zhou
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.14297
Pdf URL: https://arxiv.org/pdf/2412.14297
Copy Paste: [[2412.14297]] Distributionally Robust Policy Learning under Concept Drifts(https://arxiv.org/abs/2412.14297)
Keywords: robust
Abstract: Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem -- robust policy learning under the concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.

Title: What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context

Authors: Jing Wang, Wonho Bae, Jiahong Chen, Kuangen Zhang, Leonid Sigal, Clarence W. de Silva
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14301
Pdf URL: https://arxiv.org/pdf/2412.14301
Copy Paste: [[2412.14301]] What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context(https://arxiv.org/abs/2412.14301)
Keywords: privacy
Abstract: Source-free domain adaptation (SFDA) involves adapting a model originally trained using a labeled dataset ({\em source domain}) to perform effectively on an unlabeled dataset ({\em target domain}) without relying on any source data during adaptation. This adaptation is especially crucial when significant disparities in data distributions exist between the two domains and when there are privacy concerns regarding the source model's training data. The absence of access to source data during adaptation makes it challenging to analytically estimate the domain gap. To tackle this issue, various techniques have been proposed, such as unsupervised clustering, contrastive learning, and continual learning. In this paper, we first conduct an extensive theoretical analysis of SFDA based on contrastive learning, primarily because it has demonstrated superior performance compared to other techniques. Motivated by the obtained insights, we then introduce a straightforward yet highly effective latent augmentation method tailored for contrastive SFDA. This augmentation method leverages the dispersion of latent features within the neighborhood of the query sample, guided by the source pre-trained model, to enhance the informativeness of positive keys. Our approach, based on a single InfoNCE-based contrastive loss, outperforms state-of-the-art SFDA methods on widely recognized benchmark datasets.

Title: Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

Authors: David Restrepo, Chenwei Wu, Zhengxu Tang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Cong-Tinh Dao, Jack Gallifant, Robyn Gayle Dychiao, Jose Carlo Artiaga, André Hiroshi Bando, Carolina Pelegrini Barbosa Gracitelli, Vincenz Ferrer, Leo Anthony Celi, Danielle Bitterman, Michael G Morley, Luis Filipe Nakayama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14304
Pdf URL: https://arxiv.org/pdf/2412.14304
Copy Paste: [[2412.14304]] Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs(https://arxiv.org/abs/2412.14304)
Keywords: large language model
Abstract: Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference time de-biasing method leveraging retrieval augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.

Title: Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning

Authors: Brett Barkley, David Fridovich-Keil
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14312
Pdf URL: https://arxiv.org/pdf/2412.14312
Copy Paste: [[2412.14312]] Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning(https://arxiv.org/abs/2412.14312)
Keywords: steal
Abstract: Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process -- the backbone of Dyna-style algorithms -- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.

Title: Covariances for Free: Exploiting Mean Distributions for Federated Learning with Pre-Trained Models

Authors: Dipam Goswami, Simone Magistri, Kai Wang, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14326
Pdf URL: https://arxiv.org/pdf/2412.14326
Copy Paste: [[2412.14326]] Covariances for Free: Exploiting Mean Distributions for Federated Learning with Pre-Trained Models(https://arxiv.org/abs/2412.14326)
Keywords: federate
Abstract: Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have investigated the use of first-order statistics and second-order statistics to aggregate local client data distributions at the server and achieve very high performance without any training. In this work we propose a training-free method based on an unbiased estimator of class covariance matrices. Our method, which only uses first-order statistics in the form of class means communicated by clients to the server, incurs only a fraction of the communication costs required by methods based on communicating second-order statistics. We show how these estimated class covariances can be used to initialize a linear classifier, thus exploiting the covariances without actually sharing them. When compared to state-of-the-art methods which also share only class means, our approach improves performance in the range of 4-26\% with exactly the same communication cost. Moreover, our method achieves performance competitive or superior to sharing second-order statistics with dramatically less communication overhead. Finally, using our method to initialize classifiers and then performing federated fine-tuning yields better and faster convergence. Code is available at this https URL.

Title: Personalized Generative Low-light Image Denoising and Enhancement

Authors: Xijun Wang, Prateek Chennuri, Yu Yuan, Bole Ma, Xingguang Zhang, Stanley Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14327
Pdf URL: https://arxiv.org/pdf/2412.14327
Copy Paste: [[2412.14327]] Personalized Generative Low-light Image Denoising and Enhancement(https://arxiv.org/abs/2412.14327)
Keywords: diffusion, generative
Abstract: While smartphone cameras today can produce astonishingly good photos, their performance in low light is still not completely satisfactory because of the fundamental limits in photon shot noise and sensor read noise. Generative image restoration methods have demonstrated promising results compared to traditional methods, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Recognizing the availability of personalized photo galleries on users' smartphones, we propose Personalized Generative Denoising (PGD) by building a diffusion model customized for different users. Our core innovation is an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer provides a strong prior that can be integrated with the diffusion model to restore the degraded images, without the need of fine-tuning. Over a wide range of low-light testing scenarios, we show that PGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches.

Title: Semantic Role Labeling of NomBank Partitives

Authors: Adam Meyers, Advait Pravin Savant, John E. Ortega
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14328
Pdf URL: https://arxiv.org/pdf/2412.14328
Copy Paste: [[2412.14328]] Semantic Role Labeling of NomBank Partitives(https://arxiv.org/abs/2412.14328)
Keywords: transformer
Abstract: This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using "gold" parses from the Penn Treebank and 91.12% when using the Berkeley Neural parser. This research includes both classroom and experimental settings for system development.

Title: Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

Authors: Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14333
Pdf URL: https://arxiv.org/pdf/2412.14333
Copy Paste: [[2412.14333]] Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters(https://arxiv.org/abs/2412.14333)
Keywords: diffusion
Abstract: Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address the challenges, in this paper, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.

Title: A Unifying Information-theoretic Perspective on Evaluating Generative Models

Authors: Alexis Fox, Samarth Swarup, Abhijin Adiga
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14340
Pdf URL: https://arxiv.org/pdf/2412.14340
Copy Paste: [[2412.14340]] A Unifying Information-theoretic Perspective on Evaluating Generative Models(https://arxiv.org/abs/2412.14340)
Keywords: generative
Abstract: Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.

Title: State Space Models are Strong Text Rerankers

Authors: Zhichao Xu, Jinghua Yan, Ashim Gupta, Vivek Srikumar
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.14354
Pdf URL: https://arxiv.org/pdf/2412.14354
Copy Paste: [[2412.14354]] State Space Models are Strong Text Rerankers(https://arxiv.org/abs/2412.14354)
Keywords: transformer
Abstract: Transformers dominate NLP and IR; but their inference inefficiencies and challenges in extrapolating to longer contexts have sparked interest in alternative model architectures. Among these, state space models (SSMs) like Mamba offer promising advantages, particularly $O(1)$ time complexity in inference. Despite their potential, SSMs' effectiveness at text reranking -- a task requiring fine-grained query-document interaction and long-context understanding -- remains underexplored. This study benchmarks SSM-based architectures (specifically, Mamba-1 and Mamba-2) against transformer-based models across various scales, architectures, and pre-training objectives, focusing on performance and efficiency in text reranking tasks. We find that (1) Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size; (2) they are less efficient in training and inference compared to transformers with flash attention; and (3) Mamba-2 outperforms Mamba-1 in both performance and efficiency. These results underscore the potential of state space models as a transformer alternative and highlight areas for improvement in future IR applications.

Title: Dynamic semantic VSLAM with known and unknown objects

Authors: Sanghyoup Gu, Ratnesh Kumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14359
Pdf URL: https://arxiv.org/pdf/2412.14359
Copy Paste: [[2412.14359]] Dynamic semantic VSLAM with known and unknown objects(https://arxiv.org/abs/2412.14359)
Keywords: segmentation
Abstract: Traditional Visual Simultaneous Localization and Mapping (VSLAM) systems assume a static environment, which makes them ineffective in highly dynamic settings. To overcome this, many approaches integrate semantic information from deep learning models to identify dynamic regions within images. However, these methods face a significant limitation as a supervised model cannot recognize objects not included in the training datasets. This paper introduces a novel feature-based Semantic VSLAM capable of detecting dynamic features in the presence of both known and unknown objects. By employing an unsupervised segmentation network, we achieve unlabeled segmentation, and then utilize an object detector to identify any of the known classes among those. We then pair this with the computed high-gradient optical-flow information to identify the static versus dynamic segmentations for both known and unknown object classes. A consistency check module is also introduced for further refinement and final classification into static versus dynamic features. Evaluations using public datasets demonstrate that our method outperforms traditional VSLAM when unknown objects are present in the images while still matching the performance of the leading semantic VSLAM techniques when the images contain only the known objects.

Title: ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Authors: Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.14363
Pdf URL: https://arxiv.org/pdf/2412.14363
Copy Paste: [[2412.14363]] ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals(https://arxiv.org/abs/2412.14363)
Keywords: large language model
Abstract: Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantization of all weight, activation and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed precision quantization scheme that minimizes error. With the Llama families of models, we demonstrate that ResQ outperforms recent uniform and mixed precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method SpinQuant, and a 2.4x speedup over 16-bit baseline. Code is available at this https URL.
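
Note: the idea of keeping a high-variance PCA subspace in higher precision can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the ResQ implementation: the helper names quantize and pca_mixed_precision are hypothetical, the quantizer is a simple symmetric per-tensor round-to-nearest scheme, and the random rotation step described in the abstract is omitted.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization with one scale for the whole tensor (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.round(x / scale).clip(-qmax, qmax) * scale

def pca_mixed_precision(acts, high_frac=0.125, low_bits=4, high_bits=8):
    """Quantize activations of shape (n_samples, d) in a PCA basis.

    The top high_frac of principal directions (highest variance) is kept at
    high_bits; the remaining coefficients are quantized to low_bits.
    """
    d = acts.shape[1]
    r = max(1, int(d * high_frac))
    cov = acts.T @ acts / acts.shape[0]
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    basis = eigvecs[:, ::-1]           # reorder columns: highest variance first
    coeffs = acts @ basis
    coeffs_q = np.empty_like(coeffs)
    coeffs_q[:, :r] = quantize(coeffs[:, :r], high_bits)
    coeffs_q[:, r:] = quantize(coeffs[:, r:], low_bits)
    return coeffs_q @ basis.T          # orthonormal basis, so this maps back exactly
```

On synthetic activations with a few dominant directions, this split should give a much smaller reconstruction error than quantizing the whole tensor to 4 bits, because the large coefficients no longer dictate the scale used for the small ones.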

Title: Surrealistic-like Image Generation with Vision-Language Models

Authors: Elif Ayten, Shuai Wang, Hjalmar Snoep
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14366
Pdf URL: https://arxiv.org/pdf/2412.14366
Copy Paste: [[2412.14366]] Surrealistic-like Image Generation with Vision-Language Models(https://arxiv.org/abs/2412.14366)
Keywords: generative
Abstract: Recent advances in generative AI make it convenient to create different types of content, including text, images, and code. In this paper, we explore the generation of images in the style of paintings in the surrealism movement using vision-language generative models, including DALL-E, Deep Dream Generator, and DreamStudio. Our investigation starts with the generation of images under various image generation settings and different models. The primary objective is to identify the most suitable model and settings for producing such images. Additionally, we aim to understand the impact of using edited base images on the resulting images. Through these experiments, we evaluate the performance of selected models and gain valuable insights into their capabilities in generating such images. Our analysis shows that DALL-E 2 performs the best when using the prompt generated by ChatGPT.

Title: Memorization Over Reasoning? Exposing and Mitigating Verbatim Memorization in Large Language Models' Character Understanding Evaluation

Authors: Yuxuan Jiang, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14368
Pdf URL: https://arxiv.org/pdf/2412.14368
Copy Paste: [[2412.14368]] Memorization Over Reasoning? Exposing and Mitigating Verbatim Memorization in Large Language Models' Character Understanding Evaluation(https://arxiv.org/abs/2412.14368)
Keywords: large language model
Abstract: Recently, Large Language Models (LLMs) have shown impressive performance in character understanding tasks, such as analyzing the roles, personalities, and relationships of fictional characters. However, the extensive pre-training corpora used by LLMs raise concerns that they may rely on memorizing popular fictional works rather than genuinely understanding and reasoning about them. In this work, we argue that 'gist memory' - capturing essential meaning - should be the primary mechanism for character understanding tasks, as opposed to 'verbatim memory' - exact match of a string. We introduce a simple yet effective method to mitigate mechanized memorization in character understanding evaluations while preserving the essential implicit cues needed for comprehension and reasoning. Our approach reduces memorization-driven performance on popular fictional works from 96% accuracy to 72% and results in up to an 18% drop in accuracy across various character understanding tasks. These findings underscore the issue of data contamination in existing benchmarks, which often measure memorization rather than true character understanding.

Title: SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting

Authors: Arthur Josi, Luiz Gustavo Hafemann, Abdallah Dib, Emeline Got, Rafael M. O. Cruz, Marc-Andre Carbonneau
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14371
Pdf URL: https://arxiv.org/pdf/2412.14371
Copy Paste: [[2412.14371]] SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting(https://arxiv.org/abs/2412.14371)
Keywords: robust
Abstract: Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. It first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss. Then we train a model to predict expression from monocular images using a novel semi-supervised scheme that relies on domain adaptation. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to novel identities.

Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Authors: William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Subjects: cs.CL, eess.SP
Abstract URL: https://arxiv.org/abs/2412.14373
Pdf URL: https://arxiv.org/pdf/2412.14373
Copy Paste: [[2412.14373]] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling(https://arxiv.org/abs/2412.14373)
Keywords: interpretability, generative, large language model
Abstract: Large Language Models (LLMs) have shown remarkable adaptability across domains beyond text, specifically electrocardiograms (ECGs). More specifically, there is a growing body of work exploring the task of generating text from a multi-channeled ECG and corresponding textual prompt. Current approaches typically involve pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective and using the features output by the pretrained encoder to finetune an LLM for natural language generation (NLG). However, these methods are limited by 1) inefficiency from two-stage training and 2) interpretability challenges with encoder-generated features. To address these limitations, we introduce ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. This approach compresses and encodes ECG signals into tokens, enabling end-to-end LLM training by combining ECG and text tokens directly, while being much more interpretable since the ECG tokens can be directly mapped back to the original signal. Using ECG-Byte, we achieve competitive performance in NLG tasks in only half the time and ~48% of the data required by two-stage approaches.
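
Note: the general idea of applying byte pair encoding to a discretized signal can be sketched as follows. This is a toy illustration, not the ECG-Byte pipeline: the function names, the uniform amplitude binning, and the stopping rule are assumptions. Each token is stored as a tuple of base symbols, so decoding back to the quantized signal is a simple flattening step (recovering the original real-valued samples is only possible up to the quantization error).

```python
from collections import Counter

def quantize_signal(signal, n_levels=8):
    """Map real-valued samples to integer symbols in [0, n_levels - 1] (uniform bins, assumed)."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0
    return [min(n_levels - 1, int((v - lo) / span * n_levels)) for v in signal]

def apply_merge(tokens, a, b):
    """Replace every non-overlapping adjacent occurrence of (a, b) with the merged token a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_bpe(symbols, n_merges):
    """Greedily merge the most frequent adjacent token pair, up to n_merges times."""
    tokens = [(s,) for s in symbols]
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges, tokens

def decode(tokens):
    """Flatten merged tokens back into the sequence of base symbols."""
    return [s for tok in tokens for s in tok]
```

For a periodic toy input such as [0, 1, 0, 1, 0, 1, 0, 1, 5], the repeated pattern collapses into a few long tokens while decode still returns the full symbol sequence.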

Title: Differentially Private Multi-objective Selection: Pareto and Aggregation Approaches

Authors: Victor A. E. Farias, Felipe T. Brito, Cheryl Flynn, Javam C. Machado, Divesh Srivastava
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.14380
Pdf URL: https://arxiv.org/pdf/2412.14380
Copy Paste: [[2412.14380]] Differentially Private Multi-objective Selection: Pareto and Aggregation Approaches(https://arxiv.org/abs/2412.14380)
Keywords: privacy
Abstract: Differentially private selection mechanisms are fundamental building blocks for privacy-preserving data analysis. While numerous mechanisms exist for single-objective selection, many real-world applications require optimizing multiple competing objectives simultaneously. We present two novel mechanisms for differentially private multi-objective selection: PrivPareto and PrivAgg. PrivPareto uses a novel Pareto score to identify solutions near the Pareto frontier, while PrivAgg enables privacy-preserving weighted aggregation of multiple objectives. Both mechanisms support global and local sensitivity approaches, with comprehensive theoretical analysis showing how to compose sensitivities of multiple utility functions. We demonstrate the practical applicability through two real-world applications: cost-sensitive decision tree construction and multi-objective influential node selection in social networks. The experimental results showed that our local sensitivity-based approaches achieve significantly better utility compared to global sensitivity approaches across both applications and both Pareto and Aggregation approaches. Moreover, the local sensitivity-based approaches are able to perform well with typical privacy budget values $\epsilon \in [0.01, 1]$ in most experiments.

Title: Nemesis: Noise-randomized Encryption with Modular Efficiency and Secure Integration in Machine Learning Systems

Authors: Dongfang Zhao
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14392
Pdf URL: https://arxiv.org/pdf/2412.14392
Copy Paste: [[2412.14392]] Nemesis: Noise-randomized Encryption with Modular Efficiency and Secure Integration in Machine Learning Systems(https://arxiv.org/abs/2412.14392)
Keywords: secure, security, privacy
Abstract: Machine learning (ML) systems that guarantee security and privacy often rely on Fully Homomorphic Encryption (FHE) as a cornerstone technique, enabling computations on encrypted data without exposing sensitive information. However, a critical limitation of FHE is its computational inefficiency, making it impractical for large-scale applications. In this work, we propose Nemesis, a framework that accelerates FHE-based systems without compromising accuracy or security. The design of Nemesis is inspired by Rache (SIGMOD'23), which introduced a caching mechanism for encrypted integers and scalars. Nemesis extends this idea with more advanced caching techniques and mathematical tools, enabling efficient operations over multi-slot FHE schemes and overcoming Rache's limitations to support general plaintext structures. We formally prove the security of Nemesis under standard cryptographic assumptions and evaluate its performance extensively on widely used datasets, including MNIST, FashionMNIST, and CIFAR-10. Experimental results show that Nemesis significantly reduces the computational overhead of FHE-based ML systems, paving the way for broader adoption of privacy-preserving technologies.

Title: Enhancing Fingerprint Recognition Systems: Comparative Analysis of Biometric Authentication Algorithms and Techniques for Improved Accuracy and Reliability

Authors: Temirlan Meiramkhanov, Arailym Tleubayeva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14404
Pdf URL: https://arxiv.org/pdf/2412.14404
Copy Paste: [[2412.14404]] Enhancing Fingerprint Recognition Systems: Comparative Analysis of Biometric Authentication Algorithms and Techniques for Improved Accuracy and Reliability(https://arxiv.org/abs/2412.14404)
Keywords: security, robust, biometric, extraction
Abstract: Fingerprint recognition systems stand as pillars in the realm of biometric authentication, providing indispensable security measures across various domains. This study investigates integrating Convolutional Neural Networks (CNNs) with Gabor filters to improve fingerprint recognition accuracy and robustness. Leveraging a diverse dataset sourced from the Sokoto Coventry Fingerprint Dataset, our experiments meticulously evaluate the efficacy of different classification algorithms. Our findings underscore the supremacy of CNN-based approaches, boasting an impressive overall accuracy of 94%. Furthermore, the amalgamation of Gabor filters with CNN architectures unveils promising strides in discerning altered fingerprints, illuminating new pathways for enhancing biometric authentication systems. While the CNN-Gabor fusion showcases commendable performance, our exploration of hybrid approaches combining multiple classifiers reveals nuanced outcomes. Despite these mixed results, our study illuminates the transformative potential of deep learning methodologies in reshaping the landscape of fingerprint recognition. Through rigorous experimentation and insightful analysis, this research not only contributes to advancing biometric authentication technologies but also sheds light on the intricate interplay between traditional feature extraction methods and cutting-edge deep learning architectures. These findings offer actionable insights for optimizing fingerprint recognition systems for real-world deployment, paving the way for enhanced security and reliability in diverse applications.

Title: DriveGPT: Scaling Autoregressive Behavior Models for Driving

Authors: Xin Huang, Eric M. Wolff, Paul Vernaza, Tung Phan-Minh, Hongge Chen, David S. Hayden, Mark Edmonds, Brian Pierce, Xinxin Chen, Pratik Elias Jacob, Xiaobai Chen, Chingiz Tairbekov, Pratik Agarwal, Tianshi Gao, Yuning Chai, Siddhartha Srinivasa
Subjects: cs.LG, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.14415
Pdf URL: https://arxiv.org/pdf/2412.14415
Copy Paste: [[2412.14415]] DriveGPT: Scaling Autoregressive Behavior Models for Driving(https://arxiv.org/abs/2412.14415)
Keywords: transformer
Abstract: We present DriveGPT, a scalable behavior model for autonomous driving. We model driving as a sequential decision making task, and learn a transformer model to predict future agent states as tokens in an autoregressive fashion. We scale up our model parameters and training data by multiple orders of magnitude, enabling us to explore the scaling properties in terms of dataset size, model parameters, and compute. We evaluate DriveGPT across different scales in a planning task, through both quantitative metrics and qualitative examples including closed-loop driving in complex real-world scenarios. In a separate prediction task, DriveGPT outperforms a state-of-the-art baseline and exhibits improved performance by pretraining on a large-scale dataset, further validating the benefits of data scaling.

Title: Enhancing Diffusion Models for High-Quality Image Generation

Authors: Jaineet Shah, Michael Gromis, Rickston Pinto
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14422
Pdf URL: https://arxiv.org/pdf/2412.14422
Copy Paste: [[2412.14422]] Enhancing Diffusion Models for High-Quality Image Generation(https://arxiv.org/abs/2412.14422)
Keywords: diffusion, generative
Abstract: This report presents the comprehensive implementation, evaluation, and optimization of Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs), which are state-of-the-art generative models. During inference, these models take random noise as input and iteratively generate high-quality images as output. The study focuses on enhancing their generative capabilities by incorporating advanced techniques such as Classifier-Free Guidance (CFG), Latent Diffusion Models with Variational Autoencoders (VAE), and alternative noise scheduling strategies. The motivation behind this work is the growing demand for efficient and scalable generative AI models that can produce realistic images across diverse datasets, addressing challenges in applications such as art creation, image synthesis, and data augmentation. Evaluations were conducted on datasets including CIFAR-10 and ImageNet-100, with a focus on improving inference speed, computational efficiency, and image quality metrics like Fréchet Inception Distance (FID). Results demonstrate that DDIM + CFG achieves faster inference and superior image quality. Challenges with VAE and noise scheduling are also highlighted, suggesting opportunities for future optimization. This work lays the groundwork for developing scalable, efficient, and high-quality generative AI systems to benefit industries ranging from entertainment to robotics.
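
Note: the DDIM + CFG combination named in the abstract rests on two standard update rules, sketched below in NumPy. This is a minimal illustration rather than the report's code; the function names are assumptions, the guidance convention used here is eps = eps_uncond + w * (eps_cond - eps_uncond), and the DDIM step is the deterministic (eta = 0) variant written in terms of the cumulative noise schedule alpha_bar.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from noise level alpha_bar_t to alpha_bar_prev.

    First estimate the clean sample x0 from the predicted noise, then
    re-noise that estimate to the (lower) noise level of the previous step.
    """
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps
```

In a sampler, the noise prediction passed to ddim_step would be the output of cfg_noise applied to two model evaluations (with and without the conditioning signal). Because the step is deterministic, it can skip timesteps, which is the usual source of the faster inference attributed to DDIM.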

Title: FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning

Authors: Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14424
Pdf URL: https://arxiv.org/pdf/2412.14424
Copy Paste: [[2412.14424]] FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning(https://arxiv.org/abs/2412.14424)
Keywords: privacy, federate
Abstract: Large Vision-Language Models typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these models on end-user devices, such as in medical clinics, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough for fully fine-tuning large VLMs on their own. A naive solution to these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients resulting in suboptimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters in the server and global Adapters in the clients exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layerwise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments utilizing 48 medical image datasets across five different medical vision-language FL task settings encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines.

Title: All-in-One Tuning and Structural Pruning for Domain-Specific LLMs

Authors: Lei Lu, Zhepeng Wang, Ruexue Bao, Mengbing Wang, Fangyi Li, Yawen Wu, Weiwen Jiang, Jie Xu, Yanzhi Wang, Shangqian Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14426
Pdf URL: https://arxiv.org/pdf/2412.14426
Copy Paste: [[2412.14426]] All-in-One Tuning and Structural Pruning for Domain-Specific LLMs(https://arxiv.org/abs/2412.14426)
Keywords: large language model
Abstract: Existing pruning techniques for large language models (LLMs) targeting domain-specific applications typically follow a two-stage process: pruning the pretrained general-purpose LLMs and then fine-tuning the pruned LLMs on specific domains. However, the pruning decisions, derived from the pretrained weights, remain unchanged during fine-tuning, even if the weights have been updated. Therefore, such a combination of the pruning decisions and the finetuned weights may be suboptimal, leading to non-negligible performance degradation. To address these limitations, we propose ATP: All-in-One Tuning and Structural Pruning, a unified one-stage structural pruning and fine-tuning approach that dynamically identifies the current optimal substructure throughout the fine-tuning phase via a trainable pruning decision generator. Moreover, given the limited available data for domain-specific applications, Low-Rank Adaptation (LoRA) becomes a common technique to fine-tune the LLMs. In ATP, we introduce LoRA-aware forward and sparsity regularization to ensure that the substructures corresponding to the learned pruning decisions can be directly removed after the ATP process. ATP outperforms the state-of-the-art two-stage pruning methods on tasks in the legal and healthcare domains. More specifically, ATP recovers up to 88% and 91% performance of the dense model when pruning 40% of the parameters of LLaMA2-7B and LLaMA3-8B models, respectively.

Title: IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

Authors: Anand Kumar, Jiteng Mu, Nuno Vasconcelos
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14432
Pdf URL: https://arxiv.org/pdf/2412.14432
Copy Paste: [[2412.14432]] IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features(https://arxiv.org/abs/2412.14432)
Keywords: privacy, extraction, diffusion
Abstract: Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns regarding data privacy and copyright infringement among artists. Consequently, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as introspective style attribution (IntroStyle) and demonstrates superior performance to state-of-the-art models for style retrieval. We also introduce a synthetic dataset of Style Hacks (SHacks) to isolate artistic style and evaluate fine-grained style attribution performance.

Title: Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine

Authors: Luis Roque, Carlos Soares, Vitor Cerqueira, Luis Torgo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14435
Pdf URL: https://arxiv.org/pdf/2412.14435
Copy Paste: [[2412.14435]] Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine(https://arxiv.org/abs/2412.14435)
Keywords: robust
Abstract: The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets - what most studies report - 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.
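
Note: the effect described in the abstract, where the choice of evaluation datasets decides which method looks best, can be illustrated with a small enumeration. This sketch is not the paper's protocol: the function name, the use of mean error as the ranking criterion, and the toy error values in the usage note are all made up for illustration.

```python
from itertools import combinations

def methods_that_can_win(errors, subset_size):
    """Return the set of methods that rank first on at least one dataset subset.

    errors maps a method name to a list of per-dataset errors (lower is better).
    Every subset of the given size is tried, and the method with the lowest
    mean error on that subset is recorded as its winner.
    """
    n_datasets = len(next(iter(errors.values())))
    winners = set()
    for subset in combinations(range(n_datasets), subset_size):
        mean_err = {m: sum(e[i] for i in subset) / subset_size
                    for m, e in errors.items()}
        winners.add(min(mean_err, key=mean_err.get))
    return winners
```

With hypothetical errors {"A": [1, 5, 5, 5], "B": [5, 1, 5, 5], "C": [3, 3, 3, 3]}, method C is the only winner when all four datasets are used, yet every method can be presented as the best if a single dataset is chosen selectively.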

Title: ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

Authors: Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14436
Pdf URL: https://arxiv.org/pdf/2412.14436
Copy Paste: [[2412.14436]] ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study(https://arxiv.org/abs/2412.14436)
Keywords: large language model
Abstract: Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement of data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model at this https URL.

Title: GenHMR: Generative Human Mesh Recovery

Authors: Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14444
Pdf URL: https://arxiv.org/pdf/2412.14444
Copy Paste: [[2412.14444]] GenHMR: Generative Human Mesh Recovery(https://arxiv.org/abs/2412.14444)
Keywords: transformer, generative
Abstract: Human mesh recovery (HMR) is crucial in many computer vision applications, from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer to convert 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with a randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. Project website can be found at this https URL.

Title: Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation

Authors: Shengqi Liu, Yuhao Cheng, Zhuo Chen, Xingyu Ren, Wenhan Zhu, Lincheng Li, Mengxiao Bi, Xiaokang Yang, Yichao Yan
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14453
Pdf URL: https://arxiv.org/pdf/2412.14453
Copy Paste: [[2412.14453]] Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation(https://arxiv.org/abs/2412.14453)
Keywords: diffusion, generative
Abstract: Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and various body adaptability. Our project page: this https URL.

Title: LEDiff: Latent Exposure Diffusion for HDR Generation

Authors: Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14456
Pdf URL: https://arxiv.org/pdf/2412.14456
Copy Paste: [[2412.14456]] LEDiff: Latent Exposure Diffusion for HDR Generation(https://arxiv.org/abs/2412.14456)
Keywords: diffusion, generative
Abstract: While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that enables a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.

Title: From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research

Authors: Xiang Cheng, Raveesh Mayya, João Sedoc
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14461
Pdf URL: https://arxiv.org/pdf/2412.14461
Copy Paste: [[2412.14461]] From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research(https://arxiv.org/abs/2412.14461)
Keywords: robust, generative, large language model
Abstract: Unstructured text data annotation and analysis are fundamental to management research, often relying on human annotators through crowdsourcing platforms. While Large Language Models (LLMs) promise to provide a cost-effective and efficient alternative to human annotation, there is no systematic workflow that evaluates when LLMs are suitable or how to proceed with LLM-based text annotation in a reproducible manner. This paper addresses this methodological gap by introducing the "SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection, addressing challenges such as developing robust annotation guidelines, establishing high-quality human baselines, optimizing prompts, and ensuring reproducibility across LLMs. We validate the SILICON workflow through seven case studies covering common management research tasks, including business proposal evaluation, dialog intent and breakdown analysis, and review attribute detection. Our findings highlight the importance of validating annotation guideline agreement, the superiority of expert-developed human baselines over crowdsourced ones, the iterative nature of prompt optimization, and the necessity of testing multiple LLMs. Notably, we propose a regression-based methodology to empirically compare LLM outputs across prompts and models. Our workflow advances management research by establishing reproducible processes for LLM-based annotation that maintain scientific rigor. We provide practical guidance for researchers to effectively navigate the evolving landscape of generative AI tools while maintaining transparency and reproducibility.

Title: Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

Authors: Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14462
Pdf URL: https://arxiv.org/pdf/2412.14462
Copy Paste: [[2412.14462]] Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion(https://arxiv.org/abs/2412.14462)
Keywords: diffusion
Abstract: As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene with various position prompts. To address the limited data issue and incorporate this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively facilitates the notion of affordance. Extensive experimental results show that our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images. Please refer to our code on this https URL.

Title: LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations

Authors: Tung Do, Thuan Hoang Nguyen, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.14464
Pdf URL: https://arxiv.org/pdf/2412.14464
Copy Paste: [[2412.14464]] LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations(https://arxiv.org/abs/2412.14464)
Keywords: diffusion
Abstract: We propose a new view synthesis method that synthesizes a 3D neural field from either single or few-view input images. To address the ill-posed nature of the image-to-3D generation problem, we devise a two-stage method that involves a reconstruction model and a diffusion model for view synthesis. Our reconstruction model first lifts one or more input images to the 3D space from a volume as the coarse-scale 3D representation followed by a tri-plane as the fine-scale 3D representation. To mitigate the ambiguity in occluded regions, our diffusion model then hallucinates missing details in the rendered images from tri-planes. We then introduce a new progressive refinement technique that iteratively applies the reconstruction and diffusion model to gradually synthesize novel views, boosting the overall quality of the 3D representations and their rendering. Empirical evaluation demonstrates the superiority of our method over state-of-the-art methods on the synthetic SRN-Car dataset, the in-the-wild CO3D dataset, and the large-scale Objaverse dataset while achieving both sampling efficacy and multi-view consistency.

Title: DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

Authors: Wengyi Zhan, Mingbao Lin, Shuicheng Yan, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14465
Pdf URL: https://arxiv.org/pdf/2412.14465
Copy Paste: [[2412.14465]] DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On(https://arxiv.org/abs/2412.14465)
Keywords: diffusion
Abstract: We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion model. This initial foray into the application of untrained diffusion models in virtual try-on technology potentially paves the way for further exploration and refinement in this industrially and academically valuable field.

Title: Towards Provable Security in Industrial Control Systems Via Dynamic Protocol Attestation

Authors: Arthur Amorim, Trevor Kann, Max Taylor, Lance Joneckis
Subjects: cs.CR, cs.FL, cs.LO, cs.PL
Abstract URL: https://arxiv.org/abs/2412.14467
Pdf URL: https://arxiv.org/pdf/2412.14467
Copy Paste: [[2412.14467]] Towards Provable Security in Industrial Control Systems Via Dynamic Protocol Attestation(https://arxiv.org/abs/2412.14467)
Keywords: security, attack
Abstract: Industrial control systems (ICSs) increasingly rely on digital technologies vulnerable to cyber attacks. Cyber attackers can infiltrate ICSs and execute malicious actions. Individually, each action seems innocuous. But taken together, they cause the system to enter an unsafe state. These attacks have resulted in dramatic consequences such as physical damage, economic loss, and environmental catastrophes. This paper introduces a methodology that restricts actions using protocols. These protocols only allow safe actions to execute. Protocols are written in a domain specific language we have embedded in an interactive theorem prover (ITP). The ITP enables formal, machine-checked proofs to ensure protocols maintain safety properties. We use dynamic attestation to ensure ICSs conform to their protocol even if an adversary compromises a component. Since protocol conformance prevents unsafe actions, the previously mentioned cyber attacks become impossible. We demonstrate the effectiveness of our methodology using an example from the Fischertechnik Industry 4.0 platform. We measure dynamic attestation's impact on latency and throughput. Our approach is a starting point for studying how to combine formal methods and protocol design to thwart attacks intended to cripple ICSs.

Title: Agent-SafetyBench: Evaluating the Safety of LLM Agents

Authors: Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14470
Pdf URL: https://arxiv.org/pdf/2412.14470
Copy Paste: [[2412.14470]] Agent-SafetyBench: Evaluating the Safety of LLM Agents(https://arxiv.org/abs/2412.14470)
Keywords: defense, robust, large language model
Abstract: As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at this https URL to facilitate further research and innovation in agent safety evaluation and improvement.

Title: Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs

Authors: Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14471
Pdf URL: https://arxiv.org/pdf/2412.14471
Copy Paste: [[2412.14471]] Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs(https://arxiv.org/abs/2412.14471)
Keywords: large language model
Abstract: Why do we build local large language models (LLMs)? What should a local LLM learn from the target language? Which abilities can be transferred from other languages? Do language-specific scaling laws exist? To explore these research questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and English, taking Japanese as a local language. Adopting an observational approach, we analyzed correlations of benchmark scores, and conducted principal component analysis (PCA) on the scores to derive "ability factors" of local LLMs. We found that training on English text can improve the scores of academic subjects in Japanese (JMMLU). In addition, it is unnecessary to specifically train on Japanese text to enhance abilities for solving Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension tasks. In contrast, training on Japanese text could improve question-answering tasks about Japanese knowledge and English-Japanese translation, which indicates that abilities for solving these two tasks can be regarded as "Japanese abilities" for LLMs. Furthermore, we confirmed that the Japanese abilities scale with the computational budget for Japanese text.
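The PCA step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a score matrix with one row per model and one column per benchmark; the function name `pca_ability_factors`, the column standardization, and the toy numbers are illustrative assumptions, not the authors' pipeline or data.

```python
import numpy as np

def pca_ability_factors(scores: np.ndarray, n_components: int = 2):
    """PCA via SVD on a (n_models, n_benchmarks) score matrix.

    Returns (components, projections, explained_variance_ratio).
    Standardizing each benchmark column is an assumption of this sketch.
    """
    centered = scores - scores.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    z = centered / std
    # Rows of vt are the principal directions in benchmark space.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    components = vt[:n_components]
    projections = z @ components.T
    return components, projections, explained[:n_components]

# Hypothetical toy scores: 5 models x 3 benchmarks (made-up numbers).
toy_scores = np.array([
    [0.60, 0.55, 0.20],
    [0.70, 0.68, 0.25],
    [0.40, 0.42, 0.80],
    [0.50, 0.47, 0.75],
    [0.65, 0.60, 0.50],
])
components, projections, ratio = pca_ability_factors(toy_scores)
```

Each row of `components` weights the benchmarks; benchmarks that load heavily on the same component move together across models, which is one way to read a component as an ability factor.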

Title: Promptable Representation Distribution Learning and Data Augmentation for Gigapixel Histopathology WSI Analysis

Authors: Kunming Tang, Zhiguo Jiang, Jun Shi, Wei Wang, Haibo Wu, Yushan Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14473
Pdf URL: https://arxiv.org/pdf/2412.14473
Copy Paste: [[2412.14473]] Promptable Representation Distribution Learning and Data Augmentation for Gigapixel Histopathology WSI Analysis(https://arxiv.org/abs/2412.14473)
Keywords: robust
Abstract: Gigapixel image analysis, particularly for whole slide images (WSIs), often relies on multiple instance learning (MIL). Under the paradigm of MIL, patch image representations are extracted and then fixed during the training of the MIL classifiers for efficiency considerations. However, the invariance of representations makes it difficult to perform data augmentation for WSI-level model training, which significantly limits the performance of the downstream WSI analysis. The current data augmentation methods for gigapixel images either introduce additional computational costs or result in a loss of semantic information, which makes it hard to meet the requirements for efficiency and stability needed for WSI model training. In this paper, we propose a Promptable Representation Distribution Learning framework (PRDL) for both patch-level representation learning and WSI-level data augmentation. Meanwhile, we explore the use of prompts to guide data augmentation in feature space, which achieves promptable data augmentation for training robust WSI-level models. The experimental results have demonstrated that the proposed method stably outperforms state-of-the-art methods.

Title: DirectorLLM for Human-Centric Video Generation

Authors: Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14484
Pdf URL: https://arxiv.org/pdf/2412.14484
Copy Paste: [[2412.14484]] DirectorLLM for Human-Centric Video Generation(https://arxiv.org/abs/2412.14484)
Keywords: large language model
Abstract: In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.

Title: MAIDS: Malicious Agent Identification-based Data Security Model for Cloud Environments

Authors: Kishu Gupta, Deepika Saxena, Rishabh Gupta, Ashutosh Kumar Singh
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14490
Pdf URL: https://arxiv.org/pdf/2412.14490
Copy Paste: [[2412.14490]] MAIDS: Malicious Agent Identification-based Data Security Model for Cloud Environments(https://arxiv.org/abs/2412.14490)
Keywords: security, protect
Abstract: With the vigorous development of cloud computing, most organizations have shifted their data and applications to the cloud environment for storage, computation, and sharing purposes. During storage and data sharing across the participating entities, a malicious agent may gain access to outsourced data from the cloud environment. A malicious agent is an entity that deliberately breaches the data. The information accessed might be misused or revealed to unauthorized parties. Therefore, data protection and prediction of malicious agents have become a demanding task that needs to be addressed appropriately. To deal with this crucial and challenging issue, this paper presents a Malicious Agent Identification-based Data Security (MAIDS) Model which utilizes the XGBoost machine learning classification algorithm for securing data allocation and communication among different participating entities in the cloud system. The proposed model explores and computes multiple security parameters associated with online data communication or transactions. Correspondingly, a security-focused knowledge database is produced for developing the XGBoost Classifier-based Malicious Agent Prediction (XC-MAP) unit. Unlike the existing approaches, which only identify malicious agents after data leaks, MAIDS proactively identifies malicious agents by examining their eligibility for respective data access. In this way, the model provides a comprehensive solution to safeguard crucial data from both intentional and non-intentional breaches, granting data to authorized agents only after evaluating the agent's behavior and predicting whether the agent is malicious.

Title: Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

Authors: Chuang Lin, Bingbing Zhuang, Shanlin Sun, Ziyu Jiang, Jianfei Cai, Manmohan Chandraker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14494
Pdf URL: https://arxiv.org/pdf/2412.14494
Copy Paste: [[2412.14494]] Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles(https://arxiv.org/abs/2412.14494)
Keywords: diffusion
Abstract: The recent advent of large-scale 3D data, e.g. Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops significantly when applied to real-world images. This paper consolidates a set of good practices to finetune large pretrained models for a real-world task -- harvesting vehicle assets for autonomous driving applications. To this end, we delve into the discrepancies between the synthetic data and real driving data, then develop several strategies to account for them properly. Specifically, we start with a virtual camera rotation of real images to ensure geometric alignment with synthetic data and consistency with the pose manifold defined by pretrained models. We also identify important design choices in object-centric data curation to account for varying object distances in real driving scenes -- learn across varying object scales with fixed camera focal length. Further, we perform occlusion-aware training in latent spaces to account for ubiquitous occlusions in real data, and handle large viewpoint changes by leveraging a symmetric prior. Our insights lead to effective finetuning that results in a 68.8% reduction in FID for novel view synthesis over prior art.

Title: FedMUP: Federated Learning driven Malicious User Prediction Model for Secure Data Distribution in Cloud Environments

Authors: Kishu Gupta, Deepika Saxena, Rishabh Gupta, Jatinder Kumar, Ashutosh Kumar Singh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.14495
Pdf URL: https://arxiv.org/pdf/2412.14495
Copy Paste: [[2412.14495]] FedMUP: Federated Learning driven Malicious User Prediction Model for Secure Data Distribution in Cloud Environments(https://arxiv.org/abs/2412.14495)
Keywords: secure, security, federate
Abstract: Cloud computing is flourishing at a rapid pace. Significant data security consequences arise because a malicious user may gain unauthorized access to sensitive data, which may then be misused. This makes it urgent to tackle the crucial issue of data security and proactive malicious user prediction. This article proposes a Federated learning driven Malicious User Prediction Model for Secure Data Distribution in Cloud Environments (FedMUP). This approach first analyses user behavior to acquire multiple security risk parameters. Afterward, it employs the federated learning-driven malicious user prediction approach to proactively reveal suspicious users. FedMUP trains local models on their local datasets and transfers computed values rather than actual raw data to obtain an updated global model based on averaging the various local versions. This updated model is shared repeatedly at regular intervals with the user for retraining to acquire a better and more efficient model capable of predicting malicious users more precisely. Extensive experimental work and comparison of the proposed model with state-of-the-art approaches demonstrate the efficiency of the proposed work. Significant improvement is observed in the key performance indicators such as malicious user prediction accuracy, precision, recall, and F1-score, up to 14.32%, 17.88%, 14.32%, and 18.35%, respectively.
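The aggregation step the abstract describes, in which clients send computed values instead of raw data and the global model is an average of local versions, can be sketched as follows. Weighting each client by its sample count (as in standard federated averaging) is an assumption here, since the abstract only says "averaging"; the function name and toy values are hypothetical.

```python
import numpy as np

def federated_average(local_weights, sample_counts):
    """Average client parameter vectors into a global vector.

    local_weights: list of equally shaped NumPy arrays, one per client.
    sample_counts: number of local training samples per client.
    Weighting by sample count is an assumption of this sketch.
    """
    coeffs = np.asarray(sample_counts, dtype=float)
    coeffs = coeffs / coeffs.sum()
    stacked = np.stack(local_weights)          # shape: (n_clients, ...)
    return np.tensordot(coeffs, stacked, axes=1)

# Hypothetical toy round: two clients holding 1 and 3 samples.
global_weights = federated_average(
    [np.array([1.0, 2.0]), np.array([3.0, 4.0])],
    [1, 3],
)
```

Only the parameter arrays cross the network in this scheme; the raw behavior logs used for local training stay on each client.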

Title: Content-style disentangled representation for controllable artistic image stylization and generation

Authors: Ma Zhuoqi, Zhang Yixuan, You Zejun, Tian Long, Liu Xiyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14496
Pdf URL: https://arxiv.org/pdf/2412.14496
Copy Paste: [[2412.14496]] Content-style disentangled representation for controllable artistic image stylization and generation(https://arxiv.org/abs/2412.14496)
Keywords: diffusion
Abstract: Controllable artistic image stylization and generation aims to render the content provided by text or image with the learned artistic style, where content and style decoupling is the key to achieving satisfactory results. However, current methods for content and style disentanglement primarily rely on image information for supervision, which leads to two problems: 1) models can only support one modality for style or content input; 2) incomplete disentanglement results in semantic interference from the reference image. To address the above issues, this paper proposes a content-style representation disentangling method for controllable artistic image stylization and generation. We construct a WikiStyle+ dataset consisting of artworks with corresponding textual descriptions for style and content. Based on the multimodal dataset, we propose a diffusion model guided by disentangled content and style representations. The disentangled representations are first learned by Q-Formers and then injected into a pre-trained diffusion model using learnable multi-step cross-attention layers for better controllable stylization. This approach allows the model to accommodate inputs from different modalities. Experimental results show that our method achieves a thorough disentanglement of content and style in reference images under multimodal supervision, thereby enabling a harmonious integration of content and style in the generated outputs, successfully producing style-consistent and expressive stylized images.

Title: Guided Diffusion Model for Sensor Data Obfuscation

Authors: Xin Yang, Omid Ardakanian
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14499
Pdf URL: https://arxiv.org/pdf/2412.14499
Copy Paste: [[2412.14499]] Guided Diffusion Model for Sensor Data Obfuscation(https://arxiv.org/abs/2412.14499)
Keywords: privacy, protect, diffusion, generative
Abstract: Sensor data collected by Internet of Things (IoT) devices carries detailed information about individuals in their vicinity. Sharing this data with a semi-trusted service provider may compromise the individuals' privacy, as sensitive information can be extracted by powerful machine learning models. Data obfuscation empowered by generative models is a promising approach to generate synthetic sensor data such that the useful information contained in the original data is preserved and the sensitive information is obscured. This newly generated data will then be shared with the service provider instead of the original sensor data. In this work, we propose PrivDiffuser, a novel data obfuscation technique based on a denoising diffusion model that attains a superior trade-off between data utility and privacy through effective guidance techniques. Specifically, we extract latent representations that contain information about public and private attributes from sensor data to guide the diffusion model, and impose mutual information-based regularization when learning the latent representations to alleviate the entanglement of public and private attributes, thereby increasing the effectiveness of guidance. Evaluation on three real-world datasets containing different sensing modalities reveals that PrivDiffuser yields a better privacy-utility trade-off than the state-of-the-art obfuscation model, decreasing the utility loss by up to 1.81% and the privacy loss by up to 3.42%. Moreover, we showed that users with diverse privacy needs can use PrivDiffuser to protect their privacy without having to retrain the model.

Title: Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs

Authors: Yuzuki Arai, Sho Tsugawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14501
Pdf URL: https://arxiv.org/pdf/2412.14501
Copy Paste: [[2412.14501]] Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs(https://arxiv.org/abs/2412.14501)
Keywords: large language model
Abstract: The philosophy of language, which has historically been developed through an anthropocentric lens, is now being forced to move towards post-anthropocentrism due to the advent of large language models (LLMs) like ChatGPT (OpenAI) and Claude (Anthropic), which are considered to possess linguistic abilities comparable to those of humans. Traditionally, LLMs have been explained through distributional semantics as their foundational semantics. However, recent research is exploring alternative foundational semantics beyond distributional semantics. This paper proposes Robert Brandom's inferentialist semantics as a suitable foundational semantics for LLMs, specifically focusing on the issue of linguistic representationalism within this post-anthropocentric trend. Here, we show that the anti-representationalism and logical expressivism of inferential semantics, as well as quasi-compositionality, are useful in interpreting the characteristics and behaviors of LLMs. Further, we propose a "consensus theory of truths" for LLMs. This paper argues that the characteristics of LLMs challenge mainstream assumptions in philosophy of language, such as semantic externalism and compositionality. We believe the argument in this paper leads to a re-evaluation of anti-representationalist views of language, potentially leading to new developments in the philosophy of language.

Title: A hybrid framework for effective and efficient machine unlearning

Authors: Mingxin Li, Yizhen Yu, Ning Wang, Zhigang Wang, Xiaodong Wang, Haipeng Qu, Jia Xu, Shen Su, Zhichao Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14505
Pdf URL: https://arxiv.org/pdf/2412.14505
Copy Paste: [[2412.14505]] A hybrid framework for effective and efficient machine unlearning(https://arxiv.org/abs/2412.14505)
Keywords: privacy
Abstract: Machine unlearning (MU) has recently been proposed to remove the imprints of revoked samples from already trained model parameters, addressing users' privacy concerns. As alternatives to expensive retraining from scratch, there are two research lines, exact MU and approximate MU, which favor accuracy and efficiency, respectively. In this paper, we present a novel hybrid strategy on top of them to achieve an overall success. It implements the unlearning operation with an acceptable computation cost, while simultaneously improving the accuracy as much as possible. Specifically, it selects a suitable unlearning technique by estimating the retraining workload caused by revocations. If the workload is lightweight, it performs retraining to derive model parameters consistent with the accurate ones retrained from scratch. Otherwise, it outputs the unlearned model by directly modifying the current parameters, for better efficiency. In particular, to improve the accuracy in the latter case, we propose an optimized version to amend the output model with a lightweight runtime penalty. We particularly study the boundary between the two approaches in our framework to adaptively make the smart selection. Extensive experiments on real datasets validate that our proposals can improve the unlearning efficiency by 1.5× to 8× while achieving comparable accuracy.
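The selection rule in the abstract (estimate the retraining workload, retrain if it is lightweight, otherwise modify the current parameters directly) can be sketched as a simple dispatch. All names here, the fixed `threshold`, and the toy callables are hypothetical placeholders; the abstract does not specify how the workload is estimated or where the boundary lies.

```python
from typing import Callable, List, Sequence, Tuple

def hybrid_unlearn(
    params: List[float],
    revoked: Sequence[int],
    estimate_workload: Callable[[Sequence[int]], float],
    retrain: Callable[[Sequence[int]], List[float]],
    approximate_update: Callable[[List[float], Sequence[int]], List[float]],
    threshold: float,
) -> Tuple[str, List[float]]:
    """Choose exact or approximate unlearning from an estimated workload."""
    workload = estimate_workload(revoked)
    if workload <= threshold:
        # Lightweight case: retrain so the result matches retraining from scratch.
        return "exact", retrain(revoked)
    # Heavy case: edit the current parameters directly for efficiency.
    return "approximate", approximate_update(params, revoked)

# Toy stand-ins (assumptions for illustration only).
def toy_estimate(revoked): return float(len(revoked))
def toy_retrain(revoked): return [0.0, 0.0]
def toy_approx(params, revoked): return [p - 0.01 * len(revoked) for p in params]

mode_small, _ = hybrid_unlearn([1.0, 2.0], [3], toy_estimate, toy_retrain, toy_approx, threshold=2)
mode_large, new_params = hybrid_unlearn([1.0, 2.0], [3, 4, 5], toy_estimate, toy_retrain, toy_approx, threshold=2)
```

In the paper's setting the boundary is studied adaptively; a constant threshold is used here only to keep the sketch short.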

Title: PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization

Authors: Jiayi Wu, Hengyi Cai, Lingyong Yan, Hao Sun, Xiang Li, Shuaiqiang Wang, Dawei Yin, Ming Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14510
Pdf URL: https://arxiv.org/pdf/2412.14510
Copy Paste: [[2412.14510]] PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization(https://arxiv.org/abs/2412.14510)
Keywords: robust, large language model
Abstract: The emergence of Retrieval-augmented generation (RAG) has alleviated the issues of outdated and hallucinatory content in the generation of large language models (LLMs), yet it still reveals numerous limitations. When a general-purpose LLM serves as the RAG generator, it often suffers from inadequate response informativeness, response robustness, and citation quality. Past approaches to tackling these limitations, either by incorporating additional steps beyond generating responses or by optimizing the generator through supervised fine-tuning (SFT), still failed to align with the RAG requirements thoroughly. Consequently, optimizing the RAG generator from multiple preference perspectives while maintaining its end-to-end LLM form remains a challenge. To bridge this gap, we propose Multiple Perspective Preference Alignment for Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator of RAG systems to align with RAG requirements comprehensively. Specifically, we construct high-quality instruction fine-tuning data and multi-perspective preference data by sampling responses of varied quality from the generator across different prompt document quality scenarios. Subsequently, we optimize the generator using SFT and Direct Preference Optimization (DPO). Extensive experiments conducted on four question-answer datasets across three LLMs demonstrate that PA-RAG can significantly enhance the performance of RAG generators. Our code and datasets are available at this https URL.

Title: Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Authors: Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.14516
Pdf URL: https://arxiv.org/pdf/2412.14516
Copy Paste: [[2412.14516]] Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment(https://arxiv.org/abs/2412.14516)
Keywords: large language model
Abstract: We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.
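
The point about relative versus actual reward values can be seen in a small sketch of the standard DPO objective. This shows plain DPO only; the abstract does not spell out Cal-DPO's calibration term, so none is attempted here. The function names and the numeric log-probabilities are illustrative assumptions.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """-log sigmoid(r_w - r_l) for a chosen (w) and a rejected (l) response."""
    margin = (implicit_reward(logp_w, logp_ref_w, beta)
              - implicit_reward(logp_l, logp_ref_l, beta))
    return _softplus(-margin)


# Made-up sequence log-probabilities for one preference pair.
base = dpo_loss(-10.0, -12.0, -15.0, -14.0)
# Lower both policy log-probabilities by 5: both implicit rewards drop,
# but their difference, and therefore the loss, is unchanged.
shifted = dpo_loss(-15.0, -12.0, -20.0, -14.0)
print(base, shifted)
```

Because the loss depends only on the margin, the absolute scale of the two implicit rewards is left unconstrained, which is the limitation the abstract describes.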

Title: Efficient Self-Supervised Video Hashing with Selective State Spaces

Authors: Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2412.14518
Pdf URL: https://arxiv.org/pdf/2412.14518
Copy Paste: [[2412.14518]] Efficient Self-Supervised Video Hashing with Selective State Spaces(https://arxiv.org/abs/2412.14518)
Keywords: transformer
Abstract: Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at this https URL.

Title: CAE-T: A Channelwise AutoEncoder with Transformer for EEG Abnormality Detection

Authors: Youshen Zhao, Keiji Iramina
Subjects: cs.LG, cs.AI, cs.NE, eess.SP
Abstract URL: https://arxiv.org/abs/2412.14522
Pdf URL: https://arxiv.org/pdf/2412.14522
Copy Paste: [[2412.14522]] CAE-T: A Channelwise AutoEncoder with Transformer for EEG Abnormality Detection(https://arxiv.org/abs/2412.14522)
Keywords: interpretability, transformer
Abstract: Electroencephalogram (EEG) signals are critical for detecting abnormal brain activity, but their high dimensionality and complexity pose significant challenges for effective analysis. In this paper, we propose CAE-T, a novel framework that combines a channelwise CNN-based autoencoder with a single-head transformer classifier for efficient EEG abnormality detection. The channelwise autoencoder compresses raw EEG signals while preserving channel independence, reducing computational costs and retaining biologically meaningful features. The compressed representations are then fed into the transformer-based classifier, which efficiently models long-term dependencies to distinguish between normal and abnormal signals. Evaluated on the TUH Abnormal EEG Corpus, the proposed model achieves 85.0% accuracy, 76.2% sensitivity, and 91.2% specificity at the per-case level, outperforming baseline models such as EEGNet, Deep4Conv, and FusionCNN. Furthermore, CAE-T requires only 202M FLOPs and 2.9M parameters, making it significantly more efficient than transformer-based alternatives. The framework retains interpretability through its channelwise design, demonstrating great potential for future applications in neuroscience research and clinical practice. The source code is available at this https URL.
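
For readers less familiar with the three reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from a binary confusion matrix. The counts are invented for illustration and are not taken from the paper; "abnormal" is assumed to be the positive class.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity (recall on positives), specificity (recall on negatives)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 100 abnormal cases and 100 normal cases.
metrics = binary_metrics(tp=80, fn=20, tn=90, fp=10)
print(metrics)
```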

Title: Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Authors: Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14528
Pdf URL: https://arxiv.org/pdf/2412.14528
Copy Paste: [[2412.14528]] Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models(https://arxiv.org/abs/2412.14528)
Keywords: robust, generative, large language model
Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.
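
As background for the sequence-level term, here is a minimal pure-Python sketch of the Sinkhorn iteration for entropy-regularized optimal transport between two discrete distributions. It is a generic textbook version, not the MultiLevelOT implementation; the cost matrix, the regularization strength `reg`, and the iteration count are assumptions chosen for the toy example.

```python
import math


def sinkhorn_distance(a, b, cost, reg=0.1, n_iters=200):
    """Return (transport cost, plan) for histograms a and b under `cost`."""
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / reg).
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan


cost = [[0.0, 1.0], [1.0, 0.0]]
same, _ = sinkhorn_distance([0.5, 0.5], [0.5, 0.5], cost)
moved, plan = sinkhorn_distance([0.7, 0.3], [0.3, 0.7], cost)
print(same, moved)
```

With a small `reg` the result approaches the unregularized transport cost; here 0.4 units of mass have to move between the two bins in the second call.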

Title: Leveraging Time Series Categorization and Temporal Fusion Transformers to Improve Cryptocurrency Price Forecasting

Authors: Arash Peik, Mohammad Ali Zare Chahooki, Amin Milani Fard, Mehdi Agha Sarram
Subjects: cs.LG, cs.CE, q-fin.ST
Abstract URL: https://arxiv.org/abs/2412.14529
Pdf URL: https://arxiv.org/pdf/2412.14529
Copy Paste: [[2412.14529]] Leveraging Time Series Categorization and Temporal Fusion Transformers to Improve Cryptocurrency Price Forecasting(https://arxiv.org/abs/2412.14529)
Keywords: transformer
Abstract: Organizing and managing cryptocurrency portfolios and decision-making on transactions is crucial in this market. Optimal selection of assets is one of the main challenges that requires accurate prediction of the price of cryptocurrencies. In this work, we categorize the financial time series into several similar subseries to increase prediction accuracy by learning each subseries category with similar behavior. For each category of the subseries, we create a deep learning model based on the attention mechanism to predict the next step of each subseries. Due to the limited amount of cryptocurrency data for training models, if the number of categories increases, the amount of training data for each model will decrease, and some complex models will not be trained well due to the large number of parameters. To overcome this challenge, we propose to combine the time series data of other cryptocurrencies to increase the amount of data for each category, hence increasing the accuracy of the models corresponding to each category.

Title: Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Authors: Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14531
Pdf URL: https://arxiv.org/pdf/2412.14531
Copy Paste: [[2412.14531]] Consistent Human Image and Video Generation with Spatially Conditioned Diffusion(https://arxiv.org/abs/2412.14531)
Keywords: extraction, diffusion
Abstract: Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
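
The causal feature interaction rule (reference features query only themselves, target features query both) can be expressed as an attention mask. The sketch below assumes tokens are ordered with all reference tokens first and all target tokens after them; that layout and the function name are assumptions, not details given in the abstract.

```python
def causal_reference_mask(n_ref: int, n_tgt: int) -> list:
    """mask[q][k] is True when query token q may attend to key token k.

    Token order is assumed to be [reference tokens..., target tokens...].
    """
    n = n_ref + n_tgt
    mask = []
    for q in range(n):
        q_is_ref = q < n_ref
        row = []
        for k in range(n):
            k_is_ref = k < n_ref
            # Reference queries see reference keys only; target queries see everything.
            row.append(k_is_ref if q_is_ref else True)
        mask.append(row)
    return mask


mask = causal_reference_mask(n_ref=2, n_tgt=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

One consequence of this mask is that the reference branch does not depend on the target, which is consistent with the two-stage decomposition the abstract mentions.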

Title: Downscaling Precipitation with Bias-informed Conditional Diffusion Model

Authors: Ran Lyu (1), Linhan Wang (1), Yanshen Sun (1), Hedanqiu Bai (2), Chang-Tien Lu (1) ((1) Virginia Tech, (2) Texas A&M University)
Subjects: cs.LG, cs.CV, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.14539
Pdf URL: https://arxiv.org/pdf/2412.14539
Copy Paste: [[2412.14539]] Downscaling Precipitation with Bias-informed Conditional Diffusion Model(https://arxiv.org/abs/2412.14539)
Keywords: diffusion
Abstract: Climate change is intensifying rainfall extremes, making high-resolution precipitation projections crucial for society to better prepare for impacts such as flooding. However, current Global Climate Models (GCMs) operate at spatial resolutions too coarse for localized analyses. To address this limitation, deep learning-based statistical downscaling methods offer promising solutions, providing high-resolution precipitation projections with a moderate computational cost. In this work, we introduce a bias-informed conditional diffusion model for statistical downscaling of precipitation. Specifically, our model leverages a conditional diffusion approach to learn distribution priors from large-scale, high-resolution precipitation datasets. The long-tail distribution of precipitation poses a unique challenge for training diffusion models; to address this, we apply gamma correction during preprocessing. Additionally, to correct biases in the downscaled results, we employ a guided-sampling strategy to enhance bias correction. Our experiments demonstrate that the proposed model achieves highly accurate results in an 8 times downscaling setting, outperforming previous deterministic methods. The code and dataset are available at this https URL

Title: Transformer models are gauge invariant: A mathematical connection between AI and particle physics

Authors: Leo van Nierop
Subjects: cs.LG, hep-th
Abstract URL: https://arxiv.org/abs/2412.14543
Pdf URL: https://arxiv.org/pdf/2412.14543
Copy Paste: [[2412.14543]] Transformer models are gauge invariant: A mathematical connection between AI and particle physics(https://arxiv.org/abs/2412.14543)
Keywords: transformer
Abstract: In particle physics, the fundamental forces are subject to symmetries called gauge invariance. It is a redundancy in the mathematical description of any physical system. In this article I will demonstrate that the transformer architecture exhibits the same properties, and show that the default representation of transformers has partially, but not fully removed the gauge invariance.
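
One well-known redundancy of this kind is in the attention scores: multiplying the query projection by an invertible matrix G and the key projection by the inverse transpose of G changes the weights but not the scores. The abstract does not say which symmetries the article analyzes, so the sketch below is an illustration of the general idea rather than the paper's construction; all matrices are made-up toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(g):
    (p, q), (r, s) = g
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[s / det, -q / det], [-r / det, p / det]]


def attention_scores(x, w_q, w_k):
    """Unnormalized scores (X W_q)(X W_k)^T."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))


x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # three tokens, width 2
w_q = [[1.0, 0.0], [2.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 1.0]]
g = [[2.0, 1.0], [1.0, 1.0]]                # any invertible matrix

w_q_new = matmul(w_q, g)                          # W_q -> W_q G
w_k_new = matmul(w_k, transpose(inverse_2x2(g)))  # W_k -> W_k G^{-T}

original = attention_scores(x, w_q, w_k)
transformed = attention_scores(x, w_q_new, w_k_new)
print(original == transformed or "equal up to rounding")
```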

Title: Summary of Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images

Authors: Kamorudeen A. Amuda, Almustapha A. Wakili
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14545
Pdf URL: https://arxiv.org/pdf/2412.14545
Copy Paste: [[2412.14545]] Summary of Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images(https://arxiv.org/abs/2412.14545)
Keywords: federate, transformer
Abstract: This study introduces a federated learning-based approach to predict HER2 status from hematoxylin and eosin (HE)-stained whole slide images (WSIs), reducing costs and speeding up treatment decisions. To address label imbalance and feature representation challenges in multisite datasets, a point transformer is proposed, incorporating dynamic label distribution, an auxiliary classifier, and farthest cosine sampling. Extensive experiments demonstrate state-of-the-art performance across four sites (2687 WSIs) and strong generalization to two unseen sites (229 WSIs).
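
The abstract names farthest cosine sampling without defining it. One plausible reading, shown here purely as an assumption, is greedy farthest-point sampling with cosine distance in place of Euclidean distance. The function names and the toy feature vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def farthest_cosine_sampling(features, k, start=0):
    """Greedily pick k indices, each farthest (in cosine distance) from those chosen."""
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of features")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [cosine_distance(f, features[start]) for f in features]
    while len(selected) < k:
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], cosine_distance(f, features[nxt]))
    return selected


features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = farthest_cosine_sampling(features, k=3)
print(picked)
```

The near-duplicate direction at index 1 is skipped in favor of directions that differ more from what has already been selected.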

Title: S$^3$-Mamba: Small-Size-Sensitive Mamba for Lesion Segmentation

Authors: Gui Wang, Yuexiang Li, Wenting Chen, Meidan Ding, Wooi Ping Cheah, Rong Qu, Jianfeng Ren, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14546
Pdf URL: https://arxiv.org/pdf/2412.14546
Copy Paste: [[2412.14546]] S$^3$-Mamba: Small-Size-Sensitive Mamba for Lesion Segmentation(https://arxiv.org/abs/2412.14546)
Keywords: segmentation
Abstract: Small lesions play a critical role in early disease diagnosis and intervention of severe infections. Popular models often face challenges in segmenting small lesions, as they occupy only a minor portion of an image, while down-sampling operations may inevitably lose focus on local features of small lesions. To tackle the challenges, we propose a Small-Size-Sensitive Mamba (S$^3$-Mamba), which promotes the sensitivity to small lesions across three dimensions: channel, spatial, and training strategy. Specifically, an Enhanced Visual State Space block is designed to focus on small lesions through multiple residual connections to preserve local features, and selectively amplify important details while suppressing irrelevant ones through channel-wise attention. A Tensor-based Cross-feature Multi-scale Attention is designed to integrate input image features and intermediate-layer features with edge features and exploit the attentive support of features across multiple scales, thereby retaining spatial details of small lesions at various granularities. Finally, we introduce a novel regularized curriculum learning to automatically assess lesion size and sample difficulty, and gradually focus from easy samples to hard ones like small lesions. Extensive experiments on three medical image segmentation datasets show the superiority of our S$^3$-Mamba, especially in segmenting small lesions. Our code is available at this https URL.

Title: Single-Loop Federated Actor-Critic across Heterogeneous Environments

Authors: Ye Zhu, Xiaowen Gong
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.14555
Pdf URL: https://arxiv.org/pdf/2412.14555
Copy Paste: [[2412.14555]] Single-Loop Federated Actor-Critic across Heterogeneous Environments(https://arxiv.org/abs/2412.14555)
Keywords: federate
Abstract: Federated reinforcement learning (FRL) has emerged as a promising paradigm, enabling multiple agents to collaborate and learn a shared policy adaptable across heterogeneous environments. Among the various reinforcement learning (RL) algorithms, the actor-critic (AC) algorithm stands out for its low variance and high sample efficiency. However, little to nothing is known theoretically about AC in a federated manner, especially when each agent interacts with a potentially different environment. The lack of such results is attributed to various technical challenges: a two-level structure illustrating the coupling effect between the actor and the critic, heterogeneous environments, Markovian sampling and multiple local updates. In response, we study Single-loop Federated Actor Critic (SFAC), where agents perform actor-critic learning in a two-level federated manner while interacting with heterogeneous environments. We then provide bounds on the convergence error of SFAC. The results show that the convergence error asymptotically converges to a near-stationary point, with the extent proportional to environment heterogeneity. Moreover, the sample complexity exhibits a linear speed-up through the federation of agents. We evaluate the performance of SFAC through numerical experiments using common RL benchmarks, which demonstrate its effectiveness.

Title: ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

Authors: Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, Ruimao Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14559
Pdf URL: https://arxiv.org/pdf/2412.14559
Copy Paste: [[2412.14559]] ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model(https://arxiv.org/abs/2412.14559)
Keywords: transformer
Abstract: The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law in relation to compute budgets. Furthermore, we also confirm the power law between Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens with respect to compute budgets respectively. Leveraging the scaling law, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1e18$. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss, thereby validating the scaling law.
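
Power-law relations of the kind described here are commonly fitted as a straight line in log-log space and then extrapolated to a target compute budget. The sketch below shows that generic procedure on synthetic numbers; the data points, the exponent, and the function name are invented for illustration and are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b on (log x, log y); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    b = (sum((p - mean_x) * (q - mean_y) for p, q in zip(log_x, log_y))
         / sum((p - mean_x) ** 2 for p in log_x))
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic example: "optimal parameter count" following 2 * C**0.5.
budgets = [1e15, 1e16, 1e17]
optimal_params = [2.0 * c ** 0.5 for c in budgets]

a, b = fit_power_law(budgets, optimal_params)
predicted = a * (1e18 ** b)  # extrapolate to a larger budget
print(a, b, predicted)
```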

Title: GBRIP: Granular Ball Representation for Imbalanced Partial Label Learning

Authors: Jintao Huang, Yiu-ming Cheung, Chi-man Vong, Wenbin Qian
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14561
Pdf URL: https://arxiv.org/pdf/2412.14561
Copy Paste: [[2412.14561]] GBRIP: Granular Ball Representation for Imbalanced Partial Label Learning(https://arxiv.org/abs/2412.14561)
Keywords: robust
Abstract: Partial label learning (PLL) is a complicated weakly supervised multi-classification task compounded by class imbalance. Currently, existing methods only rely on inter-class pseudo-labeling from inter-class features, often overlooking the significant impact of the intra-class imbalanced features combined with the inter-class. To address these limitations, we introduce Granular Ball Representation for Imbalanced PLL (GBRIP), a novel framework for imbalanced PLL. GBRIP utilizes coarse-grained granular ball representation and multi-center loss to construct a granular ball-based feature space through unsupervised learning, effectively capturing the feature distribution within each class. GBRIP mitigates the impact of confusing features by systematically refining label disambiguation and estimating imbalance distributions. The novel multi-center loss function enhances learning by emphasizing the relationships between samples and their respective centers within the granular balls. Extensive experiments on standard benchmarks demonstrate that GBRIP outperforms existing state-of-the-art methods, offering a robust solution to the challenges of imbalanced PLL.

Title: AIArena: A Blockchain-Based Decentralized AI Training Platform

Authors: Zhipeng Wang, Rui Sun, Elizabeth Lui, Tuo Zhou, Yizhe Wen, Jiahao Sun
Subjects: cs.CR, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14566
Pdf URL: https://arxiv.org/pdf/2412.14566
Copy Paste: [[2412.14566]] AIArena: A Blockchain-Based Decentralized AI Training Platform(https://arxiv.org/abs/2412.14566)
Keywords: fair
Abstract: The rapid advancement of AI has underscored critical challenges in its development and implementation, largely due to centralized control by a few major corporations. This concentration of power intensifies biases within AI models, resulting from inadequate governance and oversight mechanisms. Additionally, it limits public involvement and heightens concerns about the integrity of model generation. Such monopolistic control over data and AI outputs threatens both innovation and fair data usage, as users inadvertently contribute data that primarily benefits these corporations. In this work, we propose AIArena, a blockchain-based decentralized AI training platform designed to democratize AI development and alignment through on-chain incentive mechanisms. AIArena fosters an open and collaborative environment where participants can contribute models and computing resources. Its on-chain consensus mechanism ensures fair rewards for participants based on their contributions. We instantiate and implement AIArena on the public Base blockchain Sepolia testnet, and the evaluation results demonstrate the feasibility of AIArena in real-world applications.

Title: Global Spatio-Temporal Fusion-based Traffic Prediction Algorithm with Anomaly Aware

Authors: Chaoqun Liu, Xuanpeng Li, Chen Gong, Guangyu Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14569
Pdf URL: https://arxiv.org/pdf/2412.14569
Copy Paste: [[2412.14569]] Global Spatio-Temporal Fusion-based Traffic Prediction Algorithm with Anomaly Aware(https://arxiv.org/abs/2412.14569)
Keywords: transformer
Abstract: Traffic prediction is an indispensable component of urban planning and traffic management. Achieving accurate traffic prediction hinges on the ability to capture the potential spatio-temporal relationships among road sensors. However, the majority of existing works focus on local short-term spatio-temporal correlations, failing to fully consider the interactions of different sensors in the long-term state. In addition, these works either do not analyze the influence of anomalous factors or lack the ability to extract personalized features of those factors, so they fail to capture the spatio-temporal influence of anomalies on traffic prediction. To address these issues, we propose a global spatio-temporal fusion-based traffic prediction algorithm that incorporates anomaly awareness. First, based on the designed anomaly detection network, we construct an efficient anomalous factors impacting module (AFIM) to evaluate the spatio-temporal impact of unexpected external events on traffic prediction. Furthermore, we propose a multi-scale spatio-temporal feature fusion module (MTSFFL) based on the transformer architecture to capture both long- and short-term correlations among different sensors in a wide-area traffic environment for accurate prediction of traffic flow. Finally, experiments are conducted on real-scenario public transportation datasets (PEMS04 and PEMS08) to demonstrate that our approach can achieve state-of-the-art performance.

Title: SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection

Authors: Ruoyu Xu, Zhiyu Xiang, Chenwei Zhang, Hanzhi Zhong, Xijun Zhao, Ruina Dang, Peng Xu, Tianyu Pu, Eryun Liu
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14571
Pdf URL: https://arxiv.org/pdf/2412.14571
Copy Paste: [[2412.14571]] SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection(https://arxiv.org/abs/2412.14571)
Keywords: robust
Abstract: 3D object detection is one of the fundamental perception tasks for autonomous vehicles. Fulfilling such a task with a 4D millimeter-wave radar is very attractive since the sensor is able to acquire 3D point clouds similar to Lidar while maintaining robust measurements under adverse weather. However, due to the high sparsity and noise associated with the radar point clouds, the performance of the existing methods is still much lower than expected. In this paper, we propose a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method for 4D radar-based 3D object detection. It characterizes the capability of learning the feature from a Lidar-radar-fused teacher network with semi-supervised distillation. We first propose an adaptive fusion module in the teacher network to boost its performance. Then, two feature distillation modules are designed to facilitate the cross-modality knowledge transfer. Finally, a semi-supervised output distillation is proposed to increase the effectiveness and flexibility of the distillation framework. With the same network structure, our radar-only student trained by SCKD boosts the mAP by 10.38% over the baseline and outperforms the state-of-the-art works on the VoD dataset. The experiment on ZJUODset also shows 5.12% mAP improvements on the moderate difficulty level over the baseline when extra unlabeled data are available. Code is available at this https URL.

Title: Alignment-Free RGB-T Salient Object Detection: A Large-scale Dataset and Progressive Correlation Network

Authors: Kunpeng Wang, Keke Chen, Chenglong Li, Zhengzheng Tu, Bin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14576
Pdf URL: https://arxiv.org/pdf/2412.14576
Copy Paste: [[2412.14576]] Alignment-Free RGB-T Salient Object Detection: A Large-scale Dataset and Progressive Correlation Network(https://arxiv.org/abs/2412.14576)
Keywords: robust
Abstract: Alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly leveraging the complementary information from unaligned visible-thermal image pairs, without requiring manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks, hindering the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale and high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, complex salient objects, and so on. To support the exploration for further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions in unaligned image pairs. Extensive experiments conducted on unaligned and aligned datasets demonstrate the effectiveness of our method. Code and dataset are available at this https URL.

Title: DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Authors: Yiren Song, Xiaokang Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14580
Pdf URL: https://arxiv.org/pdf/2412.14580
Copy Paste: [[2412.14580]] DiffSim: Taming Diffusion Models for Evaluating Visual Similarity(https://arxiv.org/abs/2412.14580)
Keywords: robust, diffusion, generative
Abstract: Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.

Title: CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation

Authors: Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14581
Pdf URL: https://arxiv.org/pdf/2412.14581
Copy Paste: [[2412.14581]] CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation(https://arxiv.org/abs/2412.14581)
Keywords: robust, large language model
Abstract: With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation to the retrieved contexts. Yet, this is hindered by position bias of LLMs, failing to evenly attend to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of gold segment, creating a position-diversified train set. We extend this intuition to propose consistency regularization with augmentation and distillation. First, we augment each training instance with its position perturbation to encourage consistent predictions, regardless of ordering. We also distill behaviors of this pair, although it can be counterproductive in certain RAG scenarios where the given order from the retriever is crucial for generation quality. We thus propose CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples noise-controlled perturbations from an interpolation space, ensuring both consistency and respect for the rank prior. Empirical results show this balance enables CORD to outperform consistently in diverse RAG benchmarks.

Title: Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues

Authors: Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14584
Pdf URL: https://arxiv.org/pdf/2412.14584
Copy Paste: [[2412.14584]] Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues(https://arxiv.org/abs/2412.14584)
Keywords: large language model
Abstract: Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g. emotion support and persuasion). Unlike traditional task-oriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.

Title: HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning

Authors: Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14585
Pdf URL: https://arxiv.org/pdf/2412.14585
Copy Paste: [[2412.14585]] HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning(https://arxiv.org/abs/2412.14585)
Keywords: large language model
Abstract: With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets.

Title: Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation

Authors: Zhenxin Lei, Man Yao, Jiakui Hu, Xinhao Luo, Yanye Lu, Bo Xu, Guoqi Li
Subjects: cs.CV, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2412.14587
Pdf URL: https://arxiv.org/pdf/2412.14587
Copy Paste: [[2412.14587]] Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation(https://arxiv.org/abs/2412.14587)
Keywords: transformer, segmentation
Abstract: Spiking Neural Networks (SNNs) have a low-power advantage but perform poorly in image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non-convergence. To address this challenge, we first identify the modules in the architecture design that lead to the severe reduction in spike firing, make targeted improvements, and propose the Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state-of-the-art for SNNs in various semantic segmentation datasets, with a significant improvement of +12.7% mIoU and 5.0× efficiency on ADE20K, +14.3% mIoU and 5.2× efficiency on VOC2012, and +9.1% mIoU and 6.6× efficiency on CityScapes.

Title: Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning

Authors: Kepu Zhang, Haoyue Yang, Xu Tang, Weijie Yu, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14588
Pdf URL: https://arxiv.org/pdf/2412.14588
Copy Paste: [[2412.14588]] Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning(https://arxiv.org/abs/2412.14588)
Keywords: large language model
Abstract: In legal practice, judges apply the trichotomous dogmatics of criminal law, sequentially assessing the elements of the offense, unlawfulness, and culpability to determine whether an individual's conduct constitutes a crime. Although current legal large language models (LLMs) show promising accuracy in judgment prediction, they lack trichotomous reasoning capabilities due to the absence of an appropriate benchmark dataset, preventing them from predicting innocent outcomes. As a result, every input is automatically assigned a charge, limiting their practical utility in legal contexts. To bridge this gap, we introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three widely-used legal datasets through LLM-based augmentation and manual verification. Our experiments with state-of-the-art legal LLMs and novel strategies that integrate trichotomous reasoning into zero-shot prompting and fine-tuning reveal: (1) current legal LLMs have significant room for improvement, with even the best models achieving an F1 score of less than 0.3 on LJPIV; and (2) our strategies notably enhance both in-domain and cross-domain judgment prediction accuracy, especially for cases resulting in an innocent verdict.

Title: Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties

Authors: Wenqiao Li, Bozhong Zheng, Xiaohao Xu, Jinye Gan, Fading Lu, Xiang Li, Na Ni, Zheng Tian, Xiaonan Huang, Shenghua Gao, Yingna Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14592
Pdf URL: https://arxiv.org/pdf/2412.14592
Copy Paste: [[2412.14592]] Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties(https://arxiv.org/abs/2412.14592)
Keywords: robust
Abstract: Object anomaly detection is essential for industrial quality inspection, yet traditional single-sensor methods face critical limitations. They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. MulSen-AD unifies data from RGB cameras, laser scanners, and lock-in infrared thermography, effectively capturing external appearance, geometric deformations, and internal defects. The dataset spans 15 industrial products with diverse, real-world anomalies. We also present MulSen-AD Bench, a benchmark designed to evaluate multi-sensor methods, and propose MulSen-TripleAD, a decision-level fusion algorithm that integrates these three modalities for robust, unsupervised object anomaly detection. Our experiments demonstrate that multi-sensor fusion substantially outperforms single-sensor approaches, achieving 96.1% AUROC in object-level detection accuracy. These results highlight the importance of integrating multi-sensor data for comprehensive industrial anomaly detection.

Title: LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining

Authors: Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14596
Pdf URL: https://arxiv.org/pdf/2412.14596
Copy Paste: [[2412.14596]] LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining(https://arxiv.org/abs/2412.14596)
Keywords: extraction, diffusion
Abstract: Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that vision and layout modality hold invariance among images with different languages. If decoupling language bias from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm LDP (Language Decoupled Pre-training) for better utilization of monolingual pre-training data. Our proposed model LDM (Language Decoupled Model) is first pre-trained on the language-independent data, where the language knowledge is decoupled by a diffusion model, and then the LDM is fine-tuned on the downstream languages. Extensive experiments show that the LDM outperformed all SOTA multilingual pre-trained models, and also maintains competitiveness on downstream monolingual/English benchmarks.

Title: Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization Through Spare-Coding Transformer

Authors: Lei Su, Xiaochen Ma, Xuekang Zhu, Chaoqun Niu, Zeyu Lei, Ji-Zhe Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14598
Pdf URL: https://arxiv.org/pdf/2412.14598
Copy Paste: [[2412.14598]] Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization Through Spare-Coding Transformer(https://arxiv.org/abs/2412.14598)
Keywords: transformer
Abstract: Non-semantic features or semantic-agnostic features, which are irrelevant to image context but sensitive to image manipulations, are recognized as evidential to Image Manipulation Localization (IML). Since manual labels are impossible, existing works rely on handcrafted methods to extract non-semantic features. Handcrafted non-semantic features jeopardize an IML model's generalization ability in unseen or complex scenarios. Therefore, for IML, the elephant in the room is: How to adaptively extract non-semantic features? Non-semantic features are context-irrelevant and manipulation-sensitive. That is, within an image, they are consistent across patches unless manipulation occurs. Thus, sparse and discrete interactions among image patches are sufficient for extracting non-semantic features. However, image semantics vary drastically on different patches, requiring dense and continuous interactions among image patches for learning semantic representations. Hence, in this paper, we propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete manner. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features for images. Besides, compared with existing IML models, the sparse self-attention mechanism largely reduces the model size (max 80% in FLOPs), achieving stunning parameter efficiency and computation reduction. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets.

Title: KARRIEREWEGE: A Large Scale Career Path Prediction Dataset

Authors: Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14612
Pdf URL: https://arxiv.org/pdf/2412.14612
Copy Paste: [[2412.14612]] KARRIEREWEGE: A Large Scale Career Path Prediction Dataset(https://arxiv.org/abs/2412.14612)
Keywords: robust
Abstract: Accurate career path prediction can support many stakeholders, like job seekers, recruiters, HR, and project managers. However, publicly available data and tools for career path prediction are scarce. In this work, we introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths, significantly surpassing the size of previously available datasets. We link the dataset to the ESCO taxonomy to offer a valuable resource for predicting career trajectories. To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions resulting in KARRIEREWEGE+. This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges. We benchmark existing state-of-the-art (SOTA) models on our dataset and a prior benchmark and observe improved performance and robustness, particularly for free-text use cases, due to the synthesized data.

Title: How good is GPT at writing political speeches for the White House?

Authors: Jacques Savoy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14617
Pdf URL: https://arxiv.org/pdf/2412.14617
Copy Paste: [[2412.14617]] How good is GPT at writing political speeches for the White House?(https://arxiv.org/abs/2412.14617)
Keywords: large language model
Abstract: Using large language models (LLMs), computers are able to generate a written text in response to a user request. As this pervasive technology can be applied in numerous contexts, this study analyses the written style of one LLM called GPT by comparing its generated speeches with those of the recent US presidents. To achieve this objective, the State of the Union (SOTU) addresses written by Reagan to Biden are contrasted to those produced by both GPT-3.5 and GPT-4o versions. Compared to US presidents, GPT tends to overuse the lemma "we" and produce shorter messages with, on average, longer sentences. Moreover, GPT opts for an optimistic tone, choosing more often political (e.g., president, Congress), symbolic (e.g., freedom), and abstract terms (e.g., freedom). Even when imposing an author's style on GPT, the resulting speech remains distinct from addresses written by the target author. Finally, the two GPT versions present distinct characteristics, but both appear overall dissimilar to true presidential messages.

Title: Pitfalls of topology-aware image segmentation

Authors: Alexander H. Berger, Laurin Lux, Alexander Weers, Martin Menten, Daniel Rueckert, Johannes C. Paetzold
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14619
Pdf URL: https://arxiv.org/pdf/2412.14619
Copy Paste: [[2412.14619]] Pitfalls of topology-aware image segmentation(https://arxiv.org/abs/2412.14619)
Keywords: robust, fair, segmentation
Abstract: Topological correctness, i.e., the preservation of structural integrity and specific characteristics of shape, is a fundamental requirement for medical imaging tasks, such as neuron or vessel segmentation. Despite the recent surge in topology-aware methods addressing this challenge, their real-world applicability is hindered by flawed benchmarking practices. In this paper, we identify critical pitfalls in model evaluation that include inadequate connectivity choices, overlooked topological artifacts in ground truth annotations, and inappropriate use of evaluation metrics. Through detailed empirical analysis, we uncover these issues' profound impact on the evaluation and ranking of segmentation methods. Drawing from our findings, we propose a set of actionable recommendations to establish fair and robust evaluation standards for topology-aware medical image segmentation methods.
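
Illustrative sketch (not from the paper): one pitfall named in the abstract is the choice of connectivity. The Python below counts foreground connected components of a binary mask under 4-connectivity and 8-connectivity to show that the same mask can yield different topological counts. The function name and the toy mask are assumptions made for this example.

```python
from collections import deque


def count_components(mask, connectivity=4):
    """Count foreground connected components in a 2D binary mask (list of lists)."""
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        offsets = [
            (dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
        ]
    else:
        raise ValueError("connectivity must be 4 or 8")
    if not mask:
        return 0
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # New component found: flood-fill it with breadth-first search.
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for dr, dc in offsets:
                        nr, nc = cr + dr, cc + dc
                        if (
                            0 <= nr < rows
                            and 0 <= nc < cols
                            and mask[nr][nc]
                            and not seen[nr][nc]
                        ):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return count


# A diagonal line: pixels touch only at corners.
diagonal = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
components_4 = count_components(diagonal, connectivity=4)  # 3 separate pixels
components_8 = count_components(diagonal, connectivity=8)  # 1 connected line
```

A thin diagonal vessel segment would therefore count as intact or as fragmented depending only on this setting, so any component-based metric should state which connectivity it uses.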

Title: Learning to Generate Research Idea with Dynamic Control

Authors: Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14626
Pdf URL: https://arxiv.org/pdf/2412.14626
Copy Paste: [[2412.14626]] Learning to Generate Research Idea with Dynamic Control(https://arxiv.org/abs/2412.14626)
Keywords: large language model
Abstract: The rapid advancements in large language models (LLMs) have demonstrated their potential to accelerate scientific discovery, particularly in automating the process of research ideation. LLM-based systems have shown promise in generating hypotheses and research ideas. However, current approaches predominantly rely on prompting-based pre-trained models, limiting their ability to optimize generated content effectively. Moreover, they also lack the capability to deal with the complex interdependence and inherent restrictions among novelty, feasibility, and effectiveness, which remains challenging due to the inherent trade-offs among these dimensions, such as the innovation-feasibility conflict. To address these limitations, we for the first time propose fine-tuning LLMs to be better idea proposers and introduce a novel framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL). In the SFT stage, the model learns foundational patterns from pairs of research papers and follow-up ideas. In the RL stage, multi-dimensional reward modeling, guided by fine-grained feedback, evaluates and optimizes the generated ideas across key metrics. Dimensional controllers enable dynamic adjustment of generation, while a sentence-level decoder ensures context-aware emphasis during inference. Our framework provides a balanced approach to research ideation, achieving high-quality outcomes by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.

Title: Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models

Authors: Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14628
Pdf URL: https://arxiv.org/pdf/2412.14628
Copy Paste: [[2412.14628]] Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models(https://arxiv.org/abs/2412.14628)
Keywords: diffusion, transformer, generative
Abstract: Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional U-Nets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua$^2$SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua$^2$SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-${\alpha}$, PixArt-${\Sigma}$, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.
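
Illustrative sketch (not from the paper): fractional figures such as "3.4-bit" can arise when layers are quantized at different integer precisions and the result is reported as an average. The Python below computes a parameter-weighted average bit width, which is one plausible way to read such a number. The layer sizes and bit assignments are hypothetical, and the paper may define its average differently.

```python
def average_bit_width(layers):
    """Parameter-weighted mean bit width for a list of (num_params, bits) pairs."""
    total_params = sum(num_params for num_params, _ in layers)
    total_bits = sum(num_params * bits for num_params, bits in layers)
    return total_bits / total_params


# Hypothetical layer sizes and bit assignments, not taken from the paper.
layers = [
    (1_000_000, 3),  # a large layer kept at 3 bits
    (1_000_000, 4),  # a layer of the same size kept at 4 bits
    (500_000, 4),    # a smaller layer kept at 4 bits
]
avg_bits = average_bit_width(layers)  # (3M + 4M + 2M) / 2.5M = 3.6
```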

Title: Robust PCA Based on Adaptive Weighted Least Squares and Low-Rank Matrix Factorization

Authors: Kexin Li, You-wei Wen, Xu Xiao, Mingchao Zhao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14629
Pdf URL: https://arxiv.org/pdf/2412.14629
Copy Paste: [[2412.14629]] Robust PCA Based on Adaptive Weighted Least Squares and Low-Rank Matrix Factorization(https://arxiv.org/abs/2412.14629)
Keywords: robust
Abstract: Robust Principal Component Analysis (RPCA) is a fundamental technique for decomposing data into low-rank and sparse components, which plays a critical role in applications such as image processing and anomaly detection. Traditional RPCA methods commonly use $\ell_1$ norm regularization to enforce sparsity, but this approach can introduce bias and result in suboptimal estimates, particularly in the presence of significant noise or outliers. Non-convex regularization methods have been proposed to mitigate these challenges, but they tend to be complex to optimize and sensitive to initial conditions, leading to potential instability in solutions. To overcome these challenges, in this paper, we propose a novel RPCA model that integrates adaptive weighted least squares (AWLS) and low-rank matrix factorization (LRMF). The model employs a self-attention-inspired mechanism in its weight update process, allowing the weight matrix to dynamically adjust and emphasize significant components during each iteration. By employing a weighted F-norm for the sparse component, our method effectively reduces bias while simplifying the computational process compared to traditional $\ell_1$-norm-based methods. We use an alternating minimization algorithm, where each subproblem has an explicit solution, thereby improving computational efficiency. Despite its simplicity, numerical experiments demonstrate that our method outperforms existing non-convex regularization approaches, offering superior performance and stability, as well as enhanced accuracy and robustness in practical applications.

Title: Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model

Authors: Minglong Xue, Jinhong He, Shivakumara Palaiahnakote, Mingliang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14630
Pdf URL: https://arxiv.org/pdf/2412.14630
Copy Paste: [[2412.14630]] Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model(https://arxiv.org/abs/2412.14630)
Keywords: diffusion
Abstract: Image restoration and enhancement are pivotal for numerous computer vision applications, yet unifying these tasks efficiently remains a significant challenge. Inspired by the iterative refinement capabilities of diffusion models, we propose CycleRDM, a novel framework designed to unify restoration and enhancement tasks while achieving high-quality mapping. Specifically, CycleRDM first learns the mapping relationships among the degraded domain, the rough normal domain, and the normal domain through a two-stage diffusion inference process. Subsequently, we transfer the final calibration process to the wavelet low-frequency domain using discrete wavelet transform, performing fine-grained calibration from a frequency domain perspective by leveraging task-specific frequency spaces. To improve restoration quality, we design a feature gain module for the decomposed wavelet high-frequency domain to eliminate redundant features. Additionally, we employ multimodal textual prompts and Fourier transform to drive stable denoising and reduce randomness during the inference process. After extensive validation, CycleRDM can be effectively generalized to a wide range of image restoration and enhancement tasks while requiring only a small number of training samples to be significantly superior on various benchmarks of reconstruction quality and perceptual quality. The source code will be available at this https URL.

Title: Review of Fruit Tree Image Segmentation

Authors: Il-Seok Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14631
Pdf URL: https://arxiv.org/pdf/2412.14631
Copy Paste: [[2412.14631]] Review of Fruit Tree Image Segmentation(https://arxiv.org/abs/2412.14631)
Keywords: segmentation
Abstract: Fruit tree image segmentation is an essential problem in automating a variety of agricultural tasks such as phenotyping, harvesting, spraying, and pruning. Many research papers have proposed a diverse spectrum of solutions suitable to specific tasks and environments. The review scope of this paper is confined to the front views of fruit trees and based on 158 relevant papers collected using a newly designed crawling review method. These papers are systematically reviewed based on a taxonomy that sequentially considers the method, image, task, and fruit. This taxonomy will assist readers to intuitively grasp the big picture of these research activities. Our review reveals that the most noticeable deficiency of the previous studies was the lack of a versatile dataset and segmentation model that could be applied to a variety of tasks and environments. Six important future research tasks are suggested, with the expectation that these will pave the way to building a versatile tree segmentation module.

Title: Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers

Authors: Rui Ding, Liang Yong, Sihuan Zhao, Jing Nie, Lihui Chen, Haijun Liu, Xichuan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14633
Pdf URL: https://arxiv.org/pdf/2412.14633
Copy Paste: [[2412.14633]] Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers(https://arxiv.org/abs/2412.14633)
Keywords: transformer, segmentation
Abstract: Due to its efficiency, Post-Training Quantization (PTQ) has been widely adopted for compressing Vision Transformers (ViTs). However, when quantized into low-bit representations, there is often a significant performance drop compared to their full-precision counterparts. To address this issue, reconstruction methods have been incorporated into the PTQ framework to improve performance in low-bit quantization settings. Nevertheless, existing related methods predefine the reconstruction granularity and seldom explore the progressive relationships between different reconstruction granularities, which leads to sub-optimal quantization results in ViTs. To this end, in this paper, we propose a Progressive Fine-to-Coarse Reconstruction (PFCR) method for accurate PTQ, which significantly improves the performance of low-bit quantized vision transformers. Specifically, we define multi-head self-attention and multi-layer perceptron modules along with their shortcuts as the finest reconstruction units. After reconstructing these two fine-grained units, we combine them to form coarser blocks and reconstruct them at a coarser granularity level. We iteratively perform this combination and reconstruction process, achieving progressive fine-to-coarse reconstruction. Additionally, we introduce a Progressive Optimization Strategy (POS) for PFCR to alleviate the difficulty of training, thereby further enhancing model performance. Experimental results on the ImageNet dataset demonstrate that our proposed method achieves the best Top-1 accuracy among state-of-the-art methods, particularly attaining 75.61% for 3-bit quantized ViT-B in PTQ. Besides, quantization results on the COCO dataset reveal the effectiveness and generalization of our proposed method on other computer vision tasks like object detection and instance segmentation.

Title: Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

Authors: Eric Brouwer, Jan Erik van Woerden, Gertjan Burghouts, Matias Valedenegro-Toro, Marco Zullich
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14640
Pdf URL: https://arxiv.org/pdf/2412.14640
Copy Paste: [[2412.14640]] Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning(https://arxiv.org/abs/2412.14640)
Keywords: robust, transformer
Abstract: Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model's predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.
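
Illustrative sketch (not the authors' model): the abstract mentions Monte-Carlo Dropout for uncertainty estimates. The general idea is to keep dropout active at inference, run several stochastic forward passes, and read the spread of the outputs as an uncertainty signal. The Python below applies this to a toy linear scorer. The function names, the features and weights, and the dropout placement on the inputs are all assumptions for this example.

```python
import random
import statistics


def dropout_forward(features, weights, drop_prob, rng):
    """One stochastic forward pass of a linear scorer with inverted dropout on the inputs."""
    keep_prob = 1.0 - drop_prob
    score = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep_prob:
            # Kept inputs are rescaled so the expected score matches the no-dropout score.
            score += (x / keep_prob) * w
    return score


def mc_dropout_predict(features, weights, drop_prob=0.5, passes=100, seed=0):
    """Return (mean, standard deviation) of the score over repeated stochastic passes."""
    rng = random.Random(seed)
    scores = [dropout_forward(features, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(scores), statistics.pstdev(scores)


# Hypothetical features and weights.
features = [1.0, 2.0]
weights = [0.5, 0.25]

mean_score, uncertainty = mc_dropout_predict(features, weights, drop_prob=0.5)
baseline_mean, baseline_std = mc_dropout_predict(features, weights, drop_prob=0.0)
```

With `drop_prob=0.0` every pass is identical, so the standard deviation is zero. With dropout enabled, a larger spread marks predictions that may need additional verification.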

Title: RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

Authors: Jie Huang, Ruibing Hou, Jiahe Zhao, Hong Chang, Shiguang Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14643
Pdf URL: https://arxiv.org/pdf/2412.14643
Copy Paste: [[2412.14643]] RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios(https://arxiv.org/abs/2412.14643)
Keywords: transformer
Abstract: Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM's competitive and even superior performance across multiple human-centric referring tasks. The code and data are publicly available at this https URL.

Title: Length Controlled Generation for Black-box LLMs

Authors: Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Tat-Seng Chua, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14656
Pdf URL: https://arxiv.org/pdf/2412.14656
Copy Paste: [[2412.14656]] Length Controlled Generation for Black-box LLMs(https://arxiv.org/abs/2412.14656)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive instruction-following capabilities, while still struggling to accurately manage the length of the generated text, which is a fundamental requirement in many real-world applications. Existing length control methods involve fine-tuning the parameters of LLMs, which is inefficient and suboptimal for practical use. In this paper, we propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. This framework efficiently and reliably regulates LLMs to generate length-constrained text without modifying the underlying parameters, thereby preserving the original capabilities of LLMs. Experimental results demonstrate that our framework achieves almost 100% success rates of length control on Llama3.1 for tasks such as length-controlled abstractive summarization and length-constrained instruction following, with minimal additional computational overhead. This also highlights the significant potential of our method for precise length control across a broader range of applications, without compromising the versatility of LLMs.

Title: Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models

Authors: Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.14660
Pdf URL: https://arxiv.org/pdf/2412.14660
Copy Paste: [[2412.14660]] Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models(https://arxiv.org/abs/2412.14660)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: this https URL.
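
Editor's illustration (not from the paper): the abstract names temperature scaling as one calibration technique. The sketch below shows only the generic idea, dividing logits by a scalar T before the softmax so that T > 1 softens overconfident predictions. The function name, the example logits, and the temperatures are assumptions chosen for illustration; how the authors fit or apply T is not stated in the abstract.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits only; real values would come from a model.
logits = [4.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)
```

In practice T is usually fitted on a held-out validation set; the predicted class is unchanged because dividing by a positive T preserves the ordering of the logits.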

Title: LoLaFL: Low-Latency Federated Learning via Forward-only Propagation

Authors: Jierui Zhang, Jianhao Huang, Kaibin Huang
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.14668
Pdf URL: https://arxiv.org/pdf/2412.14668
Copy Paste: [[2412.14668]] LoLaFL: Low-Latency Federated Learning via Forward-only Propagation(https://arxiv.org/abs/2412.14668)
Keywords: privacy, federate
Abstract: Federated learning (FL) has emerged as a widely adopted paradigm for enabling edge learning with distributed data while ensuring data privacy. However, the traditional FL with deep neural networks trained via backpropagation can hardly meet the low-latency learning requirements in the sixth generation (6G) mobile networks. This challenge mainly arises from the high-dimensional model parameters to be transmitted and the numerous rounds of communication required for convergence due to the inherent randomness of the training process. To address this issue, we adopt the state-of-the-art principle of maximal coding rate reduction to learn linear discriminative features and extend the resultant white-box neural network into FL, yielding the novel framework of Low-Latency Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables layer-wise transmissions and aggregation with significantly fewer communication rounds, thereby considerably reducing latency. Additionally, we propose two nonlinear aggregation schemes for LoLaFL. The first scheme is based on the proof that the optimal NN parameter aggregation in LoLaFL should be harmonic-mean-like. The second scheme further exploits the low-rank structures of the features and transmits the low-rank-approximated covariance matrices of features to achieve additional latency reduction. Theoretical analysis and experiments are conducted to evaluate the performance of LoLaFL. In comparison with traditional FL, the two nonlinear aggregation schemes for LoLaFL can achieve reductions in latency of over 91% and 98%, respectively, while maintaining comparable accuracies.

Title: Simplicity over Complexity: An ARN-Based Intrusion Detection Method for Industrial Control Network

Authors: Ziyi Liu, Dengpan Ye, Changsong Yang, Yong Ding, Yueling Liu, Long Tang, Chuanxi Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.14669
Pdf URL: https://arxiv.org/pdf/2412.14669
Copy Paste: [[2412.14669]] Simplicity over Complexity: An ARN-Based Intrusion Detection Method for Industrial Control Network(https://arxiv.org/abs/2412.14669)
Keywords: attack
Abstract: Industrial control network (ICN) is characterized by real-time responsiveness and reliability, which plays a key role in increasing production speed, rational and efficient processing, and managing the production process. Despite tremendous advantages, ICN inevitably struggles with some challenges, such as malicious user intrusion and hacker attacks. To detect malicious intrusions in ICN, intrusion detection systems have been deployed. However, in ICN, network traffic data is equipped with characteristics of large scale, irregularity, multiple features, temporal correlation and high dimensionality, which greatly affect the efficiency and performance. To properly solve the above problems, we design a new intrusion detection method for ICN. Specifically, we first design a novel neural network model called associative recurrent network (ARN), which can properly handle the relationship between past moment hidden state and current moment information. Then, we adopt ARN to design a new intrusion detection method that can efficiently and accurately detect malicious intrusions in ICN. Subsequently, we demonstrate the high efficiency of our proposed method through theoretical computational complexity analysis. Finally, we develop a prototype implementation to evaluate the accuracy. The experimental results prove that our proposed method has state-of-the-art performance on both the ICN dataset SWaT and the conventional network traffic dataset UNSW-NB15. The accuracies on the SWaT dataset and the UNSW-NB15 dataset reach 95.48% and 97.61%, respectively.

Title: Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT

Authors: Hassane Kissane, Achim Schilling, Patrick Krauss
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14670
Pdf URL: https://arxiv.org/pdf/2412.14670
Copy Paste: [[2412.14670]] Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT(https://arxiv.org/abs/2412.14670)
Keywords: transformer, large language model
Abstract: This study investigates the internal representations of verb-particle combinations within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic nuances at different neural network layers. Employing the BERT architecture, we analyse the representational efficacy of its layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Our methodology includes a detailed dataset preparation from the British National Corpus, followed by extensive model training and output analysis through techniques like multi-dimensional scaling (MDS) and generalized discrimination value (GDV) calculations. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories. These findings challenge the conventional uniformity assumed in neural network processing of linguistic elements and suggest a complex interplay between network architecture and linguistic representation. Our research contributes to a better understanding of how deep learning models comprehend and process language, offering insights into the potential and limitations of current neural approaches to linguistic analysis. This study not only advances our knowledge in computational linguistics but also prompts further research into optimizing neural architectures for enhanced linguistic precision.

Title: MUSTER: Longitudinal Deformable Registration by Composition of Consecutive Deformations

Authors: Edvard O. S. Grødem, Donatas Sederevičius, Esten H. Leonardsen, Bradley J. MacIntosh, Atle Bjørnerud, Till Schellhorn, Øystein Sørensen, Inge Amlien, Pablo F. Garrido, Anders M. Fjell
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2412.14671
Pdf URL: https://arxiv.org/pdf/2412.14671
Copy Paste: [[2412.14671]] MUSTER: Longitudinal Deformable Registration by Composition of Consecutive Deformations(https://arxiv.org/abs/2412.14671)
Keywords: robust, segmentation
Abstract: Longitudinal imaging allows for the study of structural changes over time. One approach to detecting such changes is by non-linear image registration. This study introduces Multi-Session Temporal Registration (MUSTER), a novel method that facilitates longitudinal analysis of changes in extended series of medical images. MUSTER improves upon conventional pairwise registration by incorporating more than two imaging sessions to recover longitudinal deformations. Longitudinal analysis at a voxel-level is challenging due to effects of a changing image contrast as well as instrumental and environmental sources of bias between sessions. We show that local normalized cross-correlation as an image similarity metric leads to biased results and propose a robust alternative. We test the performance of MUSTER on a synthetic multi-site, multi-session neuroimaging dataset and show that, in various scenarios, using MUSTER significantly enhances the estimated deformations relative to pairwise registration. Additionally, we apply MUSTER on a sample of older adults from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. The results show that MUSTER can effectively identify patterns of neuro-degeneration from T1-weighted images and that these changes correlate with changes in cognition, matching the performance of state-of-the-art segmentation methods. By leveraging GPU acceleration, MUSTER efficiently handles large datasets, making it feasible also in situations with limited computational resources.

Title: FiVL: A Framework for Improved Vision-Language Alignment

Authors: Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14672
Pdf URL: https://arxiv.org/pdf/2412.14672
Copy Paste: [[2412.14672]] FiVL: A Framework for Improved Vision-Language Alignment(https://arxiv.org/abs/2412.14672)
Keywords: explainability
Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model's reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability. The code is available at this https URL.

Title: LLMs as mediators: Can they diagnose conflicts accurately?

Authors: Özgecan Koçak (Emory University), Phanish Puranam (INSEAD), Afşar Yegin (Kadir Has University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14675
Pdf URL: https://arxiv.org/pdf/2412.14675
Copy Paste: [[2412.14675]] LLMs as mediators: Can they diagnose conflicts accurately?(https://arxiv.org/abs/2412.14675)
Keywords: large language model
Abstract: Prior research indicates that to be able to mediate conflict, observers of disagreements between parties must be able to reliably distinguish the sources of their disagreement as stemming from differences in beliefs about what is true (causality) vs. differences in what they value (morality). In this paper, we test if OpenAI's Large Language Models GPT 3.5 and GPT 4 can perform this task and whether one or other type of disagreement proves particularly challenging for LLMs to diagnose. We replicate study 1 in Koçak et al. (2003), which employs a vignette design, with OpenAI's GPT 3.5 and GPT 4. We find that both LLMs have similar semantic understanding of the distinction between causal and moral codes as humans and can reliably distinguish between them. When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, exhibit a tendency to overestimate the extent of causal disagreement and underestimate the extent of moral disagreement in the moral misalignment condition. This tendency is especially pronounced for GPT 4 when using a proximate scale that relies on concrete language specific to an issue. GPT 3.5 does not perform as well as GPT 4 or humans when using either the proximate or the distal scale. The study provides a first test of the potential for using LLMs to mediate conflict by diagnosing the root of disagreements in causal and evaluative codes.

Title: Lorentzian Residual Neural Networks

Authors: Neil He, Menglin Yang, Rex Ying
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14695
Pdf URL: https://arxiv.org/pdf/2412.14695
Copy Paste: [[2412.14695]] Lorentzian Residual Neural Networks(https://arxiv.org/abs/2412.14695)
Keywords: robust, transformer
Abstract: Hyperbolic neural networks have emerged as a powerful tool for modeling hierarchical data structures prevalent in real-world datasets. Notably, residual connections, which facilitate the direct flow of information across layers, have been instrumental in the success of deep neural networks. However, current methods for constructing hyperbolic residual networks suffer from limitations such as increased model complexity, numerical instability, and errors due to multiple mappings to and from the tangent space. To address these limitations, we introduce LResNet, a novel Lorentzian residual neural network based on the weighted Lorentzian centroid in the Lorentz model of hyperbolic geometry. Our method enables the efficient integration of residual connections in Lorentz hyperbolic neural networks while preserving their hierarchical representation capabilities. We demonstrate that our method can theoretically derive previous methods while offering improved stability, efficiency, and effectiveness. Extensive experiments on both graph and vision tasks showcase the superior performance and robustness of our method compared to state-of-the-art Euclidean and hyperbolic alternatives. Our findings highlight the potential of LResNet for building more expressive neural networks in hyperbolic embedding space as a generally applicable method to multiple architectures, including CNNs, GNNs, and graph Transformers.
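
Editor's illustration (not from the paper): the abstract says the residual connection is built on the weighted Lorentzian centroid. A minimal sketch of that operation in its standard textbook form follows, assuming points on the hyperboloid with constant negative curvature K, so that <x, x>_L = 1/K. The function names, the fixed weights, and the `lift` helper are assumptions for illustration; the paper's exact parameterization may differ.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift(spatial, curvature=-1.0):
    """Lift spatial coordinates onto the hyperboloid <x, x>_L = 1/K (hypothetical helper)."""
    time = math.sqrt(sum(v * v for v in spatial) - 1.0 / curvature)
    return [time] + list(spatial)


def weighted_lorentz_centroid(points, weights, curvature=-1.0):
    """Weighted sum of points, rescaled so the result lies back on the hyperboloid."""
    dim = len(points[0])
    summed = [sum(w * p[i] for p, w in zip(points, weights)) for i in range(dim)]
    norm = math.sqrt(abs(lorentz_inner(summed, summed)))
    scale = math.sqrt(-curvature) * norm
    return [v / scale for v in summed]


# Residual-style combination of an input x and a layer output f_x (illustrative values).
x = lift([0.3, -0.2])
f_x = lift([1.0, 0.5])
z = weighted_lorentz_centroid([x, f_x], weights=[1.0, 0.5])
```

Because the result is a rescaled linear combination, no mapping to and from the tangent space is involved, which is the kind of overhead the abstract says earlier approaches incur.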

Title: Event-assisted 12-stop HDR Imaging of Dynamic Scene

Authors: Shi Guo, Zixuan Chen, Ziran Zhang, Yutian Chen, Gangwei Xu, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14705
Pdf URL: https://arxiv.org/pdf/2412.14705
Copy Paste: [[2412.14705]] Event-assisted 12-stop HDR Imaging of Dynamic Scene(https://arxiv.org/abs/2412.14705)
Keywords: diffusion
Abstract: High dynamic range (HDR) imaging is a crucial task in computational photography, which captures details across diverse lighting conditions. Traditional HDR fusion methods face limitations in dynamic scenes with extreme exposure differences, as aligning low dynamic range (LDR) frames becomes challenging due to motion and brightness variation. In this work, we propose a novel 12-stop HDR imaging approach for dynamic scenes, leveraging a dual-camera system with an event camera and an RGB camera. The event camera provides temporally dense, high dynamic range signals that improve alignment between LDR frames with large exposure differences, reducing ghosting artifacts caused by motion. Also, a real-world finetuning strategy is proposed to increase the generalization of the alignment module on real-world events. Additionally, we introduce a diffusion-based fusion module that incorporates image priors from pre-trained diffusion models to address artifacts in high-contrast regions and minimize errors from the alignment process. To support this work, we developed the ESHDR dataset, the first dataset for 12-stop HDR imaging with synchronized event signals, and validated our approach on both simulated and real-world data. Extensive experiments demonstrate that our method achieves state-of-the-art performance, successfully extending HDR imaging to 12 stops in dynamic scenes.

Title: EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

Authors: Jianrong Zhang, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14706
Pdf URL: https://arxiv.org/pdf/2412.14706
Copy Paste: [[2412.14706]] EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space(https://arxiv.org/abs/2412.14706)
Keywords: diffusion
Abstract: Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.

Title: Holistic Adversarially Robust Pruning

Authors: Qi Zhao, Christian Wressnegger
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14714
Pdf URL: https://arxiv.org/pdf/2412.14714
Copy Paste: [[2412.14714]] Holistic Adversarially Robust Pruning(https://arxiv.org/abs/2412.14714)
Keywords: robust
Abstract: Neural networks can be drastically shrunk in size by removing redundant parameters. While crucial for the deployment on resource-constrained hardware, oftentimes, compression comes with a severe drop in accuracy and lack of adversarial robustness. Despite recent advances, counteracting both aspects has only succeeded for moderate compression rates so far. We propose a novel method, HARP, that copes with aggressive pruning significantly better than prior work. For this, we consider the network holistically. We learn a global compression strategy that optimizes how many parameters (compression rate) and which parameters (scoring connections) to prune specific to each layer individually. Our method fine-tunes an existing model with dynamic regularization that follows a step-wise incremental function balancing the different objectives. It starts by favoring robustness before shifting focus on reaching the target compression rate and only then handles the objectives equally. The learned compression strategies allow us to maintain the pre-trained model's natural accuracy and its adversarial robustness for a reduction by 99% of the network's original size. Moreover, we observe a crucial influence of non-uniform compression across layers.

Title: FROC: Building Fair ROC from a Trained Classifier

Authors: Avyukta Manjunatha Vummintala, Shantanu Das, Sujit Gujar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14724
Pdf URL: https://arxiv.org/pdf/2412.14724
Copy Paste: [[2412.14724]] FROC: Building Fair ROC from a Trained Classifier(https://arxiv.org/abs/2412.14724)
Keywords: protect, fair
Abstract: This paper considers the problem of fair probabilistic binary classification with binary protected groups. The classifier assigns scores, and a practitioner predicts labels using a certain cut-off threshold based on the desired trade-off between false positives vs. false negatives. It derives these thresholds from the ROC of the classifier. The resultant classifier may be unfair to one of the two protected groups in the dataset. It is desirable that no matter what threshold the practitioner uses, the classifier should be fair to both the protected groups; that is, the $\mathcal{L}_p$ norm between FPRs and TPRs of both the protected groups should be at most $\varepsilon$. We call such fairness on ROCs of both the protected attributes $\varepsilon_p$-Equalized ROC. Given a classifier not satisfying $\varepsilon_1$-Equalized ROC, we aim to design a post-processing method to transform the given (potentially unfair) classifier's output (score) to a suitable randomized yet fair classifier. That is, the resultant classifier must satisfy $\varepsilon_1$-Equalized ROC. First, we introduce a threshold query model on the ROC curves for each protected group. The resulting classifier is bound to face a reduction in AUC. With the proposed query model, we provide a rigorous theoretical analysis of the minimal AUC loss to achieve $\varepsilon_1$-Equalized ROC. To achieve this, we design a linear time algorithm, namely FROC, to transform a given classifier's output to a probabilistic classifier that satisfies $\varepsilon_1$-Equalized ROC. We prove that under certain theoretical conditions, FROC achieves the theoretical optimal guarantees. We also study the performance of our FROC on multiple real-world datasets with many trained classifiers.

Title: Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data

Authors: Fabian Sven Karst, Sook-Yee Chong, Abigail A. Antenor, Enyu Lin, Mahei Manhai Li, Jan Marco Leimeister
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14730
Pdf URL: https://arxiv.org/pdf/2412.14730
Copy Paste: [[2412.14730]] Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data(https://arxiv.org/abs/2412.14730)
Keywords: privacy, diffusion, generative
Abstract: The banking sector faces challenges in using deep learning due to data sensitivity and regulatory constraints, but generative AI may offer a solution. Thus, this study identifies effective algorithms for generating synthetic financial transaction data and evaluates five leading models - Conditional Tabular Generative Adversarial Networks (CTGAN), DoppelGANger (DGAN), Wasserstein GAN, Financial Diffusion (FinDiff), and Tabular Variational AutoEncoders (TVAE) - across five criteria: fidelity, synthesis quality, efficiency, privacy, and graph structure. While none of the algorithms is able to replicate the real data's graph structure, each excels in specific areas: DGAN is ideal for privacy-sensitive tasks, FinDiff and TVAE excel in data replication and augmentation, and CTGAN achieves a balance across all five criteria, making it suitable for general applications with moderate privacy concerns. As a result, our findings offer valuable insights for choosing the most suitable algorithm.

Title: On Verbalized Confidence Scores for LLMs

Authors: Daniel Yang, Yao-Hung Hubert Tsai, Makoto Yamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14737
Pdf URL: https://arxiv.org/pdf/2412.14737
Copy Paste: [[2412.14737]] On Verbalized Confidence Scores for LLMs(https://arxiv.org/abs/2412.14737)
Keywords: large language model
Abstract: The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust in their responses, but also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at this https URL.
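
Editor's illustration (not from the paper): one common way to check whether confidence scores are well calibrated is the expected calibration error (ECE), which bins predictions by stated confidence and compares average confidence with empirical accuracy in each bin. The sketch below assumes verbalized confidences have already been parsed into floats in [0, 1]; the function name, bin count, and sample data are assumptions, and the abstract does not say which metrics the benchmark uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Illustrative data: the model states 0.9 confidence four times and is right three times.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
ece_perfect = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value indicates that stated confidence tracks accuracy more closely; the choice of bin count affects the estimate, especially on small samples.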

Title: Boosting GNN Performance via Training Sample Selection Based on Adversarial Robustness Evaluation

Authors: Yongyu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14738
Pdf URL: https://arxiv.org/pdf/2412.14738
Copy Paste: [[2412.14738]] Boosting GNN Performance via Training Sample Selection Based on Adversarial Robustness Evaluation(https://arxiv.org/abs/2412.14738)
Keywords: robust
Abstract: Graph Neural Networks (GNNs) have established themselves as one of the most powerful neural network architectures, excelling in leveraging graph topology and node features for various tasks. However, GNNs are inherently vulnerable to noise in their inputs. Such noise can significantly degrade their performance. To address this challenge, we propose a novel approach that employs adversarial robustness evaluation techniques to identify nodes in the graph that are most susceptible to noise. By selecting and constructing a training set composed of these particularly noise-prone nodes, we then use them to train a Graph Convolutional Network (GCN). Our experimental results demonstrate that this strategy leads to substantial improvements in the GCN's performance.

Title: Query pipeline optimization for cancer patient question answering systems

Authors: Maolin He, Rena Gao, Mike Conway, Brian E. Chapman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14751
Pdf URL: https://arxiv.org/pdf/2412.14751
Copy Paste: [[2412.14751]] Query pipeline optimization for cancer patient question answering systems(https://arxiv.org/abs/2412.14751)
Keywords: robust, large language model, segmentation
Abstract: Retrieval-augmented generation (RAG) mitigates hallucination in Large Language Models (LLMs) by using query pipelines to retrieve relevant external information and grounding responses in retrieved knowledge. However, query pipeline optimization for cancer patient question-answering (CPQA) systems requires separately optimizing multiple components with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems, utilizing public biomedical databases like PubMed and PubMed Central. Our optimization includes: (1) document retrieval, utilizing a comparative analysis of NCBI resources and introducing Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom-developed dataset tailored for cancer-related inquiries, our optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.

Title: FLAMe: Federated Learning with Attention Mechanism using Spatio-Temporal Keypoint Transformers for Pedestrian Fall Detection in Smart Cities

Authors: Byeonghun Kim, Byeongjoon Noh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14768
Pdf URL: https://arxiv.org/pdf/2412.14768
Copy Paste: [[2412.14768]] FLAMe: Federated Learning with Attention Mechanism using Spatio-Temporal Keypoint Transformers for Pedestrian Fall Detection in Smart Cities(https://arxiv.org/abs/2412.14768)
Keywords: privacy, robust, federate, transformer
Abstract: In smart cities, detecting pedestrian falls is a major challenge to ensure the safety and quality of life of citizens. In this study, we propose a novel fall detection system using FLAMe (Federated Learning with Attention Mechanism), a federated learning (FL) based algorithm. FLAMe trains around important keypoint information and only transmits the trained important weights to the server, reducing communication costs and preserving data privacy. Furthermore, the lightweight keypoint transformer model is integrated into the FL framework to effectively learn spatio-temporal features. We validated the experiment using 22,672 video samples from the "Fall Accident Risk Behavior Video-Sensor Pair data" dataset from AI-Hub. As a result of the experiment, the FLAMe-based system achieved an accuracy of 94.02% with about 190,000 transmission parameters, maintaining performance similar to that of existing centralized learning while maximizing efficiency by reducing communication costs by about 40% compared to the existing FL algorithm, FedAvg. Therefore, the FLAMe algorithm has demonstrated that it provides robust performance in the distributed environment of smart cities and is a practical and effective solution for public safety.

Title: PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children

Authors: Yiqun Zhang, Xiaocui Yang, Xiaobai Li, Siyuan Yu, Yi Luan, Shi Feng, Daling Wang, Yifei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14769
Pdf URL: https://arxiv.org/pdf/2412.14769
Copy Paste: [[2412.14769]] PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children(https://arxiv.org/abs/2412.14769)
Keywords: extraction, large language model
Abstract: Left-behind children (LBCs), numbering over 66 million in China, face severe mental health challenges due to parental migration for work. Early screening and identification of at-risk LBCs is crucial, yet challenging due to the severe shortage of mental health professionals, especially in rural areas. While the House-Tree-Person (HTP) test shows higher child participation rates, its requirement for expert interpretation limits its application in resource-scarce regions. To address this challenge, we propose PsyDraw, a multi-agent system based on Multimodal Large Language Models that assists mental health professionals in analyzing HTP drawings. The system employs specialized agents for feature extraction and psychological interpretation, operating in two stages: comprehensive feature analysis and professional report generation. Evaluation of HTP drawings from 290 primary school students reveals that 71.03% of the analyses achieved High Consistency with professional evaluations, 26.21% Moderate Consistency and only 2.41% Low Consistency. The system identified 31.03% of cases requiring professional attention, demonstrating its effectiveness as a preliminary screening tool. Currently deployed in pilot schools, PsyDraw shows promise in supporting mental health professionals, particularly in resource-limited areas, while maintaining high professional standards in psychological assessment.

Title: ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine

Authors: Rabee Qasem, Mohannad Hendi, Banan Tantour
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14771
Pdf URL: https://arxiv.org/pdf/2412.14771
Copy Paste: [[2412.14771]] ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine(https://arxiv.org/abs/2412.14771)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in diverse domains, yet their application in the legal sector, particularly in low-resource contexts, remains limited. This study addresses the challenges of adapting LLMs to the Palestinian legal domain, where political instability, fragmented legal frameworks, and limited AI resources hinder effective machine-learning applications. We present a fine-tuned model based on a quantized version of Llama-3.2-1B-Instruct, trained on a synthetic data set derived from Palestinian legal texts. Using smaller-scale models and strategically generated question-answer pairs, we achieve a cost-effective, locally sustainable solution that provides accurate and contextually relevant legal guidance. Our experiments demonstrate promising performance on various query types, ranging from yes/no questions and narrative explanations to complex legal differentiations, while highlighting areas for improvement, such as handling calculation-based inquiries and structured list formatting. This work provides a pathway for the deployment of AI-driven legal assistance tools tailored to the needs of resource-constrained environments.

Title: Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

Authors: Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14780
Pdf URL: https://arxiv.org/pdf/2412.14780
Copy Paste: [[2412.14780]] Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning(https://arxiv.org/abs/2412.14780)
Keywords: large language model
Abstract: When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens (e.g., those governing output format) - differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).

Title: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Authors: Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.14803
Pdf URL: https://arxiv.org/pdf/2412.14803
Copy Paste: [[2412.14803]] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations(https://arxiv.org/abs/2412.14803)
Keywords: diffusion
Abstract: Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies utilize pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, which are trained on two-image contrastive learning or single-image reconstruction, cannot perfectly capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human or robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1% relative improvement in the Calvin ABC-D benchmark compared to the previous state-of-the-art and delivers a 28.8% increase in success rates for complex real-world dexterous manipulation tasks.

Title: ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis

Authors: Zeao Tu, Xiangdi Meng, Yu He, Zihan Yao, Tianyu Qi, Jun Liu, Ming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14809
Pdf URL: https://arxiv.org/pdf/2412.14809
Copy Paste: [[2412.14809]] ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis(https://arxiv.org/abs/2412.14809)
Keywords: interpretability, large language model
Abstract: Large language models (LLMs) have shown remarkable effectiveness across various domains, with data augmentation methods utilizing GPT for synthetic data generation becoming prevalent. However, the quality and utility of augmented data remain questionable, and current methods lack clear metrics for evaluating data characteristics. To address these challenges, we propose ResoFilter, a novel method that integrates models, data, and tasks to refine datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter features for data selection, offering improved interpretability by representing data characteristics through model weights. Our experiments demonstrate that ResoFilter achieves comparable results to full-scale fine-tuning using only half the data in mathematical tasks and exhibits strong generalization across different models and domains. This method provides valuable insights for constructing synthetic datasets and evaluating high-quality data, offering a promising solution for enhancing data augmentation techniques and improving training dataset quality for LLMs. For reproducibility, we will release our code and data upon acceptance.

Title: MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data

Authors: Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14810
Pdf URL: https://arxiv.org/pdf/2412.14810
Copy Paste: [[2412.14810]] MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data(https://arxiv.org/abs/2412.14810)
Keywords: robust, transformer
Abstract: In healthcare, the integration of multimodal data is pivotal for developing comprehensive diagnostic and predictive models. However, managing missing data remains a significant challenge in real-world applications. We introduce MARIA (Multimodal Attention Resilient to Incomplete datA), a novel transformer-based deep learning model designed to address these challenges through an intermediate fusion strategy. Unlike conventional approaches that depend on imputation, MARIA utilizes a masked self-attention mechanism, which processes only the available data without generating synthetic values. This approach enables it to effectively handle incomplete datasets, enhancing robustness and minimizing biases introduced by imputation methods. We evaluated MARIA against 10 state-of-the-art machine learning and deep learning models across 8 diagnostic and prognostic tasks. The results demonstrate that MARIA outperforms existing methods in terms of performance and resilience to varying levels of data incompleteness, underscoring its potential for critical healthcare applications.
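The general idea of attending only to available inputs can be illustrated with a masked softmax: positions flagged as missing receive zero attention weight, and the remaining weights are renormalized. This is a generic sketch of attention masking, not MARIA's actual architecture; the function name and the toy scores are assumptions.

```python
import math
from typing import List


def masked_attention_weights(scores: List[float],
                             available: List[bool]) -> List[float]:
    """Softmax over attention scores, restricted to non-missing positions."""
    if len(scores) != len(available) or not any(available):
        raise ValueError("need matching lengths and at least one available position")
    # Subtract the largest available score for numerical stability.
    peak = max(s for s, a in zip(scores, available) if a)
    exps = [math.exp(s - peak) if a else 0.0
            for s, a in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: the middle feature is missing, so its (large) score is ignored.
weights = masked_attention_weights([1.0, 5.0, 1.0], [True, False, True])
```

No value is imputed for the missing position; it simply drops out of the weighted sum.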

Title: Non-intrusive and Unconstrained Keystroke Inference in VR Platforms via Infrared Side Channel

Authors: Tao Ni, Yuefeng Du, Qingchuan Zhao, Cong Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.14815
Pdf URL: https://arxiv.org/pdf/2412.14815
Copy Paste: [[2412.14815]] Non-intrusive and Unconstrained Keystroke Inference in VR Platforms via Infrared Side Channel(https://arxiv.org/abs/2412.14815)
Keywords: security, attack
Abstract: Virtual Reality (VR) technologies are increasingly employed in numerous applications across various areas. Therefore, it is essential to ensure the security of interactions between users and VR devices. In this paper, we disclose a new side-channel leakage in the constellation tracking system of mainstream VR platforms, where the infrared (IR) signals emitted from the VR controllers for controller-headset interactions can be maliciously exploited to reconstruct unconstrained input keystrokes on the virtual keyboard non-intrusively. We propose a novel keystroke inference attack named VRecKey to demonstrate the feasibility and practicality of this novel infrared side channel. Specifically, VRecKey leverages a customized 2D IR sensor array to intercept ambient IR signals emitted from VR controllers and subsequently infers (i) character-level key presses on the virtual keyboard and (ii) word-level keystrokes along with their typing trajectories. We extensively evaluate the effectiveness of VRecKey with two commercial VR devices, and the results indicate that it can achieve over 94.2% and 90.5% top-3 accuracy in inferring character-level and word-level keystrokes with varying lengths, respectively. In addition, empirical results show that VRecKey is resilient to several practical impact factors and presents effectiveness in various real-world scenarios, which provides a complementary and orthogonal attack surface for the exploration of keystroke inference attacks in VR platforms.

Title: Explainable Tampered Text Detection via Multimodal Large Models

Authors: Chenfan Qu, Jian Liu, Haoxing Chen, Baihan Yu, Jingjing Liu, Weiqiang Wang, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14816
Pdf URL: https://arxiv.org/pdf/2412.14816
Copy Paste: [[2412.14816]] Explainable Tampered Text Detection via Multimodal Large Models(https://arxiv.org/abs/2412.14816)
Keywords: security
Abstract: Recently, tampered text detection has attracted increasing attention due to its essential role in information security. Although existing methods can detect the tampered text region, the interpretation of such detection remains unclear, making the prediction unreliable. To address this black-box problem, we propose to explain the basis of tampered text detection with natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD, which contains both pixel-level annotations indicating the tampered text region and natural language annotations describing the anomaly of the tampered text. Multiple methods are employed to improve the quality of the proposed data. For example, a fused mask prompt is proposed to reduce confusion when querying GPT4o to generate anomaly descriptions. By weighting the input image with the mask annotation, the tampered region can be clearly indicated and the content in and around the tampered region can also be preserved. We also propose prompting GPT4o to recognize tampered texts and filtering out the responses with low OCR accuracy, which can effectively improve annotation quality in an automatic manner. To further improve explainable tampered text detection, we propose a simple yet effective model called TTD, which benefits from improved fine-grained perception by paying attention to the suspected region with auxiliary reference grounding query. Extensive experiments on both the ETTD dataset and the public dataset have verified the effectiveness of the proposed methods. In-depth analysis is also provided to inspire further research. The dataset and code will be made publicly available.

Title: Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization

Authors: Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14819
Pdf URL: https://arxiv.org/pdf/2412.14819
Copy Paste: [[2412.14819]] Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization(https://arxiv.org/abs/2412.14819)
Keywords: robust
Abstract: Cross-View Geo-Localization (CVGL) involves determining the localization of drone images by retrieving the most similar GPS-tagged satellite images. However, the imaging gaps between platforms are often significant and the variations in viewpoints are substantial, which limits the ability of existing methods to effectively associate cross-view features and extract consistent and invariant characteristics. Moreover, existing methods often overlook the problem of increased computational and storage requirements when improving model performance. To handle these limitations, we propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN). The MEAN network uses a progressive multi-level enhancement strategy, global-to-local associations, and cross-domain alignment, enabling feature communication across levels. This allows MEAN to effectively connect features at different levels and learn robust cross-view consistent mappings and modality-invariant features. Moreover, MEAN adopts a shallow backbone network combined with a lightweight branch design, effectively reducing parameter count and computational complexity. Experimental results on the University-1652 and SUES-200 datasets demonstrate that MEAN reduces parameter count by 62.17% and computational complexity by 70.99% compared to state-of-the-art models, while maintaining competitive or even superior performance. The codes will be released soon.

Title: PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation

Authors: Shoumeng Qiu, Xinrun Li, XiangYang Xue, Jian Pu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14821
Pdf URL: https://arxiv.org/pdf/2412.14821
Copy Paste: [[2412.14821]] PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation(https://arxiv.org/abs/2412.14821)
Keywords: transformer, segmentation
Abstract: Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive point-based interactions, arising from the lack of fixed correspondences between views such as range view and Bird's-Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV-only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (170$\times$ speedup) than conventional point-based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives. To enhance scene understanding while maintaining inference efficiency, we also introduce a hybrid Transformer-CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV-based fusion for LiDAR segmentation. Code is available at this https URL.

Title: Mention Attention for Pronoun Translation

Authors: Gongbo Tang, Christian Hardmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14829
Pdf URL: https://arxiv.org/pdf/2412.14829
Copy Paste: [[2412.14829]] Mention Attention for Pronoun Translation(https://arxiv.org/abs/2412.14829)
Keywords: transformer
Abstract: Most pronouns are referring expressions, so computers need to resolve what they refer to, and pronoun usage diverges across languages. Thus, dealing with these divergences and translating pronouns is a challenge in machine translation. Mentions are referring candidates of pronouns and have closer relations with pronouns compared to general tokens. We assume that extracting additional mention features can help pronoun translation. Therefore, we introduce an additional mention attention module in the decoder to pay extra attention to source mentions but not non-mention tokens. Our mention attention module not only extracts features from source mentions, but also considers target-side context, which benefits pronoun translation. In addition, we introduce two mention classifiers to train models to recognize mentions, whose outputs guide the mention attention. We conduct experiments on the WMT17 English-German translation task, and evaluate our models on general translation and pronoun translation, using BLEU, APT, and contrastive evaluation metrics. Our proposed model outperforms the baseline Transformer model in terms of APT and BLEU scores. This confirms our hypothesis that we can improve pronoun translation by paying additional attention to source mentions, and shows that the additional modules do not have a negative effect on general translation quality.

Title: Federated Heavy Hitter Analytics with Local Differential Privacy

Authors: Yuemin Zhang, Qingqing Ye, Haibo Hu
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2412.14832
Pdf URL: https://arxiv.org/pdf/2412.14832
Copy Paste: [[2412.14832]] Federated Heavy Hitter Analytics with Local Differential Privacy(https://arxiv.org/abs/2412.14832)
Keywords: privacy, federate
Abstract: Federated heavy hitter analytics enables service providers to better understand the preferences of cross-party users by analyzing the most frequent items. As with federated learning, it faces challenges of privacy concerns, statistical heterogeneity, and expensive communication. Local differential privacy (LDP), as the de facto standard for privacy-preserving data collection, solves the privacy challenge by letting each user perturb her data locally and report the sanitized version. However, in federated settings, applying LDP complicates the other two challenges, due to the utility lost to the injected LDP noise or the increased communication/computation costs of the perturbation mechanism. To tackle these problems, we propose a novel target-aligning prefix tree mechanism satisfying $\epsilon$-LDP, for federated heavy hitter analytics. In particular, we propose an adaptive extension strategy to address the inconsistencies between covering necessary prefixes and estimating heavy hitters within a party to enhance the utility. We also present a consensus-based pruning strategy that utilizes noisy prior knowledge from other parties to further align the inconsistency between finding heavy hitters in each party and providing reasonable frequency information to identify the global ones. To the best of our knowledge, our study is the first solution to federated heavy hitter analytics in a cross-party setting while satisfying the stringent $\epsilon$-LDP. Comprehensive experiments on both real-world and synthetic datasets confirm the effectiveness of our proposed mechanism.
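To make the local-perturbation idea concrete, the sketch below uses generalized randomized response, a standard $\epsilon$-LDP frequency oracle: each user reports the true item with probability $p = e^{\epsilon}/(e^{\epsilon}+k-1)$ and otherwise a uniformly chosen different item, and the collector debiases the observed counts. It is not the prefix tree mechanism proposed in the paper; the function names, the four-item domain, and the synthetic data are assumptions.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence


def grr_perturb(value: str, domain: Sequence[str], epsilon: float,
                rng: random.Random) -> str:
    """Report the true value with probability p, else a different item uniformly."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])


def grr_estimate(reports: List[str], domain: Sequence[str],
                 epsilon: float) -> Dict[str, float]:
    """Unbiased frequency estimates from the perturbed reports."""
    k, n = len(domain), len(reports)
    e = math.exp(epsilon)
    p, q = e / (e + k - 1), 1.0 / (e + k - 1)
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}


# Synthetic data (hypothetical): item "a" is the only heavy hitter.
rng = random.Random(0)
domain = ["a", "b", "c", "d"]
true_items = ["a"] * 14000 + ["b"] * 2000 + ["c"] * 2000 + ["d"] * 2000
reports = [grr_perturb(x, domain, 1.0, rng) for x in true_items]
estimates = grr_estimate(reports, domain, 1.0)
```

The estimates always sum to one, and with this many reports the heavy hitter stands out clearly despite the noise; smaller $\epsilon$ or a larger domain makes the estimates noisier.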

Title: Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Authors: Hao Huang, Yujie Lin, Siyu Chen, Haiyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14833
Pdf URL: https://arxiv.org/pdf/2412.14833
Copy Paste: [[2412.14833]] Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition(https://arxiv.org/abs/2412.14833)
Keywords: extraction
Abstract: Skeleton-based action recognition using GCNs has achieved remarkable performance, but recognizing ambiguous actions, such as "waving" and "saluting", remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and TCNs, where spatial and temporal features are extracted independently, leading to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details, resulting in the loss of crucial global context, which further complicates the task of differentiating ambiguous actions. To address these challenges, we propose a lightweight plug-and-play module called Synchronized and Fine-grained Head (SF-Head), inserted between GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between the two types of features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA), with a Feature Consistency Loss (F-CL), which aligns the aggregated feature with its original spatial-temporal feature. This aggregation step effectively combines both global context and local details. Experimental results on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets demonstrate significant improvements in distinguishing ambiguous actions. Our code will be made available at this https URL.

Title: Progressive Multimodal Reasoning via Active Retrieval

Authors: Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2412.14835
Pdf URL: https://arxiv.org/pdf/2412.14835
Copy Paste: [[2412.14835]] Progressive Multimodal Reasoning via Active Retrieval(https://arxiv.org/abs/2412.14835)
Keywords: large language model
Abstract: Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and finding effective ways to enhance their performance in such scenarios remains an unresolved issue. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). Our approach begins with the development of a unified retrieval module that retrieves key supporting insights for solving complex reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in automated multimodal reasoning verification, we employ the MCTS algorithm combined with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional beam search sampling to improve the diversity and reliability of the reasoning space. Additionally, we introduce a process reward model that aligns progressively to support the automatic verification of multimodal reasoning tasks. Experimental results across three complex multimodal reasoning benchmarks confirm the effectiveness of the AR-MCTS framework in enhancing the performance of various multimodal models. Further analysis demonstrates that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.

Title: Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas

Authors: Pietro Bernardelle, Leon Fröhling, Stefano Civelli, Riccardo Lunardi, Kevin Roiter, Gianluca Demartini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14843
Pdf URL: https://arxiv.org/pdf/2412.14843
Copy Paste: [[2412.14843]] Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas(https://arxiv.org/abs/2412.14843)
Keywords: large language model
Abstract: The analysis of political biases in large language models (LLMs) has primarily examined these systems as single entities with fixed viewpoints. While various methods exist for measuring such biases, the impact of persona-based prompting on LLMs' political orientation remains unexplored. In this work we leverage PersonaHub, a collection of synthetic persona descriptions, to map the political distribution of persona-based prompted LLMs using the Political Compass Test (PCT). We then examine whether these initial compass distributions can be manipulated through explicit ideological prompting towards diametrically opposed political orientations: right-authoritarian and left-libertarian. Our experiments reveal that synthetic personas predominantly cluster in the left-libertarian quadrant, with models demonstrating varying degrees of responsiveness when prompted with explicit ideological descriptors. While all models demonstrate significant shifts towards right-authoritarian positions, they exhibit more limited shifts towards left-libertarian positions, suggesting an asymmetric response to ideological manipulation that may reflect inherent biases in model training.

Title: A Survey of RWKV

Authors: Zhiyuan Li, Tingyu Xia, Yi Chang, Yuan Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14847
Pdf URL: https://arxiv.org/pdf/2412.14847
Copy Paste: [[2412.14847]] A Survey of RWKV(https://arxiv.org/abs/2412.14847)
Keywords: robust, transformer
Abstract: The Receptance Weighted Key Value (RWKV) model offers a novel alternative to the Transformer architecture, merging the benefits of recurrent and attention-based systems. Unlike conventional Transformers, which depend heavily on self-attention, RWKV adeptly captures long-range dependencies with minimal computational demands. By utilizing a recurrent framework, RWKV addresses some computational inefficiencies found in Transformers, particularly in tasks with long sequences. RWKV has recently drawn considerable attention for its robust performance across multiple domains. Despite its growing popularity, no systematic review of the RWKV model exists. This paper seeks to fill this gap as the first comprehensive review of the RWKV architecture, its core principles, and its varied applications, such as natural language generation, natural language understanding, and computer vision. We assess how RWKV compares to traditional Transformer models, highlighting its capability to manage long sequences efficiently and lower computational costs. Furthermore, we explore the challenges RWKV encounters and propose potential directions for future research and advancement. We consistently maintain the related open-source materials at: this https URL.

Title: DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis

Authors: Hongling Xu, Yice Zhang, Qianlong Wang, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14849
Pdf URL: https://arxiv.org/pdf/2412.14849
Copy Paste: [[2412.14849]] DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2412.14849)
Keywords: large language model
Abstract: Recently developed large language models (LLMs) have presented promising new avenues to address data scarcity in low-resource scenarios. In few-shot aspect-based sentiment analysis (ABSA), previous efforts have explored data augmentation techniques, which prompt LLMs to generate new samples by modifying existing ones. However, these methods fail to produce adequately diverse data, impairing their effectiveness. Besides, some studies apply in-context learning for ABSA by using specific instructions and a few selected examples as prompts. Though promising, LLMs often yield labels that deviate from task requirements. To overcome these limitations, we propose DS$^2$-ABSA, a dual-stream data synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize data from two complementary perspectives: key-point-driven and instance-driven, which effectively generate diverse and high-quality ABSA samples in low-resource settings. Furthermore, a label refinement module is integrated to improve the synthetic labels. Extensive experiments demonstrate that DS$^2$-ABSA significantly outperforms previous few-shot ABSA solutions and other LLM-oriented data generation methods.

Title: Position: A taxonomy for reporting and describing AI security incidents

Authors: Lukas Bieringer, Kevin Paeth, Andreas Wespi, Kathrin Grosse
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.14855
Pdf URL: https://arxiv.org/pdf/2412.14855
Copy Paste: [[2412.14855]] Position: A taxonomy for reporting and describing AI security incidents(https://arxiv.org/abs/2412.14855)
Keywords: security, attack
Abstract: AI systems are vulnerable to attacks, and corresponding AI security incidents have been described. Although a collection of safety incidents around AI will become a regulatory requirement, there is no proposal to collect AI security incidents. In this position paper, we argue that a proposal should be made, taking into account the interests and needs of different stakeholders: industry, providers, users, and researchers. We thus attempt to close this gap and propose a taxonomy alongside its requirements like machine readability and link-ability with existing databases. We aim to spark and enable discussion of which information is feasible, necessary, and possible to report and share within and outside organizations using AI.

Title: Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling

Authors: Junyi Li, Hwee Tou Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14860
Pdf URL: https://arxiv.org/pdf/2412.14860
Copy Paste: [[2412.14860]] Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling(https://arxiv.org/abs/2412.14860)
Keywords: large language model
Abstract: Despite their outstanding capabilities, large language models (LLMs) are prone to hallucination and producing factually incorrect information. This challenge has spurred efforts in attributed text generation, which prompts LLMs to generate content with supporting evidence. In this paper, we propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search. Specifically, we propose Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the self-reflection capability of LLMs to reflect on the intermediate states of MCTS for guiding the tree expansion process. To provide reliable and comprehensive feedback, we introduce Progress Reward Models to measure the progress of tree search from the root to the current state from two aspects, i.e., generation and attribution progress. We conduct extensive experiments on three datasets and the results show that our approach significantly outperforms baseline approaches.

Title: Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering

Authors: Imed Keraghel, Mohamed Nadif
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14867
Pdf URL: https://arxiv.org/pdf/2412.14867
Copy Paste: [[2412.14867]] Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering(https://arxiv.org/abs/2412.14867)
Keywords: large language model
Abstract: Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.

Title: Large-scale School Mapping using Weakly Supervised Deep Learning for Universal School Connectivity

Authors: Isabelle Tingzon, Utku Can Ozturk, Ivan Dotu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14870
Pdf URL: https://arxiv.org/pdf/2412.14870
Copy Paste: [[2412.14870]] Large-scale School Mapping using Weakly Supervised Deep Learning for Universal School Connectivity(https://arxiv.org/abs/2412.14870)
Keywords: transformer
Abstract: Improving global school connectivity is critical for ensuring inclusive and equitable quality education. To reliably estimate the cost of connecting schools, governments and connectivity providers require complete and accurate school location data - a resource that is often scarce in many low- and middle-income countries. To address this challenge, we propose a cost-effective, scalable approach to locating schools in high-resolution satellite images using weakly supervised deep learning techniques. Our best models, which combine vision transformers and convolutional neural networks, achieve AUPRC values above 0.96 across 10 pilot African countries. Leveraging explainable AI techniques, our approach can approximate the precise geographical coordinates of the school locations using only low-cost, classification-level annotations. To demonstrate the scalability of our method, we generate nationwide maps of school location predictions in African countries and present a detailed analysis of our results, using Senegal as our case study. Finally, we demonstrate the immediate usability of our work by introducing an interactive web mapping tool to streamline human-in-the-loop model validation efforts by government partners. This work successfully showcases the real-world utility of deep learning and satellite images for planning regional infrastructure and accelerating universal school connectivity.

Title: Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Authors: Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14880
Pdf URL: https://arxiv.org/pdf/2412.14880
Copy Paste: [[2412.14880]] Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering(https://arxiv.org/abs/2412.14880)
Keywords: large language model
Abstract: The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing these images to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information in QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and helps improve retrieval. To better integrate retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy for calculating both sentence-level and word-level similarity scores, to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.

Title: Diffusion priors for Bayesian 3D reconstruction from incomplete measurements

Authors: Julian L. Möbius, Michael Habeck
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14897
Pdf URL: https://arxiv.org/pdf/2412.14897
Copy Paste: [[2412.14897]] Diffusion priors for Bayesian 3D reconstruction from incomplete measurements(https://arxiv.org/abs/2412.14897)
Keywords: diffusion
Abstract: Many inverse problems are ill-posed and need to be complemented by prior information that restricts the class of admissible models. Bayesian approaches encode this information as prior distributions that impose generic properties on the model such as sparsity, non-negativity or smoothness. However, in the case of complex structured models such as images, graphs or three-dimensional (3D) objects, generic prior distributions tend to favor models that differ largely from those observed in the real world. Here we explore the use of diffusion models as priors that are combined with experimental data within a Bayesian framework. We use 3D point clouds to represent 3D objects such as household items or biomolecular complexes formed from proteins and nucleic acids. We train diffusion models that generate coarse-grained 3D structures at a medium resolution and integrate these with incomplete and noisy experimental data. To demonstrate the power of our approach, we focus on the reconstruction of biomolecular assemblies from cryo-electron microscopy (cryo-EM) images, which is an important inverse problem in structural biology. We find that posterior sampling with diffusion model priors allows for 3D reconstruction from very sparse, low-resolution and partial observations.

Title: MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models

Authors: Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Hunag, Yuhua Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14902
Pdf URL: https://arxiv.org/pdf/2412.14902
Copy Paste: [[2412.14902]] MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models(https://arxiv.org/abs/2412.14902)
Keywords: diffusion
Abstract: Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. Is it possible to make such models generate generic identities as simply as famous ones, e.g., by just using a name? In this paper, we explore the existence of a "Name Space", where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by the text embeddings of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in giving the generated image good identity consistency. Note that, like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, so the original generation capability of text-to-image models is well preserved. Moreover, by simply plugging in such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models. Project homepage: this https URL.

Title: Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation

Authors: Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Jian-Guang Lou, Bing Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14905
Pdf URL: https://arxiv.org/pdf/2412.14905
Copy Paste: [[2412.14905]] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation(https://arxiv.org/abs/2412.14905)
Keywords: large language model
Abstract: Large language models (LLMs) are susceptible to generating hallucinated information, despite the integration of retrieval-augmented generation (RAG). Parallel context extension (PCE) is a line of research attempting to effectively integrate parallel (unordered) contexts, but it still suffers from hallucinations when adapted to RAG scenarios. In this paper, we propose DePaC (Dehallucinating Parallel Context Extension), which alleviates the hallucination problem with context-aware negative training and information-calibrated aggregation. DePaC is designed to alleviate two types of in-context hallucination: fact fabrication (i.e., LLMs present claims that are not supported by the contexts) and fact omission (i.e., LLMs fail to present claims that can be supported by the contexts). Specifically, (1) for fact fabrication, we apply context-aware negative training, which fine-tunes the LLMs with negative supervision, thus explicitly guiding the LLMs to refuse to answer when contexts are not related to questions; (2) for fact omission, we propose information-calibrated aggregation, which prioritizes context windows with higher information increment from their contexts. The experimental results on nine RAG tasks demonstrate that DePaC significantly alleviates the two types of hallucination and consistently achieves better performance on these tasks.

Title: RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Authors: Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14922
Pdf URL: https://arxiv.org/pdf/2412.14922
Copy Paste: [[2412.14922]] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response(https://arxiv.org/abs/2412.14922)
Keywords: robust, large language model
Abstract: Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities in downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with inference-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT's exceptional performance in noisy scenarios.
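
The entropy-based selection step mentioned above can be pictured with a minimal sketch. The abstract does not specify how response entropy is computed or thresholded, so everything here is an assumption made for illustration: the helper names, the `token_probs` field, the use of mean per-token Shannon entropy, and the rule that lower entropy means a more reliable sample.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def response_entropy(token_distributions):
    """Mean per-token entropy over a whole response (an assumed definition)."""
    return sum(token_entropy(d) for d in token_distributions) / len(token_distributions)

def select_low_entropy(samples, threshold):
    """Keep samples whose response entropy is at or below the threshold.

    Each sample is a dict with a hypothetical 'token_probs' field holding
    one probability distribution per generated token.
    """
    return [s for s in samples if response_entropy(s["token_probs"]) <= threshold]

if __name__ == "__main__":
    samples = [
        {"id": "confident", "token_probs": [[1.0, 0.0], [0.9, 0.1]]},
        {"id": "uncertain", "token_probs": [[0.5, 0.5], [0.5, 0.5]]},
    ]
    kept = select_low_entropy(samples, threshold=0.5)
    print([s["id"] for s in kept])
```

The threshold is a free parameter in this sketch; in practice it would have to be tuned on held-out data.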

Title: Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark

Authors: Zhuoran Du, Shaodi You, Cheng Cheng, Shikui Wei
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14925
Pdf URL: https://arxiv.org/pdf/2412.14925
Copy Paste: [[2412.14925]] Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark(https://arxiv.org/abs/2412.14925)
Keywords: transformer
Abstract: A hyperspectral image (HSI) densely samples the world in both the space and frequency domains and is therefore more distinctive than an RGB image. Usually, an HSI needs to be calibrated to minimize the impact of various illumination conditions. The traditional way to calibrate an HSI utilizes a physical reference, which involves manual operations and occlusions and/or limits camera mobility. These limitations inspire this paper to automatically calibrate HSIs using a learning-based method. Towards this goal, a large-scale HSI calibration dataset is created, which has 765 high-quality HSI pairs covering diversified natural scenes and illuminations. The dataset is further expanded to 7650 pairs by combining with 10 different physically measured illuminations. A spectral illumination transformer (SIT) together with an illumination attention module is proposed. Extensive benchmarks demonstrate the SoTA performance of the proposed SIT. The benchmarks also indicate that low-light conditions are more challenging than normal conditions. The dataset and code are available online: this https URL

Title: TDCNet: Transparent Objects Depth Completion with CNN-Transformer Dual-Branch Parallel Network

Authors: Xianghui Fan, Chao Ye, Anping Deng, Xiaotian Wu, Mengyang Pan, Hang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14961
Pdf URL: https://arxiv.org/pdf/2412.14961
Copy Paste: [[2412.14961]] TDCNet: Transparent Objects Depth Completion with CNN-Transformer Dual-Branch Parallel Network(https://arxiv.org/abs/2412.14961)
Keywords: transformer
Abstract: The sensing and manipulation of transparent objects present a critical challenge in industrial and laboratory robotics. Conventional sensors face challenges in obtaining the full depth of transparent objects due to the refraction and reflection of light on their surfaces and their lack of visible texture. Previous research has attempted to obtain complete depth maps of transparent objects from RGB and damaged depth maps (collected by a depth sensor) using deep learning models. However, existing methods fail to fully utilize the original depth map, resulting in limited accuracy for depth completion. To solve this problem, we propose TDCNet, a novel dual-branch CNN-Transformer parallel network for transparent object depth completion. The proposed framework consists of two different branches: one extracts features from partial depth maps, while the other processes RGB-D images. Experimental results demonstrate that our model achieves state-of-the-art performance across multiple public datasets. Our code and the pre-trained model are publicly available at this https URL.

Title: IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Authors: Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, Wei Liu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14963
Pdf URL: https://arxiv.org/pdf/2412.14963
Copy Paste: [[2412.14963]] IDOL: Instant Photorealistic 3D Human Creation from a Single Image(https://arxiv.org/abs/2412.14963)
Keywords: transformer
Abstract: Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to efficiently reconstruct photorealistic humans at 1K resolution from a single input image, instantly and on a single GPU. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks.

Title: Knowledge Injection via Prompt Distillation

Authors: Kalle Kujanpää, Harri Valpola, Alexander Ilin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14964
Pdf URL: https://arxiv.org/pdf/2412.14964
Copy Paste: [[2412.14964]] Knowledge Injection via Prompt Distillation(https://arxiv.org/abs/2412.14964)
Keywords: large language model
Abstract: In many practical applications, large language models (LLMs) need to incorporate new knowledge not present in their pre-training data. The primary methods for this are fine-tuning and retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. In this paper, we propose a new fine-tuning technique for learning new knowledge and show that it can reach the performance of RAG. The proposed method is based on the self-distillation approach, which we call prompt distillation. First, we generate question-answer pairs about the new knowledge. Then, we fine-tune a student model on the question-answer pairs to imitate the output distributions of a teacher model, which additionally receives the new knowledge in its prompt. The student model is identical to the teacher, except it is equipped with a LoRA adapter. This training procedure facilitates distilling the new knowledge from the teacher's prompt into the student's weights.
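
The training signal described above (a student imitating the output distributions of a teacher that sees the new knowledge in its prompt) can be sketched as a per-token divergence between the two distributions. The abstract does not name the exact loss, so the use of KL(teacher || student), the function names, and the toy logits below are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student).

    teacher_logits: logits from the model prompted WITH the new knowledge.
    student_logits: logits from the adapter-equipped model prompted WITHOUT it.
    Both are lists with one logit vector per answer token.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

if __name__ == "__main__":
    teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # toy 3-word vocabulary
    student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uninformed student
    print(distillation_loss(teacher, student))
```

In a real setup the gradient of this loss would update only the student's adapter parameters, and the logits would come from the models' forward passes rather than hand-written lists.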

Title: Movie2Story: A framework for understanding videos and telling stories in the form of novel text

Authors: Kangning Li, Zheyang Jia, Anyu Ying
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.14965
Pdf URL: https://arxiv.org/pdf/2412.14965
Copy Paste: [[2412.14965]] Movie2Story: A framework for understanding videos and telling stories in the form of novel text(https://arxiv.org/abs/2412.14965)
Keywords: large language model
Abstract: Multimodal video-to-text models have made considerable progress, primarily in generating brief descriptions of video content. However, there is still a deficiency in generating rich long-form text descriptions that integrate both video and audio. In this paper, we introduce a framework called M2S, designed to generate novel-length text by combining audio, video, and character recognition. M2S includes modules for video long-form text description and comprehension, audio-based analysis of emotion, speech rate, and character alignment, and visual-based character recognition alignment. By integrating multimodal information using the large language model GPT4o, M2S stands out in the field of multimodal text generation. We demonstrate the effectiveness and accuracy of M2S through comparative experiments and human evaluation. Additionally, the model framework has good scalability and significant potential for future research.

Title: Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts

Authors: Ioana Buhnila, Georgeta Cislaru, Amalia Todirascu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.14986
Pdf URL: https://arxiv.org/pdf/2412.14986
Copy Paste: [[2412.14986]] Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts(https://arxiv.org/abs/2412.14986)
Keywords: large language model
Abstract: Large Language Models (LLMs) have been used to generate texts in response to different writing tasks: reports, essays, storytelling. However, language models do not have a meta-representation of the text writing process, nor inherent communication learning needs, comparable to those of young human students. This paper introduces a fine-grained linguistic and textual analysis of multilingual Small Language Models' (SLMs) writing. With our method, Chain-of-MetaWriting, SLMs can imitate some steps of the human writing process, such as planning and evaluation. We mainly focused on short story and essay writing tasks in French for schoolchildren and undergraduate students, respectively. Our results show that SLMs encounter difficulties in assisting young students on sensitive topics such as violence in the schoolyard, and they sometimes use words too complex for the target audience. In particular, the output is quite different from human-produced texts in terms of text cohesion and coherence regarding temporal connectors, topic progression, and reference.

Title: Stitch Contrast and Segment_Learning a Human Action Segmentation Model Using Trimmed Skeleton Videos

Authors: Haitao Tian, Pierre Payeur
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14988
Pdf URL: https://arxiv.org/pdf/2412.14988
Copy Paste: [[2412.14988]] Stitch Contrast and Segment_Learning a Human Action Segmentation Model Using Trimmed Skeleton Videos(https://arxiv.org/abs/2412.14988)
Keywords: segmentation
Abstract: Existing skeleton-based human action classification models rely on well-trimmed action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton action segmentation models incorporate untrimmed skeleton videos into end-to-end training. The model is optimized to provide frame-wise predictions for testing videos of any length, simultaneously realizing action localization and classification. Yet, achieving such an improvement requires frame-wise annotated skeleton videos, which remain time-consuming to produce in practice. This paper features a novel framework for skeleton-based action segmentation that is trained on short trimmed skeleton videos but can run on longer untrimmed videos. The approach is implemented in three steps: Stitch, Contrast, and Segment. First, Stitch proposes a temporal skeleton stitching scheme that treats trimmed skeleton videos as elementary human motions that compose a semantic space and can be sampled to generate multi-action stitched sequences. Contrast learns contrastive representations from stitched sequences with a novel discrimination pretext task that enables a skeleton encoder to learn meaningful action-temporal contexts to improve action segmentation. Finally, Segment relates the proposed method to action segmentation by learning a segmentation layer while handling particular data availability. Experiments involve a trimmed source dataset and an untrimmed target dataset in an adaptation formulation for real-world skeleton-based human action segmentation to evaluate the effectiveness of the proposed method.

Title: Large Language Models and Code Security: A Systematic Literature Review

Authors: Enna Basic, Alberto Giaretta
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.15004
Pdf URL: https://arxiv.org/pdf/2412.15004
Copy Paste: [[2412.15004]] Large Language Models and Code Security: A Systematic Literature Review(https://arxiv.org/abs/2412.15004)
Keywords: security, attack, large language model
Abstract: Large Language Models (LLMs) have emerged as powerful tools for automating various programming tasks, including security-related ones, such as detecting and fixing vulnerabilities. Despite their promising capabilities, when required to produce or modify pre-existing code, LLMs could introduce vulnerabilities unbeknown to the programmer. When analyzing code, they could miss clear vulnerabilities or signal nonexistent ones. In this Systematic Literature Review (SLR), we aim to investigate both the security benefits and potential drawbacks of using LLMs for a variety of code-related tasks. In particular, first we focus on the types of vulnerabilities that could be introduced by LLMs, when used for producing code. Second, we analyze the capabilities of LLMs to detect and fix vulnerabilities, in any given code, and how the prompting strategy of choice impacts their performance in these two tasks. Last, we provide an in-depth analysis on how data poisoning attacks on LLMs can impact performance in the aforementioned tasks.

Title: Robust Federated Learning in the Face of Covariate Shift: A Magnitude Pruning with Hybrid Regularization Framework for Enhanced Model Aggregation

Authors: Ozgu Goksu, Nicolas Pugeault
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.15010
Pdf URL: https://arxiv.org/pdf/2412.15010
Copy Paste: [[2412.15010]] Robust Federated Learning in the Face of Covariate Shift: A Magnitude Pruning with Hybrid Regularization Framework for Enhanced Model Aggregation(https://arxiv.org/abs/2412.15010)
Keywords: security, privacy, robust, federate
Abstract: The development of highly sophisticated neural networks has allowed for fast progress in every field of computer vision; however, applications where annotated data is prohibited due to privacy or security concerns remain challenging. Federated Learning (FL) offers a promising framework for individuals aiming to collaboratively develop a shared model while preserving data privacy. Nevertheless, our findings reveal that variations in data distribution among clients can profoundly affect FL methodologies, primarily due to instabilities in the aggregation process. We also propose a novel FL framework to mitigate the adverse effects of covariate shifts among federated clients by combining individual parameter pruning and regularization techniques to improve the robustness of individual clients' models to aggregation. Each client's model is optimized through magnitude-based pruning and the addition of dropout and noise injection layers to build more resilient decision pathways in the networks and improve the robustness of the model's parameter aggregation step. The proposed framework is capable of extracting robust representations even in the presence of very large covariate shifts among client data distributions and in the federation of a small number of clients. Empirical findings substantiate the effectiveness of our proposed methodology across common benchmark datasets, including CIFAR10, MNIST, SVHN, and Fashion MNIST. Furthermore, we introduce the CelebA-Gender dataset, specifically designed to evaluate performance on a more realistic domain. The proposed method is capable of extracting robust representations even in the presence of both high and low covariate shifts among client data distributions.
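
Magnitude-based pruning, one of the ingredients named above, can be illustrated in a few lines. This is a generic sketch of unstructured magnitude pruning on a flat weight list, not the authors' implementation; the function name and the `sparsity` parameter are assumptions, and the abstract does not say whether pruning is applied per layer or globally.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    weights:  flat list of floats (e.g. one flattened layer).
    sparsity: fraction in [0, 1] of entries to set to zero.
    Returns a new list; the input is left unchanged.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

if __name__ == "__main__":
    layer = [0.5, -0.1, 0.03, -2.0]
    print(magnitude_prune(layer, sparsity=0.5))
```

Each client would apply such a mask locally before its parameters are sent for aggregation; how the masks interact with dropout and noise injection is left to the paper.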

Title: DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space

Authors: Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.15032
Pdf URL: https://arxiv.org/pdf/2412.15032
Copy Paste: [[2412.15032]] DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space(https://arxiv.org/abs/2412.15032)
Keywords: diffusion, generative
Abstract: This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to high-resolution generation without using the latent diffusion paradigm. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why "image diffusion can be seen as spectral autoregression", bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at this https URL.
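
For readers unfamiliar with the DCT space mentioned above, the following sketch shows a one-dimensional orthonormal DCT-II and its inverse. It only illustrates the transform itself; how DCTdiff applies it to images (block size, 2D layout, coefficient ordering) is not stated in the abstract, and the function names here are chosen for illustration.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal)
        )
        coeffs.append(scale * total)
    return coeffs

def idct(coeffs):
    """Inverse of the orthonormal DCT-II (i.e. the orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi / n * (i + 0.5) * k)
        signal.append(total)
    return signal

if __name__ == "__main__":
    row = [3.0, -1.0, 4.0, 1.5, 0.0]  # e.g. one row of pixel intensities
    coeffs = dct(row)
    print(coeffs)
    print(idct(coeffs))  # recovers the original row up to rounding error
```

A constant signal puts all of its energy into the first (lowest-frequency) coefficient, which is the sense in which low-index DCT coefficients carry the coarse content of an image.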

Title: LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps

Authors: Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15035
Pdf URL: https://arxiv.org/pdf/2412.15035
Copy Paste: [[2412.15035]] LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps(https://arxiv.org/abs/2412.15035)
Keywords: robust, large language model
Abstract: Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in the category crime_tax for Italian but remains safe in other languages. Similar differences can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure safe and responsible usage across diverse user communities.

Title: Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion

Authors: Zhifei Chen, Tianshuo Xu, Wenhang Ge, Leyi Wu, Dongyu Yan, Jing He, Luozhou Wang, Lu Zeng, Shunsi Zhang, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15050
Pdf URL: https://arxiv.org/pdf/2412.15050
Copy Paste: [[2412.15050]] Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion(https://arxiv.org/abs/2412.15050)
Keywords: diffusion
Abstract: Rendering and inverse rendering are pivotal tasks in both computer vision and graphics. The rendering equation is the core of the two tasks, as an ideal conditional distribution transfer function from intrinsic properties to RGB images. Although existing rendering methods achieve promising results, they merely approximate the ideal estimation for a specific scene and come with a high computational cost. Additionally, the inverse conditional distribution transfer is intractable due to the inherent ambiguity. To address these challenges, we propose a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Inspired by UniDiffuser, we utilize two distinct time schedules to model both tasks, and with a tailored dual streaming module, we achieve cross-conditioning of two pre-trained diffusion models. This unified approach, named Uni-Renderer, allows the two processes to facilitate each other through a cycle-consistent constraint, mitigating ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a meticulously prepared dataset, our method effectively decomposes intrinsic properties and demonstrates a strong capability to recognize changes during rendering. We will open-source our training and inference code to the public, fostering further research and development in this area.

Title: GIRAFE: Glottal Imaging Dataset for Advanced Segmentation, Analysis, and Facilitative Playbacks Evaluation

Authors: G. Andrade-Miranda, K. Chatzipapas, J.D. Arias-Londoño, J. I. Godino-Llorente
Subjects: cs.CV, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.15054
Pdf URL: https://arxiv.org/pdf/2412.15054
Copy Paste: [[2412.15054]] GIRAFE: Glottal Imaging Dataset for Advanced Segmentation, Analysis, and Facilitative Playbacks Evaluation(https://arxiv.org/abs/2412.15054)
Keywords: segmentation
Abstract: The advances in the development of Facilitative Playbacks extracted from High-Speed videoendoscopic sequences of the vocal folds are hindered by a notable lack of publicly available datasets annotated with the semantic segmentations corresponding to the area of the glottal gap. This fact also limits the reproducibility and further exploration of existing research in this field. To address this gap, we introduce GIRAFE, a data repository designed to facilitate the development of advanced techniques for the semantic segmentation, analysis, and fast evaluation of High-Speed videoendoscopic sequences of the vocal folds. The repository includes 65 high-speed videoendoscopic recordings from a cohort of 50 patients (30 female, 20 male). The dataset comprises 15 recordings from healthy controls, 26 from patients with diagnosed voice disorders, and 24 with an unknown health condition. All of them were manually annotated by an expert, including the masks corresponding to the semantic segmentation of the glottal gap. The repository is also complemented with the automatic segmentation of the glottal area using different state-of-the-art approaches. This dataset has already supported several studies, which demonstrates its usefulness for the development of new glottal gap segmentation algorithms from High-Speed videoendoscopic sequences to improve or create new Facilitative Playbacks. Despite these advances and others in the field, the broader challenge of performing an accurate and completely automatic semantic segmentation of the glottal area remains open.

Title: MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance

Authors: Hallee E. Wong, Jose Javier Gonzalez Ortiz, John Guttag, Adrian V. Dalca
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.15058
Pdf URL: https://arxiv.org/pdf/2412.15058
Copy Paste: [[2412.15058]] MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance(https://arxiv.org/abs/2412.15058)
Keywords: segmentation
Abstract: Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of manually labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, using MultiverSeg reduced the total number of scribble steps by 53% and clicks by 36% to achieve 90% Dice on sets of images from unseen tasks. We release code and model weights at this https URL
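
The "90% Dice" target above refers to the Dice coefficient, 2|A ∩ B| / (|A| + |B|), between a predicted and a reference mask. The following is a minimal sketch for flat binary masks; the function name dice_score is hypothetical, and treating two empty masks as a perfect match is an assumption of this sketch, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length sequences of 0/1 labels."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One of the two predicted foreground pixels overlaps the single reference pixel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the intersection is 1 pixel and the two masks contain 2 and 1 foreground pixels, so the score is 2/3.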

Title: ConfliBERT: A Language Model for Political Conflict

Authors: Patrick T. Brandt, Sultan Alsarra, Vito J. D'Orazio, Dagmar Heintze, Latifur Khan, Shreyas Meher, Javier Osorio, Marcus Sianan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15060
Pdf URL: https://arxiv.org/pdf/2412.15060
Copy Paste: [[2412.15060]] ConfliBERT: A Language Model for Political Conflict(https://arxiv.org/abs/2412.15060)
Keywords: large language model
Abstract: Conflict scholars have used rule-based approaches to extract information about political violence from news reports and texts. Recent Natural Language Processing developments move beyond rigid rule-based approaches. We review our recent ConfliBERT language model (Hu et al. 2022) to process political and violence-related texts. The model can be used to extract actor and action classifications from texts about political conflict. When fine-tuned, results show that ConfliBERT has superior performance in accuracy, precision and recall over other large language models (LLMs) like Google's Gemma 2 (9B), Meta's Llama 3.1 (7B), and Alibaba's Qwen 2.5 (14B) within its relevant domains. It is also hundreds of times faster than these more generalist LLMs. These results are illustrated using texts from the BBC, re3d, and the Global Terrorism Dataset (GTD).

Title: ScamChatBot: An End-to-End Analysis of Fake Account Recovery on Social Media via Chatbots

Authors: Bhupendra Acharya, Dominik Sautter, Muhammad Saad, Thorsten Holz
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.15072
Pdf URL: https://arxiv.org/pdf/2412.15072
Copy Paste: [[2412.15072]] ScamChatBot: An End-to-End Analysis of Fake Account Recovery on Social Media via Chatbots(https://arxiv.org/abs/2412.15072)
Keywords: large language model
Abstract: Social media platforms have become the hubs for various user interactions covering a wide range of needs, including technical support and services related to brands, products, or user accounts. Unfortunately, there has been a recent surge in scammers impersonating official services and providing fake technical support to users through these platforms. In this study, we focus on scammers engaging in such fake technical support to target users who are having problems recovering their accounts. More specifically, we focus on users encountering access problems with social media profiles (e.g., on platforms such as Facebook, Instagram, Gmail, and X) and cryptocurrency wallets. The main contribution of our work is the development of an automated system that interacts with scammers via a chatbot that mimics different personas. By initiating decoy interactions (e.g., through deceptive tweets), we have enticed scammers to interact with our system so that we can analyze their modus operandi. Our results show that scammers employ many social media profiles asking users to contact them via a few communication channels. Using a large language model (LLM), our chatbot had conversations with 450 scammers and provided valuable insights into their tactics and, most importantly, their payment profiles. This automated approach highlights how scammers use a variety of strategies, including role-playing, to trick victims into disclosing personal or financial information. With this study, we lay the foundation for using automated chat-based interactions with scammers to detect and study fraudulent activities at scale in an automated way.

Title: AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15084
Pdf URL: https://arxiv.org/pdf/2412.15084
Copy Paste: [[2412.15084]] AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling(https://arxiv.org/abs/2412.15084)
Keywords: robust
Abstract: In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: this https URL

Title: Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation

Authors: Haoran Liu, Youzhi Luo, Tianxiao Li, James Caverlee, Martin Renqiang Min
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15086
Pdf URL: https://arxiv.org/pdf/2412.15086
Copy Paste: [[2412.15086]] Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation(https://arxiv.org/abs/2412.15086)
Keywords: generative
Abstract: We consider the conditional generation of 3D drug-like molecules with explicit control over molecular properties such as drug-like properties (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effective binding to specific protein sites. To tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment-based coordinate loss to adapt equivariant networks for auto-regressive de-novo 3D molecule generation from scratch. Extensive experiments validate our model's effectiveness on property-guided and context-guided molecule generation, both for de-novo 3D molecule design and structure-based drug discovery against protein targets.

Title: A Full Transformer-based Framework for Automatic Pain Estimation using Videos

Authors: Stefanos Gkikas, Manolis Tsiknakis
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15095
Pdf URL: https://arxiv.org/pdf/2412.15095
Copy Paste: [[2412.15095]] A Full Transformer-based Framework for Automatic Pain Estimation using Videos(https://arxiv.org/abs/2412.15095)
Keywords: transformer
Abstract: The automatic estimation of pain is essential in designing an optimal pain management system offering reliable assessment and reducing the suffering of patients. In this study, we present a novel full transformer-based framework consisting of a Transformer in Transformer (TNT) model and a Transformer leveraging cross-attention and self-attention blocks. Elaborating on videos from the BioVid database, we demonstrate state-of-the-art performances, showing the efficacy, efficiency, and generalization capability across all the primary pain estimation tasks.

Title: Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability

Authors: Xiangsen Chen, Xuming Hu, Nan Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15101
Pdf URL: https://arxiv.org/pdf/2412.15101
Copy Paste: [[2412.15101]] Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability(https://arxiv.org/abs/2412.15101)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) frameworks have emerged as a promising solution to multi-hop question answering (QA) tasks since they enable large language models (LLMs) to incorporate external knowledge and mitigate their inherent knowledge deficiencies. Despite this progress, existing RAG frameworks, which usually follow the retrieve-then-read paradigm, often struggle with multi-hop QA with temporal information since they have difficulty retrieving and synthesizing accurate time-related information. To address the challenge, this paper proposes a novel framework called review-then-refine, which aims to enhance LLM performance in multi-hop QA scenarios with temporal information. Our approach begins with a review phase, where decomposed sub-queries are dynamically rewritten with temporal information, allowing for a subsequent adaptive retrieval and reasoning process. In addition, we implement an adaptive retrieval mechanism to minimize unnecessary retrievals, thus reducing the potential for hallucinations. In the subsequent refine phase, the LLM synthesizes the retrieved information from each sub-query along with its internal knowledge to formulate a coherent answer. Extensive experimental results across multiple datasets demonstrate the effectiveness of our proposed framework, highlighting its potential to significantly improve multi-hop QA capabilities in LLMs.

Title: Qwen2.5 Technical Report

Authors: Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu (additional authors not shown)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15115
Pdf URL: https://arxiv.org/pdf/2412.15115
Copy Paste: [[2412.15115]] Qwen2.5 Technical Report(https://arxiv.org/abs/2412.15115)
Keywords: large language model
Abstract: In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present the Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

Title: Outcome-Refining Process Supervision for Code Generation

Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.15118
Pdf URL: https://arxiv.org/pdf/2412.15118
Copy Paste: [[2412.15118]] Outcome-Refining Process Supervision for Code Generation(https://arxiv.org/abs/2412.15118)
Keywords: large language model
Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, and creates more reliable verification than traditional reward models without requiring the training of PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average of 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: this https URL

Title: Efficient Ranking, Order Statistics, and Sorting under CKKS

Authors: Federico Mazzone, Maarten Everts, Florian Hahn, Andreas Peter
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.15126
Pdf URL: https://arxiv.org/pdf/2412.15126
Copy Paste: [[2412.15126]] Efficient Ranking, Order Statistics, and Sorting under CKKS(https://arxiv.org/abs/2412.15126)
Keywords: privacy
Abstract: Fully Homomorphic Encryption (FHE) enables operations on encrypted data, making it extremely useful for privacy-preserving applications, especially in cloud computing environments. In such contexts, operations like ranking, order statistics, and sorting are fundamental functionalities often required for database queries or as building blocks of larger protocols. However, the high computational overhead and limited native operations of FHE pose significant challenges for an efficient implementation of these tasks. These challenges are exacerbated by the fact that all these functionalities are based on comparing elements, which is a severely expensive operation under encryption. Previous solutions have typically based their designs on swap-based techniques, where two elements are conditionally swapped based on the results of their comparison. These methods aim to reduce the primary computational bottleneck: the comparison depth, which is the number of non-parallelizable homomorphic comparisons. The current state-of-the-art solution for sorting by Lu et al. (IEEE S&P'21), for instance, achieves a comparison depth of O(log^2(N)). In this paper, we address the challenge of reducing the comparison depth by shifting away from the swap-based paradigm. We present solutions for ranking, order statistics, and sorting that all achieve a comparison depth of O(1), making our approach highly parallelizable. Leveraging the SIMD capabilities of the CKKS FHE scheme, our approach re-encodes the input vector under encryption to allow for simultaneous comparisons of all elements with each other. The homomorphic re-encoding incurs a minimal computational overhead of O(log(N)) rotations. Experimental results show that our approach ranks a 128-element vector in approximately 2.64s, computes its argmin/argmax in 14.18s, and sorts it in 21.10s.
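
The idea of comparing all elements with each other, rather than swapping, can be illustrated on plaintext data. The sketch below is only a plaintext analogue under stated assumptions: it involves no encryption, no CKKS re-encoding or rotations, and no approximate comparison function, and the names rank_by_all_pairs and sort_by_ranks are hypothetical. It shows why the comparison depth is constant: every pairwise comparison is independent of every other, and only a summation follows.

```python
def rank_by_all_pairs(values):
    """Rank each element by comparing it with every other element.

    rank[i] = 1 + number of elements strictly smaller than values[i].
    No comparison depends on the result of another, so all of them
    could be evaluated in a single parallel round.
    """
    n = len(values)
    # Full comparison matrix: smaller[i][j] is 1 if values[j] < values[i].
    smaller = [[1 if values[j] < values[i] else 0 for j in range(n)] for i in range(n)]
    return [1 + sum(row) for row in smaller]


def sort_by_ranks(values):
    """Place each element at the slot given by its rank (assumes distinct values)."""
    ranks = rank_by_all_pairs(values)
    out = [None] * len(values)
    for value, rank in zip(values, ranks):
        out[rank - 1] = value
    return out


ranks = rank_by_all_pairs([3.0, 1.0, 2.0])
ordered = sort_by_ranks([3.0, 1.0, 2.0])
```

The cost of this formulation is N^2 comparisons instead of O(N log N), traded for a depth of one comparison round; ties would need extra handling that this sketch omits.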

Title: Adaptive Pruning for Large Language Models with Structural Importance Awareness

Authors: Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15127
Pdf URL: https://arxiv.org/pdf/2412.15127
Copy Paste: [[2412.15127]] Adaptive Pruning for Large Language Models with Structural Importance Awareness(https://arxiv.org/abs/2412.15127)
Keywords: large language model
Abstract: The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B, respectively. Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.

Title: Jet: A Modern Transformer-Based Normalizing Flow

Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15129
Pdf URL: https://arxiv.org/pdf/2412.15129
Copy Paste: [[2412.15129]] Jet: A Modern Transformer-Based Normalizing Flow(https://arxiv.org/abs/2412.15129)
Keywords: diffusion, transformer, generative
Abstract: In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute log-likelihood of the input data, fast generation and simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as visual quality of the samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches or diffusion models. In this paper we revisit the design of the coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture, not convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.

Title: Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Authors: Bastian Jäckl, Yannick Metz, Udo Schlegel, Daniel A. Keim, Maximilian T. Fischer
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15150
Pdf URL: https://arxiv.org/pdf/2412.15150
Copy Paste: [[2412.15150]] Leveraging Color Channel Independence for Improved Unsupervised Object Detection(https://arxiv.org/abs/2412.15150)
Keywords: robust
Abstract: Object-centric architectures can learn to extract distinct object representations from visual scenes, enabling downstream applications on the object level. Similarly to autoencoder-based image models, object-centric approaches have been trained on the unsupervised reconstruction loss of images encoded by RGB color spaces. In our work, we challenge the common assumption that RGB is the optimal color space for unsupervised learning in computer vision. We discuss conceptually and empirically that other color spaces, such as HSV, bear essential characteristics for object-centric representation learning, like robustness to lighting conditions. We further show that models improve when requiring them to predict additional color channels. Specifically, we propose to transform the predicted targets to the RGB-S space, which extends RGB with HSV's saturation component and leads to markedly better reconstruction and disentanglement for five common evaluation datasets. The use of composite color spaces can be implemented with basically no computational overhead, is agnostic of the models' architecture, and is universally applicable across a wide range of visual computing tasks and training types. The findings of our approach encourage additional investigations in computer vision tasks beyond object-centric learning.
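
A composite RGB-S target of the kind described above can be sketched by appending HSV saturation to each RGB pixel. This is a minimal sketch assuming channel values in [0, 1] and the standard HSV saturation as computed by Python's colorsys module; the function names rgb_to_rgbs and image_to_rgbs are hypothetical, and any normalization or weighting the paper applies to the extra channel is not given in the abstract.

```python
import colorsys


def rgb_to_rgbs(pixel):
    """Extend one (r, g, b) pixel in [0, 1] with its HSV saturation channel."""
    r, g, b = pixel
    # colorsys returns (hue, saturation, value); only saturation is kept.
    _, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    return (r, g, b, saturation)


def image_to_rgbs(image):
    """Apply rgb_to_rgbs to every pixel of an image given as rows of pixels."""
    return [[rgb_to_rgbs(px) for px in row] for row in image]


pure_red = rgb_to_rgbs((1.0, 0.0, 0.0))
mid_gray = rgb_to_rgbs((0.5, 0.5, 0.5))
pale_red = rgb_to_rgbs((1.0, 0.5, 0.5))
```

Pure red is fully saturated (S = 1), gray has S = 0, and the pale red has S = 0.5, so the fourth channel separates colorful from washed-out regions independently of the raw RGB values.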

Title: Language Models as Continuous Self-Evolving Data Engineers

Authors: Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15151
Pdf URL: https://arxiv.org/pdf/2412.15151
Copy Paste: [[2412.15151]] Language Models as Continuous Self-Evolving Data Engineers(https://arxiv.org/abs/2412.15151)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, while their further evolution is limited by the lack of high-quality training data. In addition, traditional training approaches rely too much on expert-labeled data, setting an upper limit on the performance of LLMs. To address this issue, we propose a novel paradigm, named LANCE, that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of the post-training data construction process. Through iterative fine-tuning on different variants of Qwen2, we validate the effectiveness of LANCE across various tasks, showing that it can continuously improve model performance and maintain high-quality data generation. Across eight benchmark dimensions, LANCE resulted in an average score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human values and preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities.

Title: Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Authors: Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.15156
Pdf URL: https://arxiv.org/pdf/2412.15156
Copy Paste: [[2412.15156]] Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM(https://arxiv.org/abs/2412.15156)
Keywords: diffusion
Abstract: Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.

Title: OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Authors: Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15159
Pdf URL: https://arxiv.org/pdf/2412.15159
Copy Paste: [[2412.15159]] OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization(https://arxiv.org/abs/2412.15159)
Keywords: diffusion
Abstract: In recent years, the field of text-to-video (T2V) generation has made significant strides. Despite this progress, there is still a gap between theoretical advancements and practical application, amplified by issues like degraded image quality and flickering artifacts. Recent advancements in enhancing the video diffusion model (VDM) through feedback learning have shown promising results. However, these methods still exhibit notable limitations, such as misaligned feedback and inferior scalability. To tackle these issues, we introduce OnlineVPO, a more efficient preference learning approach tailored specifically for video diffusion models. Our method features two novel designs. First, instead of directly using image-based reward feedback, we leverage a video quality assessment (VQA) model trained on synthetic data as the reward model to provide distribution- and modality-aligned feedback on the video diffusion model. Second, we introduce an online DPO algorithm to address the off-policy optimization and scalability issues in existing video preference learning frameworks. By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance. Extensive experiments on the open-source video-diffusion model demonstrate that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models, offering valuable insights for future advancements in this domain.

Title: Rethinking Uncertainty Estimation in Natural Language Generation

Authors: Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.15176
Pdf URL: https://arxiv.org/pdf/2412.15176
Copy Paste: [[2412.15176]] Rethinking Uncertainty Estimation in Natural Language Generation(https://arxiv.org/abs/2412.15176)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.
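
Illustration (not from the paper): a minimal sketch of the quantity the abstract describes, the negative log-likelihood of a single sequence produced by greedy decoding. The callable `next_token_probs`, the `eos` marker, the `max_len` cap, and the toy model are hypothetical stand-ins for a real language model interface.

```python
import math


def g_nll(next_token_probs, prompt, eos="<eos>", max_len=50):
    """Greedy-decode one sequence and return (tokens, negative log-likelihood).

    next_token_probs(prefix) is a hypothetical callable that returns a dict
    mapping each candidate next token to its probability given the prefix.
    """
    prefix = list(prompt)
    output = []
    nll = 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)  # greedy choice: most likely token
        nll -= math.log(probs[token])      # accumulate -log p(token | prefix)
        output.append(token)
        prefix.append(token)
        if token == eos:
            break
    return output, nll


def toy_model(prefix):
    """Toy two-step distribution, used only to exercise g_nll."""
    if prefix[-1] == "Q":
        return {"yes": 0.8, "no": 0.2}
    return {"<eos>": 0.9, "yes": 0.1}
```

Calling `g_nll(toy_model, ["Q"])` decodes `yes` and then `<eos>`, so the score is -log(0.8) - log(0.9). A larger value means the model assigns lower probability to its own greedy answer. Only one sequence is generated, which is the efficiency argument made in the abstract.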

Title: Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning

Authors: Simon Frieder, Jonas Bayer, Katherine M. Collins, Julius Berner, Jacob Loader, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Thomas Lukasiewicz, Timothy Gowers
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.15184
Pdf URL: https://arxiv.org/pdf/2412.15184
Copy Paste: [[2412.15184]] Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning(https://arxiv.org/abs/2412.15184)
Keywords: large language model
Abstract: The suite of datasets commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibits several shortcomings. These limitations include a restricted scope of mathematical complexity, typically not exceeding lower undergraduate-level mathematics, binary rating protocols and other issues, which make comprehensive proof-based evaluation suites difficult to build. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or any forthcoming advancements in AI-based mathematical assistants (copilots or "thought partners"), necessitates a paradigm shift in the design of mathematical datasets and the evaluation criteria of mathematical ability: It is necessary to move away from result-based datasets (theorem statement to theorem proof) and convert the rich facets of mathematical research practice to data LLMs can train on. Examples of these are mathematical workflows (sequences of atomic, potentially subfield-dependent tasks that are often performed when creating new mathematics), which are an important part of the proof-discovery process. Additionally, we advocate for mathematical dataset developers to consider the concept of "motivated proof", introduced by G. Pólya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal, alleviating some of the mentioned limitations. Lastly, we introduce math datasheets for datasets, extending the general, dataset-agnostic variants of datasheets: We provide a questionnaire designed specifically for math datasets that we urge dataset creators to include with their datasets. This will make creators aware of potential limitations of their datasets while at the same time making it easy for readers to assess them from the point of view of training and evaluating mathematical copilots.

Title: Tiled Diffusion

Authors: Or Madar, Ohad Fried
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15185
Pdf URL: https://arxiv.org/pdf/2412.15185
Copy Paste: [[2412.15185]] Tiled Diffusion(https://arxiv.org/abs/2412.15185)
Keywords: diffusion, generative
Abstract: Image tiling -- the seamless connection of disparate images to create a coherent visual field -- is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, a method that poses significant limitations in scalability and flexibility. Recent research has attempted to automate this process using generative models. However, current approaches primarily focus on tiling textures and manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to accommodate the generation of cohesive tiling patterns across various domains of image synthesis that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, eliminating the need for manual intervention and enhancing creative possibilities in various applications, such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.

Title: LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15188
Pdf URL: https://arxiv.org/pdf/2412.15188
Copy Paste: [[2412.15188]] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation(https://arxiv.org/abs/2412.15188)
Keywords: diffusion, transformer, generative, large language model
Abstract: We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages the existing weights of Llama-3 for processing text autoregressively while introducing additional and parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and only training the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.

Title: AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.15191
Pdf URL: https://arxiv.org/pdf/2412.15191
Copy Paste: [[2412.15191]] AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation(https://arxiv.org/abs/2412.15191)
Keywords: diffusion
Abstract: We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: this http URL

Title: MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

Authors: Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.15194
Pdf URL: https://arxiv.org/pdf/2412.15194
Copy Paste: [[2412.15194]] MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark(https://arxiv.org/abs/2412.15194)
Keywords: large language model
Abstract: Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at this https URL and the dataset is available at this https URL.

Title: DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.15200
Pdf URL: https://arxiv.org/pdf/2412.15200
Copy Paste: [[2412.15200]] DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation(https://arxiv.org/abs/2412.15200)
Keywords: diffusion, transformer
Abstract: Procedural Content Generation (PCG) is powerful in creating high-quality 3D content, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sample iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately, and generalizing well to in-the-wild images. Quantitative and qualitative experimental results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.

Title: AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving

Authors: Shuo Xing, Hongyuan Hua, Xiangbo Gao, Shenzhe Zhu, Renjie Li, Kexin Tian, Xiaopeng Li, Heng Huang, Tianbao Yang, Zhangyang Wang, Yang Zhou, Huaxiu Yao, Zhengzhong Tu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.15206
Pdf URL: https://arxiv.org/pdf/2412.15206
Copy Paste: [[2412.15206]] AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving(https://arxiv.org/abs/2412.15206)
Keywords: privacy, attack, robust, fair
Abstract: Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs -- a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives -- including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that the general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs -- an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. Our benchmark is publicly available at this https URL, and the leaderboard is released at this https URL.

Title: OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving

Authors: Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, Zhengzhong Tu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.15208
Pdf URL: https://arxiv.org/pdf/2412.15208
Copy Paste: [[2412.15208]] OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving(https://arxiv.org/abs/2412.15208)
Keywords: robust, large language model
Abstract: Since their advent, Multimodal Large Language Models (MLLMs) have made a significant impact across a wide range of real-world applications, particularly in Autonomous Driving (AD). Their ability to process complex visual data and reason about intricate driving scenarios has paved the way for a new paradigm in end-to-end AD systems. However, the progress of developing end-to-end models for AD has been slow, as existing fine-tuning methods demand substantial resources, including extensive computational power, large-scale datasets, and significant funding. Drawing inspiration from recent advancements in inference computing, we propose OpenEMMA, an open-source end-to-end framework based on MLLMs. By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements compared to the baseline when leveraging a diverse range of MLLMs. Furthermore, OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios, offering a more efficient and effective approach to autonomous driving. We release all the code at this https URL.

Title: PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

Authors: Muntasir Wahed, Kiet A. Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15209
Pdf URL: https://arxiv.org/pdf/2412.15209
Copy Paste: [[2412.15209]] PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation(https://arxiv.org/abs/2412.15209)
Keywords: robust, segmentation
Abstract: Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.

Title: Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation

Authors: Hadi Alzayer, Philipp Henzler, Jonathan T. Barron, Jia-Bin Huang, Pratul P. Srinivasan, Dor Verbin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15211
Pdf URL: https://arxiv.org/pdf/2412.15211
Copy Paste: [[2412.15211]] Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation(https://arxiv.org/abs/2412.15211)
Keywords: robust, diffusion, generative
Abstract: Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult as the illumination and therefore the object appearance vary across captured images. This is particularly challenging for more specular objects whose appearance strongly depends on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically-based rendering to recover the materials and per-image illumination. Such approaches fail at faithfully recovering view-dependent appearance given significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both synthetic and real datasets and demonstrate that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent "shiny" appearance which cannot be reconstructed by prior methods.

Title: Scaling 4D Representations

Authors: João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15212
Pdf URL: https://arxiv.org/pdf/2412.15212
Copy Paste: [[2412.15212]] Scaling 4D Representations(https://arxiv.org/abs/2412.15212)
Keywords: transformer
Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks -- action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model -- 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.

Title: Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15213
Pdf URL: https://arxiv.org/pdf/2412.15213
Copy Paste: [[2412.15213]] Flowing from Words to Pixels: A Framework for Cross-Modality Evolution(https://arxiv.org/abs/2412.15213)
Keywords: diffusion, transformer
Abstract: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, flow matching models do not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
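
Illustration (not from the paper): a minimal sketch of flow matching between paired source and target vectors, where the source is a sample from another modality rather than Gaussian noise. The straight-line interpolation path, the Euler sampler, and every function name here are assumptions made for this example. The abstract does not specify which probability path or solver CrossFlow uses.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from source x0 (t=0) to target x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and the target velocity.

    velocity_fn(x_t, t) is a hypothetical model that predicts a velocity.
    x0 could be a source-modality latent (e.g. text) and x1 the paired
    target latent (e.g. image); no noise sample or conditioning is used.
    """
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

With a velocity function that already returns the true displacement, the loss is zero and `euler_sample` carries the source vector onto the target. In training, `velocity_fn` would be a learned network and `t` would be drawn at random for each pair.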

Title: LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15214
Pdf URL: https://arxiv.org/pdf/2412.15214
Copy Paste: [[2412.15214]] LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis(https://arxiv.org/abs/2412.15214)
Keywords: diffusion
Abstract: The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images. Project page: this https URL