2024-12-16

Title: Blockchain Data Analysis in the Era of Large-Language Models

Authors: Kentaroh Toyoda, Xiao Wang, Mingzhe Li, Bo Gao, Yuan Wang, Qingsong Wei
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09640
Pdf URL: https://arxiv.org/pdf/2412.09640
Copy Paste: [[2412.09640]] Blockchain Data Analysis in the Era of Large-Language Models(https://arxiv.org/abs/2412.09640)
Keywords: security, large language model
Abstract: Blockchain data analysis is essential for deriving insights, tracking transactions, identifying patterns, and ensuring the integrity and security of decentralized networks. It plays a key role in various areas, such as fraud detection, regulatory compliance, smart contract auditing, and decentralized finance (DeFi) risk management. However, existing blockchain data analysis tools face challenges, including data scarcity, the lack of generalizability, and the lack of reasoning capability. We believe large language models (LLMs) can mitigate these challenges; however, we have not seen papers discussing LLM integration in blockchain data analysis in a comprehensive and systematic way. This paper systematically explores potential techniques and design patterns in LLM-integrated blockchain data analysis. We also outline prospective research opportunities and challenges, emphasizing the need for further exploration in this promising field. This paper aims to benefit a diverse audience spanning academia, industry, and policy-making, offering valuable insights into the integration of LLMs in blockchain data analysis.

Title: Machine Learning Driven Smishing Detection Framework for Mobile Security

Authors: Diksha Goel, Hussain Ahmad, Ankit Kumar Jain, Nikhil Kumar Goel
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09641
Pdf URL: https://arxiv.org/pdf/2412.09641
Copy Paste: [[2412.09641]] Machine Learning Driven Smishing Detection Framework for Mobile Security(https://arxiv.org/abs/2412.09641)
Keywords: security, attack, robust
Abstract: The increasing reliance on smartphones for communication, financial transactions, and personal data management has made them prime targets for cyberattacks, particularly smishing, a sophisticated variant of phishing conducted via SMS. Despite the growing threat, traditional detection methods often struggle with the informal and evolving nature of SMS language, which includes abbreviations, slang, and short forms. This paper presents an enhanced content-based smishing detection framework that leverages advanced text normalization techniques to improve detection accuracy. By converting nonstandard text into its standardized form, the proposed model enhances the efficacy of machine learning classifiers, particularly the Naive Bayesian classifier, in distinguishing smishing messages from legitimate ones. Our experimental results, validated on a publicly available dataset, demonstrate a detection accuracy of 96.2%, with a low False Positive Rate of 3.87% and False Negative Rate of 2.85%. This approach significantly outperforms existing methodologies, providing a robust solution to the increasingly sophisticated threat of smishing in the mobile environment.

Title: Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Authors: Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09645
Pdf URL: https://arxiv.org/pdf/2412.09645
Copy Paste: [[2412.09645]] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models(https://arxiv.org/abs/2412.09645)
Keywords: explainability, diffusion, generative
Abstract: Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.

Title: From Noise to Nuance: Advances in Deep Generative Image Models

Authors: Benji Peng, Chia Xin Liang, Ziqian Bi, Ming Liu, Yichao Zhang, Tianyang Wang, Keyu Chen, Xinyuan Song, Pohsun Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09656
Pdf URL: https://arxiv.org/pdf/2412.09656
Copy Paste: [[2412.09656]] From Noise to Nuance: Advances in Deep Generative Image Models(https://arxiv.org/abs/2412.09656)
Keywords: diffusion, transformer, generative
Abstract: Deep learning-based image generation has undergone a paradigm shift since 2021, marked by fundamental architectural breakthroughs and computational innovations. Through reviewing architectural innovations and empirical results, this paper analyzes the transition from traditional generative methods to advanced architectures, with focus on compute-efficient diffusion models and vision transformer architectures. We examine how recent developments in Stable Diffusion, DALL-E, and consistency models have redefined the capabilities and performance boundaries of image synthesis, while addressing persistent challenges in efficiency and quality. Our analysis focuses on the evolution of latent space representations, cross-attention mechanisms, and parameter-efficient training methodologies that enable accelerated inference under resource constraints. While more efficient training methods enable faster inference, advanced control mechanisms like ControlNet and regional attention systems have simultaneously improved generation precision and content customization. We investigate how enhanced multi-modal understanding and zero-shot generation capabilities are reshaping practical applications across industries. Our analysis demonstrates that despite remarkable advances in generation quality and computational efficiency, critical challenges remain in developing resource-conscious architectures and interpretable generation systems for industrial applications. The paper concludes by mapping promising research directions, including neural architecture optimization and explainable generation frameworks.

Title: SEGT: A General Spatial Expansion Group Transformer for nuScenes Lidar-based Object Detection Task

Authors: Cheng Mei, Hao He, Yahui Liu, Zhenhua Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09658
Pdf URL: https://arxiv.org/pdf/2412.09658
Copy Paste: [[2412.09658]] SEGT: A General Spatial Expansion Group Transformer for nuScenes Lidar-based Object Detection Task(https://arxiv.org/abs/2412.09658)
Keywords: transformer
Abstract: In the technical report, we present a novel transformer-based framework for nuScenes lidar-based object detection task, termed Spatial Expansion Group Transformer (SEGT). To efficiently handle the irregular and sparse nature of point cloud, we propose migrating the voxels into distinct specialized ordered fields with the general spatial expansion strategies, and employ group attention mechanisms to extract the exclusive feature maps within each field. Subsequently, we integrate the feature representations across different ordered fields by alternately applying diverse expansion strategies, thereby enhancing the model's ability to capture comprehensive spatial information. The method was evaluated on the nuScenes lidar-based object detection test dataset, achieving an NDS score of 73.5 without Test-Time Augmentation (TTA) and 74.2 with TTA, demonstrating the effectiveness of the proposed method.

Title: Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals

Authors: Messi H.J. Lee, Soyeon Jeon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09668
Pdf URL: https://arxiv.org/pdf/2412.09668
Copy Paste: [[2412.09668]] Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals(https://arxiv.org/abs/2412.09668)
Keywords: large language model
Abstract: Vision-Language Models (VLMs) combine Large Language Model (LLM) capabilities with image processing, enabling tasks like image captioning and text-to-image generation. Yet concerns persist about their potential to amplify human-like biases, including skin tone bias. Skin tone bias, where darker-skinned individuals face more negative stereotyping than lighter-skinned individuals, is well-documented in the social sciences but remains under-explored in Artificial Intelligence (AI), particularly in VLMs. While well-documented in the social sciences, this bias remains under-explored in AI, particularly in VLMs. Using the GAN Face Database, we sampled computer-generated images of Black American men and women, controlling for skin tone variations while keeping other features constant. We then asked VLMs to write stories about these faces and compared the homogeneity of the generated stories. Stories generated by VLMs about darker-skinned Black individuals were more homogeneous than those about lighter-skinned individuals in three of four models, and Black women were consistently represented more homogeneously than Black men across all models. Interaction effects revealed a greater impact of skin tone on women in two VLMs, while the other two showed nonsignificant results, reflecting known stereotyping patterns. These findings underscore the propagation of biases from single-modality AI systems to multimodal models and highlight the need for further research to address intersectional biases in AI.

Title: The Cost of Replicability in Active Learning

Authors: Rupkatha Hira, Dominik Kau, Jessica Sorrell
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09686
Pdf URL: https://arxiv.org/pdf/2412.09686
Copy Paste: [[2412.09686]] The Cost of Replicability in Active Learning(https://arxiv.org/abs/2412.09686)
Keywords: robust
Abstract: Active learning aims to reduce the required number of labeled data for machine learning algorithms by selectively querying the labels of initially unlabeled data points. Ensuring the replicability of results, where an algorithm consistently produces the same outcome across different runs, is essential for the reliability of machine learning models but often increases sample complexity. This report investigates the cost of replicability in active learning using the CAL algorithm, a classical disagreement-based active learning method. By integrating replicable statistical query subroutines and random thresholding techniques, we propose two versions of a replicable CAL algorithm. Our theoretical analysis demonstrates that while replicability does increase label complexity, the CAL algorithm can still achieve significant savings in label complexity even with the replicability constraint. These findings offer valuable insights into balancing efficiency and robustness in machine learning models.

Title: DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations

Authors: Wenhao Hu, Paul Henderson, José Cano
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.09687
Pdf URL: https://arxiv.org/pdf/2412.09687
Copy Paste: [[2412.09687]] DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations(https://arxiv.org/abs/2412.09687)
Keywords: segmentation
Abstract: Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and Huffman coding to be efficient and achieve high accuracy. We evaluate DQA with 3, 4, and 5-bit quantization levels and three different DNN models for two different tasks, image classification and image segmentation, on two different datasets. DQA shows significantly better accuracy (up to 29.28%) compared to the direct quantization method and the state-of-the-art NoisyQuant for sub-6-bit quantization.

Title: TOAP: Towards Better Robustness in Universal Transferable Anti-Facial Retrieval

Authors: Yunna Lv, Long Tang, Dengpan Ye, Caiyun Xie, Jiacheng Deng, Yiheng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09692
Pdf URL: https://arxiv.org/pdf/2412.09692
Copy Paste: [[2412.09692]] TOAP: Towards Better Robustness in Universal Transferable Anti-Facial Retrieval(https://arxiv.org/abs/2412.09692)
Keywords: privacy, protect, robust
Abstract: Deep hash-based retrieval techniques are widely used in facial retrieval systems to improve the efficiency of facial matching. However, it also brings the risk of privacy leakage. Deep hash models are easily influenced by adversarial examples, which can be leveraged to prevent the malicious retrieval of private images. The existing adversarial example methods against deep hash models focus on universality and transferability, lacking the research on its robustness in online social networks (OSNs), which leads to their failure in anti-retrieval after post-processing. Therefore, we provide the first in-depth discussion on robustness adversarial perturbation in universal transferable anti-facial retrieval and propose Three-in-One Adversarial Perturbation (TOAP). Specifically, we firstly analyze the performance of deep hash models after post-processing and construct a local and global Compression Generator (CG) to simulate complex post-processing scenarios. Then, we explore the variation patterns of the model's objective under image post-processing and propose robust optimization objectives, cluster centers and data space centers, optimizing them using meta-learning. Finally, we iteratively optimize perturbation by alternately generating adversarial examples and fine-tuning the CG, balancing the performance of perturbation while enhancing CG's ability to mitigate them. Numerous experiments demonstrate that, in addition to its advantages in universality and transferability, TOAP significantly outperforms current state-of-the-art methods in multiple robustness metrics. It further improves universality and transferability by 5% to 28%, and achieves up to about 33% significant improvement in several simulated post-processing scenarios as well as mainstream OSNs, demonstrating that TOAP can effectively protect private images from malicious retrieval in real-world scenarios.

Title: Omni-ID: Holistic Identity Representation Designed for Generative Tasks

Authors: Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09694
Pdf URL: https://arxiv.org/pdf/2412.09694
Copy Paste: [[2412.09694]] Omni-ID: Holistic Identity Representation Designed for Generative Tasks(https://arxiv.org/abs/2412.09694)
Keywords: generative
Abstract: We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varied number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach uses a few-to-many identity reconstruction training paradigm, where a limited set of input images is used to reconstruct multiple target images of the same individual in various poses and expressions. A multi-decoder framework is further employed to leverage the complementary strengths of diverse decoders during training. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset -- a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.

Title: Soybean Maturity Prediction using 2D Contour Plots from Drone based Time Series Imagery

Authors: Bitgoeul Kim, Samuel W. Blair, Talukder Z. Jubery, Soumik Sarkar, Arti Singh, Asheesh K. Singh, Baskar Ganapathysubramanian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09696
Pdf URL: https://arxiv.org/pdf/2412.09696
Copy Paste: [[2412.09696]] Soybean Maturity Prediction using 2D Contour Plots from Drone based Time Series Imagery(https://arxiv.org/abs/2412.09696)
Keywords: robust
Abstract: Plant breeding programs require assessments of days to maturity for accurate selection and placement of entries in appropriate tests. In the early stages of the breeding pipeline, soybean breeding programs assign relative maturity ratings to experimental varieties that indicate their suitable maturity zones. Traditionally, the estimation of maturity value for breeding varieties has involved breeders manually inspecting fields and assessing maturity value visually. This approach relies heavily on rater judgment, making it subjective and time-consuming. This study aimed to develop a machine-learning model for evaluating soybean maturity using UAV-based time-series imagery. Images were captured at three-day intervals, beginning as the earliest varieties started maturing and continuing until the last varieties fully matured. The data collected for this experiment consisted of 22,043 plots collected across three years (2021 to 2023) and represent relative maturity groups 1.6 - 3.9. We utilized contour plot images extracted from the time-series UAV RGB imagery as input for a neural network model. This contour plot approach encoded the temporal and spatial variation within each plot into a single image. A deep learning model was trained to utilize this contour plot to predict maturity ratings. This model significantly improves accuracy and robustness, achieving up to 85% accuracy. We also evaluate the model's accuracy as we reduce the number of time points, quantifying the trade-off between temporal resolution and maturity prediction. The predictive model offers a scalable, objective, and efficient means of assessing crop maturity, enabling phenomics and ML approaches to reduce the reliance on manual inspection and subjective assessment. This approach enables the automatic prediction of relative maturity ratings in a breeding program, saving time and resources.

Title: Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation

Authors: Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09706
Pdf URL: https://arxiv.org/pdf/2412.09706
Copy Paste: [[2412.09706]] Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation(https://arxiv.org/abs/2412.09706)
Keywords: diffusion, generative
Abstract: Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce IT3A, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text with the original test data. This process allows us to filter out some spurious augmentation and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modals, we have replaced prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on the test datasets with distribution shifts and domain gaps show that in a zero-shot setting, IT3A outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.

Title: GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers

Authors: Sarkar Snigdha Sarathi Das, Ryo Kamoi, Bo Pang, Yusen Zhang, Caiming Xiong, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09722
Pdf URL: https://arxiv.org/pdf/2412.09722
Copy Paste: [[2412.09722]] GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers(https://arxiv.org/abs/2412.09722)
Keywords: large language model
Abstract: The effectiveness of large language models (LLMs) is closely tied to the design of prompts, making prompt optimization essential for enhancing their performance across a wide range of tasks. Many existing approaches to automating prompt engineering rely exclusively on textual feedback, refining prompts based solely on inference errors identified by large, computationally expensive LLMs. Unfortunately, smaller models struggle to generate high-quality feedback, resulting in complete dependence on large LLM judgment. Moreover, these methods fail to leverage more direct and finer-grained information, such as gradients, due to operating purely in text space. To this end, we introduce GReaTer, a novel prompt optimization technique that directly incorporates gradient information over task-specific reasoning. By utilizing task loss gradients, GReaTer enables self-optimization of prompts for open-source, lightweight language models without the need for costly closed-source LLMs. This allows high-performance prompt optimization without dependence on massive LLMs, closing the gap between smaller models and the sophisticated reasoning often needed for prompt refinement. Extensive evaluations across diverse reasoning tasks including BBH, GSM8k, and FOLIO demonstrate that GReaTer consistently outperforms previous state-of-the-art prompt optimization methods, even those reliant on powerful LLMs. Additionally, GReaTer-optimized prompts frequently exhibit better transferability and, in some cases, boost task performance to levels comparable to or surpassing those achieved by larger language models, highlighting the effectiveness of prompt optimization guided by gradients over reasoning. Code of GReaTer is available at this https URL.

Title: The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications

Authors: Binxu Wang, John J. Vastola
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09726
Pdf URL: https://arxiv.org/pdf/2412.09726
Copy Paste: [[2412.09726]] The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications(https://arxiv.org/abs/2412.09726)
Keywords: diffusion
Abstract: By learning the gradient of smoothed data distributions, diffusion models can iteratively generate samples from complex distributions. The learned score function enables their generalization capabilities, but how the learned score relates to the score of the underlying data manifold remains largely unclear. Here, we aim to elucidate this relationship by comparing learned neural scores to the scores of two kinds of analytically tractable distributions: Gaussians and Gaussian mixtures. The simplicity of the Gaussian model makes it theoretically attractive, and we show that it admits a closed-form solution and predicts many qualitative aspects of sample generation dynamics. We claim that the learned neural score is dominated by its linear (Gaussian) approximation for moderate to high noise scales, and supply both theoretical and empirical arguments to support this claim. Moreover, the Gaussian approximation empirically works for a larger range of noise scales than naive theory suggests it should, and is preferentially learned early in training. At smaller noise scales, we observe that learned scores are better described by a coarse-grained (Gaussian mixture) approximation of training data than by the score of the training distribution, a finding consistent with generalization. Our findings enable us to precisely predict the initial phase of trained models' sampling trajectories through their Gaussian approximations. We show that this allows the skipping of the first 15-30% of sampling steps while maintaining high sample quality (with a near state-of-the-art FID score of 1.93 on CIFAR-10 unconditional generation). This forms the foundation of a novel hybrid sampling method, termed analytical teleportation, which can seamlessly integrate with and accelerate existing samplers, including DPM-Solver-v3 and UniPC. Our findings suggest ways to improve the design and training of diffusion models.

Title: Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models

Authors: Faith Johnson, Ryan Meegan, Jack Lowry, Peter Oudemans, Kristin Dana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09739
Pdf URL: https://arxiv.org/pdf/2412.09739
Copy Paste: [[2412.09739]] Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models(https://arxiv.org/abs/2412.09739)
Keywords: interpretability, transformer, segmentation
Abstract: Agricultural domains are being transformed by recent advances in AI and computer vision that support quantitative visual evaluation. Using aerial and ground imaging over a time series, we develop a framework for characterizing the ripening process of cranberry crops, a crucial component for precision agriculture tasks such as comparing crop breeds (high-throughput phenotyping) and detecting disease. Using drone imaging, we capture images from 20 waypoints across multiple bogs, and using ground-based imaging (hand-held camera), we image same bog patch using fixed fiducial markers. Both imaging methods are repeated to gather a multi-week time series spanning the entire growing season. Aerial imaging provides multiple samples to compute a distribution of albedo values. Ground imaging enables tracking of individual berries for a detailed view of berry appearance changes. Using vision transformers (ViT) for feature detection after segmentation, we extract a high dimensional feature descriptor of berry appearance. Interpretability of appearance is critical for plant biologists and cranberry growers to support crop breeding decisions (e.g.\ comparison of berry varieties from breeding programs). For interpretability, we create a 2D manifold of cranberry appearance by using a UMAP dimensionality reduction on ViT features. This projection enables quantification of ripening paths and a useful metric of ripening rate. We demonstrate the comparison of four cranberry varieties based on our ripening assessments. This work is the first of its kind and has future impact for cranberries and for other crops including wine grapes, olives, blueberries, and maize. Aerial and ground datasets are made publicly available.

Title: Bad Crypto: Chessography and Weak Randomness of Chess Games

Authors: Martin Stanek
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.09742
Pdf URL: https://arxiv.org/pdf/2412.09742
Copy Paste: [[2412.09742]] Bad Crypto: Chessography and Weak Randomness of Chess Games(https://arxiv.org/abs/2412.09742)
Keywords: security
Abstract: This short communication shows that the Chessography encryption scheme is incorrect, redundant, and the the security claims based on the complexity of chess games are unjustified. It also demonstrates an insufficient randomness in the final chess game positions, which could be of separate interest.

Title: ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

Authors: Ali Athar, Xueqing Deng, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09754
Pdf URL: https://arxiv.org/pdf/2412.09754
Copy Paste: [[2412.09754]] ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation(https://arxiv.org/abs/2412.09754)
Keywords: large language model, segmentation
Abstract: Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: this https URL

Title: Private Synthetic Data Generation in Small Memory

Authors: Rayne Holland, Seyit Camtepe, Chandra Thapa, Jason Xue
Subjects: cs.CR, cs.DS
Abstract URL: https://arxiv.org/abs/2412.09756
Pdf URL: https://arxiv.org/pdf/2412.09756
Copy Paste: [[2412.09756]] Private Synthetic Data Generation in Small Memory(https://arxiv.org/abs/2412.09756)
Keywords: privacy, protect
Abstract: Protecting sensitive information on data streams is a critical challenge for modern systems. Current approaches to privacy in data streams follow two strategies. The first transforms the stream into a private sequence, enabling the use of non-private analyses but incurring high memory costs. The second uses compact data structures to create private summaries but restricts flexibility to predefined queries. To address these limitations, we propose $\textsf{PrivHP}$, a lightweight synthetic data generator that ensures differential privacy while being resource-efficient. $\textsf{PrivHP}$ generates private synthetic data that preserves the input stream's distribution, allowing flexible downstream analyses without additional privacy costs. It leverages a hierarchical decomposition of the domain, pruning low-frequency subdomains while preserving high-frequency ones in a privacy-preserving manner. To achieve memory efficiency in streaming contexts, $\textsf{PrivHP}$ uses private sketches to estimate subdomain frequencies without accessing the full dataset. $\textsf{PrivHP}$ is parameterized by a privacy budget $\varepsilon$, a pruning parameter $k$ and the sketch width $w$. It can process a dataset of size $n$ in $\mathcal{O}((w+k)\log (\varepsilon n))$ space, $\mathcal{O}(\log (\varepsilon n))$ update time, and outputs a private synthetic data generator in $\mathcal{O}(k\log k\log (\varepsilon n))$ time. Prior methods require $\Omega(n)$ space and construction time. Our evaluation uses the expected 1-Wasserstein distance between the sampler and the empirical distribution. Compared to state-of-the-art methods, we demonstrate that the additional cost in utility is inversely proportional to $k$ and $w$. This represents the first meaningful trade-off between performance and utility for private synthetic data generation.

Title: L-WISE: Boosting Human Image Category Learning Through Model-Based Image Selection And Enhancement

Authors: Morgan B. Talbot, Gabriel Kreiman, James J. DiCarlo, Guy Gaziv
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2412.09765
Pdf URL: https://arxiv.org/pdf/2412.09765
Copy Paste: [[2412.09765]] L-WISE: Boosting Human Image Category Learning Through Model-Based Image Selection And Enhancement(https://arxiv.org/abs/2412.09765)
Keywords: robust
Abstract: The currently leading artificial neural network (ANN) models of the visual ventral stream -- which are derived from a combination of performance optimization and robustification methods -- have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. Extending upon previous work, we show that not only can these models guide image perturbations that change the induced human category percepts, but they also can enhance human ability to accurately report the original ground truth. Furthermore, we find that the same models can also be used out-of-the-box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) using image perturbations that aid recognition for novice learners. We find that combining these model-based strategies gives rise to test-time categorization accuracy gains of 33-72% relative to control subjects without these interventions, despite using the same number of training feedback trials. Surprisingly, beyond the accuracy gain, the training time for the augmented learning group was also shorter by 20-23%. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as tasks in two clinically relevant image domains -- histology and dermoscopy -- where visual learning is notoriously challenging. To the best of our knowledge, this is the first application of ANNs to increase visual learning performance in humans by enhancing category-specific features.

Title: A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization

Authors: Chi-Jui Ho, Yash Belhe, Steve Rotenberg, Ravi Ramamoorthi, Tzu-Mao Li, Nicholas Antipa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09774
Pdf URL: https://arxiv.org/pdf/2412.09774
Copy Paste: [[2412.09774]] A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization(https://arxiv.org/abs/2412.09774)
Keywords: robust
Abstract: End-to-end optimization, which simultaneously optimizes optics and algorithms, has emerged as a powerful data-driven method for computational imaging system design. This method achieves joint optimization through backpropagation by incorporating differentiable optics simulators to generate measurements and algorithms to extract information from measurements. However, due to high computational costs, it is challenging to model both aberration and diffraction in light transport for end-to-end optimization of compound optics. Therefore, most existing methods compromise physical accuracy by neglecting wave optics effects or off-axis aberrations, which raises concerns about the robustness of the resulting designs. In this paper, we propose a differentiable optics simulator that efficiently models both aberration and diffraction for compound optics. Using the simulator, we conduct end-to-end optimization on scene reconstruction and classification. Experimental results demonstrate that both lenses and algorithms adopt different configurations depending on whether wave optics is modeled. We also show that systems optimized without wave optics suffer from performance degradation when wave optics effects are introduced during testing. These findings underscore the importance of accurate wave optics modeling in optimizing imaging systems for robust, high-performance applications.

Title: Is it the model or the metric -- On robustness measures of deeplearning models

Authors: Zhijin Lyu, Yutong Jin, Sneha Das
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09795
Pdf URL: https://arxiv.org/pdf/2412.09795
Copy Paste: [[2412.09795]] Is it the model or the metric -- On robustness measures of deeplearning models(https://arxiv.org/abs/2412.09795)
Keywords: robust
Abstract: Determining the robustness of deep learning models is an established and ongoing challenge within automated decision-making systems. With the advent and success of techniques that enable advanced deep learning (DL), these models are being used in widespread applications, including high-stake ones like healthcare, education, border-control. Therefore, it is critical to understand the limitations of these models and predict their regions of failures, in order to create the necessary guardrails for their successful and safe deployment. In this work, we revisit robustness, specifically investigating the sufficiency of robust accuracy (RA), within the context of deepfake detection. We present robust ratio (RR) as a complementary metric, that can quantify the changes to the normalized or probability outcomes under input perturbation. We present a comparison of RA and RR and demonstrate that despite similar RA between models, the models show varying RR under different tolerance (perturbation) levels.

Title: AutoPatent: A Multi-Agent Framework for Automatic Patent Generation

Authors: Qiyao Wang, Shiwen Ni, Huaren Liu, Shule Lu, Guhong Chen, Xi Feng, Chi Wei, Qiang Qu, Hamid Alinejad-Rokny, Yuan Lin, Min Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09796
Pdf URL: https://arxiv.org/pdf/2412.09796
Copy Paste: [[2412.09796]] AutoPatent: A Multi-Agent Framework for Automatic Patent Generation(https://arxiv.org/abs/2412.09796)
Keywords: large language model
Abstract: As the capabilities of Large Language Models (LLMs) continue to advance, the field of patent processing has garnered increased attention within the natural language processing community. However, the majority of research has been concentrated on classification tasks, such as patent categorization and examination, or on short text generation tasks like patent summarization and patent quizzes. In this paper, we introduce a novel and practical task known as Draft2Patent, along with its corresponding D2P benchmark, which challenges LLMs to generate full-length patents averaging 17K tokens based on initial drafts. Patents present a significant challenge to LLMs due to their specialized nature, standardized terminology, and extensive length. We propose a multi-agent framework called AutoPatent which leverages the LLM-based planner agent, writer agents, and examiner agent with PGTree and RRAG to generate lengthy, intricate, and high-quality complete patent documents. The experimental results demonstrate that our AutoPatent framework significantly enhances the ability to generate comprehensive patents across various LLMs. Furthermore, we have discovered that patents generated solely with the AutoPatent framework based on the Qwen2.5-7B model outperform those produced by larger and more powerful LLMs, such as GPT-4o, Qwen2.5-72B, and LLAMA3.1-70B, in both objective metrics and human evaluations. We will make the data and code available upon acceptance at \url{this https URL}.

Title: deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile

Authors: Duncan Taylor, Melissa A. Humphries
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.09803
Pdf URL: https://arxiv.org/pdf/2412.09803
Copy Paste: [[2412.09803]] deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile(https://arxiv.org/abs/2412.09803)
Keywords: explainability
Abstract: A common task in forensic biology is to interpret and evaluate short tandem repeat DNA profiles. The first step in these interpretations is to assign a number of contributors to the profiles, a task that is most often performed manually by a scientist using their knowledge of DNA profile behaviour. Studies using constructed DNA profiles have shown that as DNA profiles become more complex, and the number of DNA-donating individuals increases, the ability for scientists to assign the target number. There have been a number of machine learning algorithms developed that seek to assign the number of contributors to a DNA profile, however due to practical limitations in being able to generate DNA profiles in a laboratory, the algorithms have been based on summaries of the available information. In this work we develop an analysis pipeline that simulates the electrophoretic signal of an STR profile, allowing virtually unlimited, pre-labelled training material to be generated. We show that by simulating 100 000 profiles and training a number of contributors estimation tool using a deep neural network architecture (in an algorithm named deepNoC) that a high level of performance is achieved (89% for 1 to 10 contributors). The trained network can then have fine-tuning training performed with only a few hundred profiles in order to achieve the same accuracy within a specific laboratory. We also build into deepNoC secondary outputs that provide a level of explainability to a user of algorithm, and show how they can be displayed in an intuitive manner.

Title: LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering

Authors: Patrick Sutanto, Joan Santoso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09807
Pdf URL: https://arxiv.org/pdf/2412.09807
Copy Paste: [[2412.09807]] LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering(https://arxiv.org/abs/2412.09807)
Keywords: large language model
Abstract: Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education. The high cost of building MCQA datasets makes few-shot learning pivotal in this domain. While Large Language Models (LLMs) can enable few-shot learning, their direct application in real-world scenarios is often hindered by their high computational cost. To address this challenge, we propose a simple yet effective approach that uses LLMs for data generation and scoring. Our approach utilizes LLMs to create MCQA data which contains questions and choices, and to assign probability scores to the generated choices. We then use the generated data and LLM-assigned scores to finetune a smaller and more efficient encoder-only model, DeBERTa-v3-base by leveraging distillation loss. Extensive experiments on the Massive Multitask Language Understanding (MMLU) benchmark demonstrate that our method improves accuracy from 28.9% to 39.3%, representing a gain of over 10% compared to a baseline finetuned directly on 5-shot examples. This shows the effectiveness of LLM-driven data generation and knowledge distillation for few-shot MCQA.

Title: ScaleOT: Privacy-utility-scalable Offsite-tuning with Dynamic LayerReplace and Selective Rank Compression

Authors: Kai Yao, Zhaorui Tan, Tiandi Ye, Lichun Li, Yuan Zhao, Wenyan Liu, Wei Wang, Jianke Zhu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.09812
Pdf URL: https://arxiv.org/pdf/2412.09812
Copy Paste: [[2412.09812]] ScaleOT: Privacy-utility-scalable Offsite-tuning with Dynamic LayerReplace and Selective Rank Compression(https://arxiv.org/abs/2412.09812)
Keywords: privacy, protect, large language model
Abstract: Offsite-tuning is a privacy-preserving method for tuning large language models (LLMs) by sharing a lossy compressed emulator from the LLM owners with data owners for downstream task tuning. This approach protects the privacy of both the model and data owners. However, current offsite tuning methods often suffer from adaptation degradation, high computational costs, and limited protection strength due to uniformly dropping LLM layers or relying on expensive knowledge distillation. To address these issues, we propose ScaleOT, a novel privacy-utility-scalable offsite-tuning framework that effectively balances privacy and utility. ScaleOT introduces a novel layerwise lossy compression algorithm that uses reinforcement learning to obtain the importance of each layer. It employs lightweight networks, termed harmonizers, to replace the raw LLM layers. By combining important original LLM layers and harmonizers in different ratios, ScaleOT generates emulators tailored for optimal performance with various model scales for enhanced privacy protection. Additionally, we present a rank reduction method to further compress the original LLM layers, significantly enhancing privacy with negligible impact on utility. Comprehensive experiments show that ScaleOT can achieve nearly lossless offsite tuning performance compared with full fine-tuning while obtaining better model privacy.

Title: Temporal Causal Discovery in Dynamic Bayesian Networks Using Federated Learning

Authors: Jianhong Chen, Ying Ma, Xubo Yue
Subjects: cs.LG, cs.AI, stat.CO
Abstract URL: https://arxiv.org/abs/2412.09814
Pdf URL: https://arxiv.org/pdf/2412.09814
Copy Paste: [[2412.09814]] Temporal Causal Discovery in Dynamic Bayesian Networks Using Federated Learning(https://arxiv.org/abs/2412.09814)
Keywords: security, privacy, federate
Abstract: Traditionally, learning the structure of a Dynamic Bayesian Network has been centralized, with all data pooled in one location. However, in real-world scenarios, data are often dispersed among multiple parties (e.g., companies, devices) that aim to collaboratively learn a Dynamic Bayesian Network while preserving their data privacy and security. In this study, we introduce a federated learning approach for estimating the structure of a Dynamic Bayesian Network from data distributed horizontally across different parties. We propose a distributed structure learning method that leverages continuous optimization so that only model parameters are exchanged during optimization. Experimental results on synthetic and real datasets reveal that our method outperforms other state-of-the-art techniques, particularly when there are many clients with limited individual sample sizes.

Title: Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation

Authors: Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, Jiawei Yao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09817
Pdf URL: https://arxiv.org/pdf/2412.09817
Copy Paste: [[2412.09817]] Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation(https://arxiv.org/abs/2412.09817)
Keywords: interpretability, large language model
Abstract: Multimodal large language models have experienced rapid growth, and numerous different models have emerged. The interpretability of LVLMs remains an under-explored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, its internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to text are more likely to have information flow convergence in the LLM decoding layer, and these image tokens receive higher attention scores. However, those image tokens that are less relevant to the text do not have information flow convergence, and they only get very small attention scores. To efficiently utilize the image information, we propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method for complex reasoning tasks. The paper's source code can be accessed from \url{this https URL}.

Title: MERaLiON-AudioLLM: Technical Report

Authors: Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09818
Pdf URL: https://arxiv.org/pdf/2412.09818
Copy Paste: [[2412.09818]] MERaLiON-AudioLLM: Technical Report(https://arxiv.org/abs/2412.09818)
Keywords: large language model
Abstract: We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore's multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region specific AI applications. We envision this release to set a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.

Title: FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks

Authors: Ahmadreza Eslaminia, Adrian Jackson, Beitong Tian, Avi Stern, Hallie Gordon, Rajiv Malhotra, Klara Nahrstedt, Chenhui Shao
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2412.09819
Pdf URL: https://arxiv.org/pdf/2412.09819
Copy Paste: [[2412.09819]] FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks(https://arxiv.org/abs/2412.09819)
Keywords: large language model
Abstract: Fused Deposition Modeling (FDM) is a widely used additive manufacturing (AM) technique valued for its flexibility and cost-efficiency, with applications in a variety of industries including healthcare and aerospace. Recent developments have made affordable FDM machines accessible and encouraged adoption among diverse users. However, the design, planning, and production process in FDM require specialized interdisciplinary knowledge. Managing the complex parameters and resolving print defects in FDM remain challenging. These technical complexities form the most critical barrier preventing individuals without technical backgrounds and even professional engineers without training in other domains from participating in AM design and manufacturing. Large Language Models (LLMs), with their advanced capabilities in text and code processing, offer the potential for addressing these challenges in FDM. However, existing research on LLM applications in this field is limited, typically focusing on specific use cases without providing comprehensive evaluations across multiple models and tasks. To this end, we introduce FDM-Bench, a benchmark dataset designed to evaluate LLMs on FDM-specific tasks. FDM-Bench enables a thorough assessment by including user queries across various experience levels and G-code samples that represent a range of anomalies. We evaluate two closed-source models (GPT-4o and Claude 3.5 Sonnet) and two open-source models (Llama-3.1-70B and Llama-3.1-405B) on FDM-Bench. A panel of FDM experts assess the models' responses to user queries in detail. Results indicate that closed-source models generally outperform open-source models in G-code anomaly detection, whereas Llama-3.1-405B demonstrates a slight advantage over other models in responding to user queries. These findings underscore FDM-Bench's potential as a foundational tool for advancing research on LLM capabilities in FDM.

Title: Empowering Patients for Disease Diagnosis and Clinical Treatment: A Smart Contract-Enabled Informed Consent Strategy

Authors: Md Al Amin, Hemanth Tummala, Rushabh Shah, Indrajit Ray
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.09820
Pdf URL: https://arxiv.org/pdf/2412.09820
Copy Paste: [[2412.09820]] Empowering Patients for Disease Diagnosis and Clinical Treatment: A Smart Contract-Enabled Informed Consent Strategy(https://arxiv.org/abs/2412.09820)
Keywords: security, privacy, protect, robust
Abstract: Digital healthcare systems have revolutionized medical services, facilitating provider collaboration, enhancing diagnosis, and optimizing and improving treatments. They deliver superior quality, faster, reliable, and cost-effective services. Researchers are addressing pressing health challenges by integrating information technology, computing resources, and digital health records. However, digitizing healthcare introduces significant risks to patient data privacy and security, with the potential for unauthorized access to protected health information. Although patients can authorize data access through consent, there is a pressing need for mechanisms to ensure such given consent is informed and executed properly and timely. Patients deserve transparency and accountability regarding the access to their data: who access it, when, and under what circumstances. Current healthcare systems, often centralized, leave much to be desired in managing these concerns, leading to numerous security incidents. To address these issues, we propose a system based on blockchain and smart contracts for managing informed consent for accessing health records by the treatment team members, incorporating safeguards to verify that consent processes are correctly executed. Blockchain's inherent immutability ensures the integrity of consent. Smart contracts automatically execute agreements, enhancing accountability. They provide a robust framework for protecting patient privacy in the digital age. Experimental evaluations show that the proposed approach can be integrated easily with the existing healthcare systems without incurring financial and technological challenges.

Title: Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism

Authors: Jun Zheng, Jing Wang, Fuwei Zhao, Xujie Zhang, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09822
Pdf URL: https://arxiv.org/pdf/2412.09822
Copy Paste: [[2412.09822]] Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism(https://arxiv.org/abs/2412.09822)
Keywords: diffusion, transformer
Abstract: Video try-on stands as a promising area for its tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, while performing poorly with complex movements. To better preserve clothing details, those approaches are armed with an additional garment encoder, resulting in higher computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder's capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on Diffusion Transformer(DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach by utilizing the DiT backbone itself as the garment encoder and employing a dynamic feature fusion module to store and integrate garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that enforces the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.

Title: Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

Authors: Changqun Li, Chaofan Ding, Kexin Luan, Xinhan Di
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09827
Pdf URL: https://arxiv.org/pdf/2412.09827
Copy Paste: [[2412.09827]] Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models(https://arxiv.org/abs/2412.09827)
Keywords: large language model
Abstract: Fine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. LoRA is one of the most widely used methods, which assumes that the optimization process is essentially low dimensional. Although LoRA has demonstrated commendable performance, there remains a significant performance gap between LoRA and full fine-tuning when learning new tasks. In this work, we propose Low-Rank Adaptation with Task-Relevant Feature Enhancement(LoRATRF) for enhancing task-relevant features from the perspective of editing neural network representations. To prioritize task-relevant features, a task-aware filter that selectively extracts valuable knowledge from hidden representations for the target or current task is designed. As the experiments on a vareity of datasets including NLU, commonsense reasoning and mathematical reasoning tasks demonstrates, our method reduces 33.71% parameters and achieves better performance on a variety of datasets in comparison with SOTA low-rank methods.

Title: MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

Authors: Xunnong Xu, Mengying Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09828
Pdf URL: https://arxiv.org/pdf/2412.09828
Copy Paste: [[2412.09828]] MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion(https://arxiv.org/abs/2412.09828)
Keywords: diffusion, transformer, generative
Abstract: Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates on different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.

Title: Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training

Authors: Yujin Choi, Jinseong Park, Junyoung Byun, Jaewook Lee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09842
Pdf URL: https://arxiv.org/pdf/2412.09842
Copy Paste: [[2412.09842]] Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training(https://arxiv.org/abs/2412.09842)
Keywords: privacy, diffusion, generative
Abstract: Programmatically generated synthetic data has been used in differential private training for classification to enhance performance without privacy leakage. However, as the synthetic data is generated from a random process, the distribution of real data and the synthetic data are distinguishable and difficult to transfer. Therefore, the model trained with the synthetic data generates unrealistic random images, raising challenges to adapt the synthetic data for generative models. In this work, we propose DP-SynGen, which leverages programmatically generated synthetic data in diffusion models to address this challenge. By exploiting the three stages of diffusion models(coarse, context, and cleaning) we identify stages where synthetic data can be effectively utilized. We theoretically and empirically verified that cleaning and coarse stages can be trained without private data, replacing them with synthetic data to reduce the privacy budget. The experimental results show that DP-SynGen improves the quality of generative data by mitigating the negative impact of privacy-induced noise on the generation process.

Title: Learning Structural Causal Models from Ordering: Identifiable Flow Models

Authors: Minh Khoa Le, Kien Do, Truyen Tran
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.09843
Pdf URL: https://arxiv.org/pdf/2412.09843
Copy Paste: [[2412.09843]] Learning Structural Causal Models from Ordering: Identifiable Flow Models(https://arxiv.org/abs/2412.09843)
Keywords: diffusion
Abstract: In this study, we address causal inference when only observational data and a valid causal ordering from the causal graph are available. We introduce a set of flow models that can recover component-wise, invertible transformation of exogenous variables. Our flow-based methods offer flexible model design while maintaining causal consistency regardless of the number of discretization steps. We propose design improvements that enable simultaneous learning of all causal mechanisms and reduce abduction and prediction complexity to linear O(n) relative to the number of layers, independent of the number of causal variables. Empirically, we demonstrate that our method outperforms previous state-of-the-art approaches and delivers consistent performance across a wide range of structural causal models in answering observational, interventional, and counterfactual questions. Additionally, our method achieves a significant reduction in computational time compared to existing diffusion-based techniques, making it practical for large structural causal models.

Title: Real-time Identity Defenses against Malicious Personalization of Diffusion Models

Authors: Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, Chongxuan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09844
Pdf URL: https://arxiv.org/pdf/2412.09844
Copy Paste: [[2412.09844]] Real-time Identity Defenses against Malicious Personalization of Diffusion Models(https://arxiv.org/abs/2412.09844)
Keywords: defense, attack, robust, diffusion
Abstract: Personalized diffusion models, capable of synthesizing highly realistic images based on a few reference portraits, pose substantial social, ethical, and legal risks by enabling identity replication. Existing defense mechanisms rely on computationally intensive adversarial perturbations tailored to individual images, rendering them impractical for real-world deployment. This study introduces Real-time Identity Defender (RID), a neural network designed to generate adversarial perturbations through a single forward pass, bypassing the need for image-specific optimization. RID achieves unprecedented efficiency, with defense times as low as 0.12 seconds on a single GPU (4,400 times faster than leading methods) and 1.1 seconds per image on a standard Intel i9 CPU, making it suitable for edge devices such as smartphones. Despite its efficiency, RID matches state-of-the-art performance across visual and quantitative benchmarks, effectively mitigating identity replication risks. Our analysis reveals that RID's perturbations mimic the efficacy of traditional defenses while exhibiting properties distinct from natural noise, such as Gaussian perturbations. To enhance robustness, we extend RID into an ensemble framework that integrates multiple pre-trained text-to-image diffusion models, ensuring resilience against black-box attacks and post-processing techniques, including JPEG compression and diffusion-based purification.

Title: LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Authors: Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09856
Pdf URL: https://arxiv.org/pdf/2412.09856
Copy Paste: [[2412.09856]] LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity(https://arxiv.org/abs/2412.09856)
Keywords: diffusion, transformer
Abstract: Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15$\times$ (11.5$\times$) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our project website: this https URL.

Title: Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction

Authors: Liu Jing, Amirul Rahman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09870
Pdf URL: https://arxiv.org/pdf/2412.09870
Copy Paste: [[2412.09870]] Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction(https://arxiv.org/abs/2412.09870)
Keywords: robust
Abstract: Semantic location prediction from multimodal social media posts is a critical task with applications in personalized services and human mobility analysis. This paper introduces \textit{Contextualized Vision-Language Alignment (CoVLA)}, a discriminative framework designed to address the challenges of contextual ambiguity and modality discrepancy inherent in this task. CoVLA leverages a Contextual Alignment Module (CAM) to enhance cross-modal feature alignment and a Cross-modal Fusion Module (CMF) to dynamically integrate textual and visual information. Extensive experiments on a benchmark dataset demonstrate that CoVLA significantly outperforms state-of-the-art methods, achieving improvements of 2.3\% in accuracy and 2.5\% in F1-score. Ablation studies validate the contributions of CAM and CMF, while human evaluations highlight the contextual relevance of the predictions. Additionally, robustness analysis shows that CoVLA maintains high performance under noisy conditions, making it a reliable solution for real-world applications. These results underscore the potential of CoVLA in advancing semantic location prediction research.

Title: Byte Latent Transformer: Patches Scale Better Than Tokens

Authors: Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09871
Pdf URL: https://arxiv.org/pdf/2412.09871
Copy Paste: [[2412.09871]] Byte Latent Transformer: Patches Scale Better Than Tokens(https://arxiv.org/abs/2412.09871)
Keywords: robust, transformer
Abstract: We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.

Title: Selective State Space Memory for Large Vision-Language Models

Authors: Chee Ng, Yuen Fung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09875
Pdf URL: https://arxiv.org/pdf/2412.09875
Copy Paste: [[2412.09875]] Selective State Space Memory for Large Vision-Language Models(https://arxiv.org/abs/2412.09875)
Keywords: robust, interpretability
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization capabilities. Comprehensive analysis further validates the advantages of SSMI in terms of efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.

Title: On the Limit of Language Models as Planning Formalizers

Authors: Cassie Huang, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09879
Pdf URL: https://arxiv.org/pdf/2412.09879
Copy Paste: [[2412.09879]] On the Limit of Language Models as Planning Formalizers(https://arxiv.org/abs/2412.09879)
Keywords: robust, large language model
Abstract: Large Language Models have been shown to fail to create executable and verifiable plans in grounded environments. An emerging line of work shows success in using LLM as a formalizer to generate a formal representation (e.g., PDDL) of the planning domain, which can be deterministically solved to find a plan. We systematically evaluate this methodology while bridging some major gaps. While previous work only generates a partial PDDL representation given templated and thus unrealistic environment descriptions, we generate the complete representation given descriptions of various naturalness levels. Among an array of observations critical to improve LLMs' formal planning ability, we note that large enough models can effectively formalize descriptions as PDDL, outperforming those directly generating plans, while being robust to lexical perturbation. As the descriptions become more natural-sounding, we observe a decrease in performance and provide detailed error analysis.

Title: Sharpening Your Density Fields: Spiking Neuron Aided Fast Geometry Learning

Authors: Yi Gu, Zhaorui Wang, Dongjun Ye, Renjing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09881
Pdf URL: https://arxiv.org/pdf/2412.09881
Copy Paste: [[2412.09881]] Sharpening Your Density Fields: Spiking Neuron Aided Fast Geometry Learning(https://arxiv.org/abs/2412.09881)
Keywords: robust, extraction
Abstract: Neural Radiance Fields (NeRF) have achieved remarkable progress in neural rendering. Extracting geometry from NeRF typically relies on the Marching Cubes algorithm, which uses a hand-crafted threshold to define the level set. However, this threshold-based approach requires laborious and scenario-specific tuning, limiting its practicality for real-world applications. In this work, we seek to enhance the efficiency of this method during the training time. To this end, we introduce a spiking neuron mechanism that dynamically adjusts the threshold, eliminating the need for manual selection. Despite its promise, directly training with the spiking neuron often results in model collapse and noisy outputs. To overcome these challenges, we propose a round-robin strategy that stabilizes the training process and enables the geometry network to achieve a sharper and more precise density distribution with minimal computational overhead. We validate our approach through extensive experiments on both synthetic and real-world datasets. The results show that our method significantly improves the performance of threshold-based techniques, offering a more robust and efficient solution for NeRF geometry extraction.

Title: Benchmarking Table Comprehension In The Wild

Authors: Yikang Pan, Yi Zhu, Rand Xie, Yizhi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09884
Pdf URL: https://arxiv.org/pdf/2412.09884
Copy Paste: [[2412.09884]] Benchmarking Table Comprehension In The Wild(https://arxiv.org/abs/2412.09884)
Keywords: large language model
Abstract: Large Language Models (LLMs), while being increasingly dominant on a myriad of knowledge-intensive activities, have only had limited success understanding lengthy table-text mixtures, such as academic papers and financial reports. Recent advances of long-context LLMs have opened up new possibilities for this field. Nonetheless, we identify two roadblocks: (1) Prior benchmarks of table question answering (TableQA) have focused on isolated tables without context, making it hard to evaluate models in real-world scenarios. (2) Prior benchmarks have focused on some narrow skill sets of table comprehension such as table recognition, data manipulation/calculation, table summarization etc., while a skilled human employs those skills collectively. In this work, we introduce TableQuest, a new benchmark designed to evaluate the holistic table comprehension capabilities of LLMs in the natural table-rich context of financial reports. We employ a rigorous data processing and filtering procedure to ensure that the question-answer pairs are logical, reasonable, and diverse. We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations. We conclude with a qualitative study of the failure modes and discuss the challenges of constructing a challenging benchmark. We make the evaluation data, judging procedure and results of this study publicly available to facilitate research in this field.

Title: T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements

Authors: Xiangxi Tian, Jie Shan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09886
Pdf URL: https://arxiv.org/pdf/2412.09886
Copy Paste: [[2412.09886]] T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements(https://arxiv.org/abs/2412.09886)
Keywords: extraction, transformer, generative
Abstract: Generating continuous environmental models from sparsely sampled data is a critical challenge in spatial modeling, particularly for topography. Traditional spatial interpolation methods often struggle with handling sparse measurements. To address this, we propose a Transformer-based Generative Model for Spatial Interpolation (T-GMSI) using a vision transformer (ViT) architecture for digital elevation model (DEM) generation under sparse conditions. T-GMSI replaces traditional convolution-based methods with ViT for feature extraction and DEM interpolation while incorporating a terrain feature-aware loss function for enhanced accuracy. T-GMSI excels in producing high-quality elevation surfaces from datasets with over 70% sparsity and demonstrates strong transferability across diverse landscapes without fine-tuning. Its performance is validated through extensive experiments, outperforming traditional methods such as ordinary Kriging (OK) and natural neighbor (NN) and a conditional generative adversarial network (CGAN)-based model (CEDGAN). Compared to OK and NN, T-GMSI reduces root mean square error (RMSE) by 40% and 25% on airborne lidar data and by 23% and 10% on spaceborne lidar data. Against CEDGAN, T-GMSI achieves a 20% RMSE improvement on provided DEM data, requiring no fine-tuning. The ability of model on generalizing to large, unseen terrains underscores its transferability and potential applicability beyond topographic modeling. This research establishes T-GMSI as a state-of-the-art solution for spatial interpolation on sparse datasets and highlights its broader utility for other sparse data interpolation challenges.

Title: Analyzing Fairness of Classification Machine Learning Model with Structured Dataset

Authors: Ahmed Rashed, Abdelkrim Kallich, Mohamed Eltayeb
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09896
Pdf URL: https://arxiv.org/pdf/2412.09896
Copy Paste: [[2412.09896]] Analyzing Fairness of Classification Machine Learning Model with Structured Dataset(https://arxiv.org/abs/2412.09896)
Keywords: robust, fair
Abstract: Machine learning (ML) algorithms have become integral to decision making in various domains, including healthcare, finance, education, and law enforcement. However, concerns about fairness and bias in these systems pose significant ethical and social challenges. This study investigates the fairness of ML models applied to structured datasets in classification tasks, highlighting the potential for biased predictions to perpetuate systemic inequalities. A publicly available dataset from Kaggle was selected for analysis, offering a realistic scenario for evaluating fairness in machine learning workflows. To assess and mitigate biases, three prominent fairness libraries; Fairlearn by Microsoft, AIF360 by IBM, and the What If Tool by Google were employed. These libraries provide robust frameworks for analyzing fairness, offering tools to evaluate metrics, visualize results, and implement bias mitigation strategies. The research aims to assess the extent of bias in the ML models, compare the effectiveness of these libraries, and derive actionable insights for practitioners. The findings reveal that each library has unique strengths and limitations in fairness evaluation and mitigation. By systematically comparing their capabilities, this study contributes to the growing field of ML fairness by providing practical guidance for integrating fairness tools into real world applications. These insights are intended to support the development of more equitable machine learning systems.

Title: Analyzing Fairness of Computer Vision and Natural Language Processing Models

Authors: Ahmed Rashed, Abdelkrim Kallich, Mohamed Eltayeb
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09900
Pdf URL: https://arxiv.org/pdf/2412.09900
Copy Paste: [[2412.09900]] Analyzing Fairness of Computer Vision and Natural Language Processing Models(https://arxiv.org/abs/2412.09900)
Keywords: fair
Abstract: Machine learning (ML) algorithms play a crucial role in decision making across diverse fields such as healthcare, finance, education, and law enforcement. Despite their widespread adoption, these systems raise ethical and social concerns due to potential biases and fairness issues. This study focuses on evaluating and improving the fairness of Computer Vision and Natural Language Processing (NLP) models applied to unstructured datasets, emphasizing how biased predictions can reinforce existing systemic inequalities. A publicly available dataset from Kaggle was utilized to simulate a practical scenario for examining fairness in ML workflows. To address and mitigate biases, the study employed two leading fairness libraries: Fairlearn by Microsoft, and AIF360 by IBM. These tools offer comprehensive frameworks for fairness analysis, including metrics evaluation, result visualization, and bias mitigation techniques. The research aims to measure bias levels in ML models, compare the effectiveness of these fairness libraries, and provide actionable recommendations for practitioners. The results demonstrate that each library possesses distinct strengths and limitations in evaluating and mitigating fairness. By systematically analyzing these tools, the study contributes valuable insights to the growing field of ML fairness, offering practical guidance for integrating fairness solutions into real world applications. This research underscores the importance of building more equitable and responsible machine learning systems.

Title: Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning

Authors: Jing Bi, Yuting Wu, Weiwei Xing, Zhenjie Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09906
Pdf URL: https://arxiv.org/pdf/2412.09906
Copy Paste: [[2412.09906]] Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning(https://arxiv.org/abs/2412.09906)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Advances in prompt engineering and fine-tuning techniques have further enhanced their ability to address complex reasoning challenges. However, these advanced capabilities are often exclusive to models exceeding 100 billion parameters. Although Chain-of-Thought (CoT) fine-tuning methods have been explored for smaller models (under 10 billion parameters), they typically depend on extensive CoT training data, which can introduce inconsistencies and limit effectiveness in low-data settings. To overcome these limitations, this paper introduce a new reasoning strategy Solution Guidance (SG) and a plug-and-play training paradigm Solution-Guidance Fine-Tuning (SGFT) for enhancing the reasoning capabilities of small language models. SG focuses on problem understanding and decomposition at the semantic and logical levels, rather than specific computations, which can effectively improve the SLMs' generalization and reasoning abilities. With only a small amount of SG training data, SGFT can fine-tune a SLM to produce accurate problem-solving guidances, which can then be flexibly fed to any SLM as prompts, enabling it to generate correct answers directly. Experimental results demonstrate that our method significantly improves the performance of SLMs on various reasoning tasks, enhancing both their practicality and efficiency within resource-constrained environments.

Title: IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs

Authors: Sosuke Yamao, Natsuki Miyahara, Yuki Harazono, Shun Takeuchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09907
Pdf URL: https://arxiv.org/pdf/2412.09907
Copy Paste: [[2412.09907]] IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs(https://arxiv.org/abs/2412.09907)
Keywords: transformer
Abstract: With the increasing complexity of video data and the need for more efficient long-term temporal understanding, existing long-term video understanding methods often fail to accurately capture and analyze extended video sequences. These methods typically struggle to maintain performance over longer durations and to handle the intricate dependencies within the video content. To address these limitations, we propose a simple yet effective large multi-modal model framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC). The key idea, inspired by humans' selective attention and in-context memory mechanisms, is to introduce a novel visual compressor and incorporate efficient memory management techniques to enhance long-term video question answering. Our framework utilizes IQViC, a transformer-based visual compressor, enabling question-conditioned in-context compression, unlike existing methods that rely on full video visual features. This selectively extracts relevant information, significantly reducing memory token requirements. Through extensive experiments on a new dataset based on InfiniBench for long-term video understanding, and standard benchmarks used for existing methods' evaluation, we demonstrate the effectiveness of our proposed IQViC framework and its superiority over state-of-the-art methods in terms of video understanding accuracy and memory efficiency.

Title: Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

Authors: Yasamin Medghalchi, Moein Heidari, Clayton Allard, Leonid Sigal, Ilker Hacihaliloglu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09910
Pdf URL: https://arxiv.org/pdf/2412.09910
Copy Paste: [[2412.09910]] Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images(https://arxiv.org/abs/2412.09910)
Keywords: security, attack, diffusion
Abstract: Deep neural networks (DNNs) offer significant promise for improving breast cancer diagnosis in medical imaging. However, these models are highly susceptible to adversarial attacks--small, imperceptible changes that can mislead classifiers--raising critical concerns about their reliability and security. Traditional attacks rely on fixed-norm perturbations, misaligning with human perception. In contrast, diffusion-based attacks require pre-trained models, demanding substantial data when these models are unavailable, limiting practical use in data-scarce scenarios. In medical imaging, however, this is often unfeasible due to the limited availability of datasets. Building on recent advancements in learnable prompts, we propose Prompt2Perturb (P2P), a novel language-guided attack method capable of generating meaningful attack examples driven by text instructions. During the prompt learning phase, our approach leverages learnable prompts within the text encoder to create subtle, yet impactful, perturbations that remain imperceptible while guiding the model towards targeted outcomes. In contrast to current prompt learning-based approaches, our P2P stands out by directly updating text embeddings, avoiding the need for retraining diffusion models. Further, we leverage the finding that optimizing only the early reverse diffusion steps boosts efficiency while ensuring that the generated adversarial examples incorporate subtle noise, thus preserving ultrasound image quality without introducing noticeable artifacts. We show that our method outperforms state-of-the-art attack techniques across three breast ultrasound datasets in FID and LPIPS. Moreover, the generated images are both more natural in appearance and more effective compared to existing adversarial attacks. Our code will be publicly available this https URL.

Title: All-in-One: Transferring Vision Foundation Models into Stereo Matching

Authors: Jingyi Zhou, Haoyu Zhang, Jiakang Yuan, Peng Ye, Tao Chen, Hao Jiang, Meiya Chen, Yangyang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09912
Pdf URL: https://arxiv.org/pdf/2412.09912
Copy Paste: [[2412.09912]] All-in-One: Transferring Vision Foundation Models into Stereo Matching(https://arxiv.org/abs/2412.09912)
Keywords: extraction
Abstract: As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we proposed a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on the mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves start-of-the-art performance on multiple datasets and ranks $1^{st}$ on the Middlebury dataset and outperforms all the published work on the ETH3D benchmark.

Title: B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Authors: Zhuqiang Lu, Zhenfei Yin, Mengwei He, Zhihui Wang, Zicheng Liu, Zhiyong Wang, Kun Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09919
Pdf URL: https://arxiv.org/pdf/2412.09919
Copy Paste: [[2412.09919]] B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens(https://arxiv.org/abs/2412.09919)
Keywords: large language model
Abstract: Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key of VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to simultaneously process both visual and textual content. However, understanding videos, especially long videos, remain a challenge to VLLMs as the number of visual tokens grows rapidly when encoding videos, resulting in the risk of exceeding the context window of VLLMs and introducing heavy computation burden. To restrict the number of visual tokens, existing VLLMs either: (1) uniformly downsample videos into a fixed number of frames or (2) reducing the number of visual tokens encoded from each frame. We argue the former solution neglects the rich temporal cue in videos and the later overlooks the spatial details in each frame. In this work, we present Balanced-VLLM (B-VLLM): a novel VLLM framework that aims to effectively leverage task relevant spatio-temporal cues while restricting the number of visual tokens under the VLLM context window length. At the core of our method, we devise a text-conditioned adaptive frame selection module to identify frames relevant to the visual understanding task. The selected frames are then de-duplicated using a temporal frame token merging technique. The visual tokens of the selected frames are processed through a spatial token sampling module and an optional spatial token merging strategy to achieve precise control over the token count. Experimental results show that B-VLLM is effective in balancing the number of frames and visual tokens in video understanding, yielding superior performance on various video understanding benchmarks. Our code is available at this https URL.

Title: FaceShield: Defending Facial Image against Deepfake Threats

Authors: Jaehwan Jeong, Sumin In, Sieun Kim, Hannie Shin, Jongheon Jeong, Sang Ho Yoon, Jaewook Chung, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09921
Pdf URL: https://arxiv.org/pdf/2412.09921
Copy Paste: [[2412.09921]] FaceShield: Defending Facial Image against Deepfake Threats(https://arxiv.org/abs/2412.09921)
Keywords: protect, defense, attack, robust, extraction, diffusion, generative
Abstract: The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity verification is not critical. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel attack strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates attacks on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG distortion. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting applicability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.

Title: Simulating Hard Attention Using Soft Attention

Authors: Andy Yang, Lena Strobl, David Chiang, Dana Angluin
Subjects: cs.LG, cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2412.09925
Pdf URL: https://arxiv.org/pdf/2412.09925
Copy Paste: [[2412.09925]] Simulating Hard Attention Using Soft Attention(https://arxiv.org/abs/2412.09925)
Keywords: transformer
Abstract: We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have been previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.

Title: SCRUBD: Smart Contracts Reentrancy and Unhandled Exceptions Vulnerability Dataset

Authors: Chavhan Sujeet Yashavant, MitrajSinh Chavda, Saurabh Kumar, Amey Karkare, Angshuman Karmakar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.09935
Pdf URL: https://arxiv.org/pdf/2412.09935
Copy Paste: [[2412.09935]] SCRUBD: Smart Contracts Reentrancy and Unhandled Exceptions Vulnerability Dataset(https://arxiv.org/abs/2412.09935)
Keywords: attack, steal
Abstract: Smart Contracts (SCs) handle transactions in the Ethereum blockchain worth millions of United States dollars, making them a lucrative target for attackers seeking to exploit vulnerabilities and steal funds. The Ethereum community has developed a rich set of tools to detect vulnerabilities in SCs, including reentrancy (RE) and unhandled exceptions (UX). A dataset of SCs labelled with vulnerabilities is needed to evaluate the tools' efficacy. Existing SC datasets with labelled vulnerabilities have limitations, such as covering only a limited range of vulnerability scenarios and containing incorrect labels. As a result, there is a lack of a standardized dataset to compare the performances of these tools. SCRUBD aims to fill this gap. We present a dataset of real-world SCs and synthesized SCs labelled with RE and UX. The real-world SC dataset is labelled through crowdsourcing, followed by manual inspection by an expert, and covers both RE and UX vulnerabilities. On the other hand, the synthesized dataset is carefully crafted to cover various RE scenarios only. Using SCRUBD we compared the performance of six popular vulnerability detection tools. Based on our study, we found that Slither outperforms other tools on a crowdsourced dataset in detecting RE vulnerabilities, while Sailfish outperforms other tools on a manually synthesized dataset for detecting RE. For UX vulnerabilities, Slither outperforms all other tools.

Title: CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models

Authors: Dongyu Yao, Keling Yao, Junhong Zhou, Yinghao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09936
Pdf URL: https://arxiv.org/pdf/2412.09936
Copy Paste: [[2412.09936]] CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models(https://arxiv.org/abs/2412.09936)
Keywords: robust
Abstract: The obesity phenomenon, known as the heavy issue, is a leading cause of preventable chronic diseases worldwide. Traditional calorie estimation tools often rely on specific data formats or complex pipelines, limiting their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled in understanding real-world contexts and enabling conversational interactions, making them ideal for downstream tasks such as ingredient analysis. However, applying VLMs to calorie estimation requires domain-specific data and alignment strategies. To this end, we curated CalData, a 330K image-text pair dataset tailored for ingredient recognition and calorie estimation, combining a large-scale recipe dataset with detailed nutritional instructions for robust vision-language training. Built upon this dataset, we present CaLoRAify, a novel VLM framework aligning ingredient recognition and calorie estimation via training with visual-text pairs. During inference, users only need a single monocular food image to estimate calories while retaining the flexibility of agent-based conversational interaction. With Low-rank Adaptation (LoRA) and Retrieve-augmented Generation (RAG) techniques, our system enhances the performance of foundational VLMs in the vertical domain of calorie estimation. Our code and data are fully open-sourced at this https URL.

Title: Enhancing Nursing and Elderly Care with Large Language Models: An AI-Driven Framework

Authors: Qiao Sun, Jiexin Xie, Nanyang Ye, Qinying Gu, Shijie Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09946
Pdf URL: https://arxiv.org/pdf/2412.09946
Copy Paste: [[2412.09946]] Enhancing Nursing and Elderly Care with Large Language Models: An AI-Driven Framework(https://arxiv.org/abs/2412.09946)
Keywords: large language model
Abstract: This paper explores the application of large language models (LLMs) in nursing and elderly care, focusing on AI-driven patient monitoring and interaction. We introduce a novel Chinese nursing dataset and implement incremental pre-training (IPT) and supervised fine-tuning (SFT) techniques to enhance LLM performance in specialized tasks. Using LangChain, we develop a dynamic nursing assistant capable of real-time care and personalized interventions. Experimental results demonstrate significant improvements, paving the way for AI-driven solutions to meet the growing demands of healthcare in aging populations.

Title: Towards Fair Graph Neural Networks via Graph Counterfactual without Sensitive Attributes

Authors: Xuemin Wang, Tianlong Gu, Xuguang Bao, Liang Chang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09947
Pdf URL: https://arxiv.org/pdf/2412.09947
Copy Paste: [[2412.09947]] Towards Fair Graph Neural Networks via Graph Counterfactual without Sensitive Attributes(https://arxiv.org/abs/2412.09947)
Keywords: privacy, fair
Abstract: Graph-structured data is ubiquitous in today's connected world, driving extensive research in graph analysis. Graph Neural Networks (GNNs) have shown great success in this field, leading to growing interest in developing fair GNNs for critical applications. However, most existing fair GNNs focus on statistical fairness notions, which may be insufficient when dealing with statistical anomalies. Hence, motivated by causal theory, there has been growing attention to mitigating root causes of unfairness utilizing graph counterfactuals. Unfortunately, existing methods for generating graph counterfactuals invariably require the sensitive attribute. Nevertheless, in many real-world applications, it is usually infeasible to obtain sensitive attributes due to privacy or legal issues, which challenge existing methods. In this paper, we propose a framework named Fairwos (improving Fairness without sensitive attributes). In particular, we first propose a mechanism to generate pseudo-sensitive attributes to remedy the problem of missing sensitive attributes, and then design a strategy for finding graph counterfactuals from the real dataset. To train fair GNNs, we propose a method to ensure that the embeddings from the original data are consistent with those from the graph counterfactuals, and dynamically adjust the weight of each pseudo-sensitive attribute to balance its contribution to fairness and utility. Furthermore, we theoretically demonstrate that minimizing the relation between these pseudo-sensitive attributes and the prediction can enable the fairness of GNNs. Experimental results on six real-world datasets show that our approach outperforms state-of-the-art methods in balancing utility and fairness.

Title: Llama 3 Meets MoE: Efficient Upcycling

Authors: Aditya Vavre, Ethan He, Dennis Liu, Zijie Yan, June Yang, Nima Tajbakhsh, Ashwath Aithal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09952
Pdf URL: https://arxiv.org/pdf/2412.09952
Copy Paste: [[2412.09952]] Llama 3 Meets MoE: Efficient Upcycling(https://arxiv.org/abs/2412.09952)
Keywords: large language model
Abstract: Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.

Title: $\textrm{A}^{\textrm{2}}$RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image Fusion

Authors: Jiawei Li, Hongwei Yu, Jiansheng Chen, Xinlong Ding, Jinlong Wang, Jinyuan Liu, Bochao Zou, Huimin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09954
Pdf URL: https://arxiv.org/pdf/2412.09954
Copy Paste: [[2412.09954]] $\textrm{A}^{\textrm{2}}$RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image Fusion(https://arxiv.org/abs/2412.09954)
Keywords: attack, robust, transformer
Abstract: Infrared and visible image fusion (IVIF) is a crucial technique for enhancing visual performance by integrating unique information from different modalities into one fused image. Exiting methods pay more attention to conducting fusion with undisturbed data, while overlooking the impact of deliberate interference on the effectiveness of fusion results. To investigate the robustness of fusion models, in this paper, we propose a novel adversarial attack resilient network, called $\textrm{A}^{\textrm{2}}$RNet. Specifically, we develop an adversarial paradigm with an anti-attack loss function to implement adversarial attacks and training. It is constructed based on the intrinsic nature of IVIF and provide a robust foundation for future research advancements. We adopt a Unet as the pipeline with a transformer-based defensive refinement module (DRM) under this paradigm, which guarantees fused image quality in a robust coarse-to-fine manner. Compared to previous works, our method mitigates the adverse effects of adversarial perturbations, consistently maintaining high-fidelity fusion results. Furthermore, the performance of downstream tasks can also be well maintained under adversarial attacks. Code is available at this https URL.

Title: Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization

Authors: Xinhao Zhong, Shuoyang Sun, Xulin Gu, Zhaoyang Xu, Yaowei Wang, Jianlong Wu, Bin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09959
Pdf URL: https://arxiv.org/pdf/2412.09959
Copy Paste: [[2412.09959]] Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization(https://arxiv.org/abs/2412.09959)
Keywords: diffusion
Abstract: Dataset distillation offers an efficient way to reduce memory and computational costs by optimizing a smaller dataset with performance comparable to the full-scale original. However, for large datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101), the extensive optimization space limits performance, reducing its practicality. Recent approaches employ pre-trained diffusion models to generate informative images directly, avoiding pixel-level optimization and achieving notable results. However, these methods often face challenges due to distribution shifts between pre-trained models and target datasets, along with the need for multiple distillation steps across varying settings. To address these issues, we propose a novel framework orthogonal to existing diffusion-based distillation methods, leveraging diffusion models for selection rather than generation. Our method starts by predicting noise generated by the diffusion model based on input images and text prompts (with or without label text), then calculates the corresponding loss for each pair. With the loss differences, we identify distinctive regions of the original images. Additionally, we perform intra-class clustering and ranking on selected patches to maintain diversity constraints. This streamlined framework enables a single-step distillation process, and extensive experiments demonstrate that our approach outperforms state-of-the-art methods across various metrics.

Title: END$^2$: Robust Dual-Decoder Watermarking Framework Against Non-Differentiable Distortions

Authors: Nan Sun, Han Fang, Yuxing Lu, Chengxin Zhao, Hefei Ling
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09960
Pdf URL: https://arxiv.org/pdf/2412.09960
Copy Paste: [[2412.09960]] END$^2$: Robust Dual-Decoder Watermarking Framework Against Non-Differentiable Distortions(https://arxiv.org/abs/2412.09960)
Keywords: robust, watermark
Abstract: DNN-based watermarking methods have rapidly advanced, with the ``Encoder-Noise Layer-Decoder'' (END) framework being the most widely used. To ensure end-to-end training, the noise layer in the framework must be differentiable. However, real-world distortions are often non-differentiable, leading to challenges in end-to-end training. Existing solutions only treat the distortion perturbation as additive noise, which does not fully integrate the effect of distortion in training. To better incorporate non-differentiable distortions into training, we propose a novel dual-decoder architecture (END$^2$). Unlike conventional END architecture, our method employs two structurally identical decoders: the Teacher Decoder, processing pure watermarked images, and the Student Decoder, handling distortion-perturbed images. The gradient is backpropagated only through the Teacher Decoder branch to optimize the encoder thus bypassing the problem of non-differentiability. To ensure resistance to arbitrary distortions, we enforce alignment of the two decoders' feature representations by maximizing the cosine similarity between their intermediate vectors on a hypersphere. Extensive experiments demonstrate that our scheme outperforms state-of-the-art algorithms under various non-differentiable distortions. Moreover, even without the differentiability constraint, our method surpasses baselines with a differentiable noise layer. Our approach is effective and easily implementable across all END architectures, enhancing practicality and generalizability.

Title: EP-CFG: Energy-Preserving Classifier-Free Guidance

Authors: Kai Zhang, Fujun Luan, Sai Bi, Jianming Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09966
Pdf URL: https://arxiv.org/pdf/2412.09966
Copy Paste: [[2412.09966]] EP-CFG: Energy-Preserving Classifier-Free Guidance(https://arxiv.org/abs/2412.09966)
Keywords: robust, diffusion
Abstract: Classifier-free guidance (CFG) is widely used in diffusion models but often introduces over-contrast and over-saturation artifacts at higher guidance strengths. We present EP-CFG (Energy-Preserving Classifier-Free Guidance), which addresses these issues by preserving the energy distribution of the conditional prediction during the guidance process. Our method simply rescales the energy of the guided output to match that of the conditional prediction at each denoising step, with an optional robust variant for improved artifact suppression. Through experiments, we show that EP-CFG maintains natural image quality and preserves details across guidance strengths while retaining CFG's semantic alignment benefits, all with minimal computational overhead.

Title: Efficient Large-Scale Traffic Forecasting with Transformers: A Spatial Data Management Perspective

Authors: Yuchen Fang, Yuxuan Liang, Bo Hui, Zezhi Shao, Liwei Deng, Xu Liu, Xinke Jiang, Kai Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09972
Pdf URL: https://arxiv.org/pdf/2412.09972
Copy Paste: [[2412.09972]] Efficient Large-Scale Traffic Forecasting with Transformers: A Spatial Data Management Perspective(https://arxiv.org/abs/2412.09972)
Keywords: interpretability, transformer
Abstract: Road traffic forecasting is crucial in real-world intelligent transportation scenarios like traffic dispatching and path planning in city management and personal traveling. Spatio-temporal graph neural networks (STGNNs) stand out as the mainstream solution in this task. Nevertheless, the quadratic complexity of remarkable dynamic spatial modeling-based STGNNs has become the bottleneck over large-scale traffic data. From the spatial data management perspective, we present a novel Transformer framework called PatchSTG to efficiently and dynamically model spatial dependencies for large-scale traffic forecasting with interpretability and fidelity. Specifically, we design a novel irregular spatial patching to reduce the number of points involved in the dynamic calculation of Transformer. The irregular spatial patching first utilizes the leaf K-dimensional tree (KDTree) to recursively partition irregularly distributed traffic points into leaf nodes with a small capacity, and then merges leaf nodes belonging to the same subtree into occupancy-equaled and non-overlapped patches through padding and backtracking. Based on the patched data, depth and breadth attention are used interchangeably in the encoder to dynamically learn local and global spatial knowledge from points in a patch and points with the same index of patches. Experimental results on four real world large-scale traffic datasets show that our PatchSTG achieves train speed and memory utilization improvements up to $10\times$ and $4\times$ with the state-of-the-art performance.

Title: SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints

Authors: Ziqi Sheng, Wei Lu, Xiangyang Luo, Jiantao Zhou, Xiaochun Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09981
Pdf URL: https://arxiv.org/pdf/2412.09981
Copy Paste: [[2412.09981]] SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints(https://arxiv.org/abs/2412.09981)
Keywords: protect, extraction
Abstract: Image forgery localization (IFL) is a crucial technique for preventing tampered image misuse and protecting social safety. However, due to the rapid development of image tampering technologies, extracting more comprehensive and accurate forgery clues remains an urgent challenge. To address these challenges, we introduce a novel information-theoretic IFL framework named SUMI-IFL that imposes sufficiency-view and minimality-view constraints on forgery feature representation. First, grounded in the theoretical analysis of mutual information, the sufficiency-view constraint is enforced on the feature extraction network to ensure that the latent forgery feature contains comprehensive forgery clues. Considering that forgery clues obtained from a single aspect alone may be incomplete, we construct the latent forgery feature by integrating several individual forgery features from multiple perspectives. Second, based on the information bottleneck, the minimality-view constraint is imposed on the feature reasoning network to achieve an accurate and concise forgery feature representation that counters the interference of task-unrelated features. Extensive experiments show the superior performance of SUMI-IFL to existing state-of-the-art methods, not only on in-dataset comparisons but also on cross-dataset comparisons.

Title: SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video

Authors: Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, Munchurl Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09982
Pdf URL: https://arxiv.org/pdf/2412.09982
Copy Paste: [[2412.09982]] SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video(https://arxiv.org/abs/2412.09982)
Keywords: robust
Abstract: Synthesizing novel views from in-the-wild monocular videos is challenging due to scene dynamics and the lack of multi-view cues. To address this, we propose SplineGS, a COLMAP-free dynamic 3D Gaussian Splatting (3DGS) framework for high-quality reconstruction and fast rendering from monocular videos. At its core is a novel Motion-Adaptive Spline (MAS) method, which represents continuous dynamic 3D Gaussian trajectories using cubic Hermite splines with a small number of control points. For MAS, we introduce a Motion-Adaptive Control points Pruning (MACP) method to model the deformation of each dynamic 3D Gaussian across varying motions, progressively pruning control points while maintaining dynamic modeling integrity. Additionally, we present a joint optimization strategy for camera parameter estimation and 3D Gaussian attributes, leveraging photometric and geometric consistency. This eliminates the need for Structure-from-Motion preprocessing and enhances SplineGS's robustness in real-world conditions. Experiments show that SplineGS significantly outperforms state-of-the-art methods in novel view synthesis quality for dynamic scenes from monocular videos, achieving thousands times faster rendering speed.

Title: Small Language Model as Data Prospector for Large Language Model

Authors: Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, Min Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09990
Pdf URL: https://arxiv.org/pdf/2412.09990
Copy Paste: [[2412.09990]] Small Language Model as Data Prospector for Large Language Model(https://arxiv.org/abs/2412.09990)
Keywords: large language model
Abstract: The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, \cite{li2023one} proposed \texttt{NUGGETS}, which identifies and selects high-quality quality data from a large dataset by identifying those individual instruction examples that can significantly improve the performance of different tasks after being learnt as one-shot instances. In this work, we propose \texttt{SuperNUGGETS}, an improved variant of \texttt{NUGGETS} optimised for efficiency and performance. Our \texttt{SuperNUGGETS} uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances and refines the predefined set of tests. The experimental results show that the performance of \texttt{SuperNUGGETS} only decreases by 1-2% compared to \texttt{NUGGETS}, but the efficiency can be increased by a factor of 58. Compared to the original \texttt{NUGGETS}, our \texttt{SuperNUGGETS} has a higher utility value due to the significantly lower resource consumption.

Title: A Comparative Study of LLMs, NMT Models, and Their Combination in Persian-English Idiom Translation

Authors: Sara Rezaeimanesh, Faezeh Hosseini, Yadollah Yaghoobzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09993
Pdf URL: https://arxiv.org/pdf/2412.09993
Copy Paste: [[2412.09993]] A Comparative Study of LLMs, NMT Models, and Their Combination in Persian-English Idiom Translation(https://arxiv.org/abs/2412.09993)
Keywords: large language model
Abstract: Large language models (LLMs) have shown superior capabilities in translating figurative language compared to neural machine translation (NMT) systems. However, the impact of different prompting methods and LLM-NMT combinations on idiom translation has yet to be thoroughly investigated. This paper introduces two parallel datasets of sentences containing idiomatic expressions for Persian$\rightarrow$English and English$\rightarrow$Persian translations, with Persian idioms sampled from our PersianIdioms resource, a collection of 2,200 idioms and their meanings. Using these datasets, we evaluate various open- and closed-source LLMs, NMT models, and their combinations. Translation quality is assessed through idiom translation accuracy and fluency. We also find that automatic evaluation methods like LLM-as-a-judge, BLEU and BERTScore are effective for comparing different aspects of model performance. Our experiments reveal that Claude-3.5-Sonnet delivers outstanding results in both translation directions. For English$\rightarrow$Persian, combining weaker LLMs with Google Translate improves results, while Persian$\rightarrow$English translations benefit from single prompts for simpler models and complex prompts for advanced ones.

Title: Mr. DETR: Instructive Multi-Route Training for Detection Transformers

Authors: Chang-Bin Zhang, Yujie Zhong, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10028
Pdf URL: https://arxiv.org/pdf/2412.10028
Copy Paste: [[2412.10028]] Mr. DETR: Instructive Multi-Route Training for Detection Transformers(https://arxiv.org/abs/2412.10028)
Keywords: transformer
Abstract: Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and feed-forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi-route training mechanism, featuring a primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. We enhance the training mechanism with a novel instructive self-attention that dynamically and flexibly guides object queries for one-to-many prediction. The auxiliary routes are removed during inference, ensuring no impact on model architecture or inference cost. We conduct extensive experiments on various baselines, achieving consistent improvements as shown in Figure 1.

Title: Object-Focused Data Selection for Dense Prediction Tasks

Authors: Niclas Popp, Dan Zhang, Jan Hendrik Metzen, Matthias Hein, Lukas Schott
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10032
Pdf URL: https://arxiv.org/pdf/2412.10032
Copy Paste: [[2412.10032]] Object-Focused Data Selection for Dense Prediction Tasks(https://arxiv.org/abs/2412.10032)
Keywords: segmentation
Abstract: Dense prediction tasks such as object detection and segmentation require high-quality labels at pixel level, which are costly to obtain. Recent advances in foundation models have enabled the generation of autolabels, which we find to be competitive but not yet sufficient to fully replace human annotations, especially for more complex datasets. Thus, we consider the challenge of selecting a representative subset of images for labeling from a large pool of unlabeled images under a constrained annotation budget. This task is further complicated by imbalanced class distributions, as rare classes are often underrepresented in selected subsets. We propose object-focused data selection (OFDS) which leverages object-level representations to ensure that the selected image subsets semantically cover the target classes, including rare ones. We validate OFDS on PASCAL VOC and Cityscapes for object detection and semantic segmentation tasks. Our experiments demonstrate that prior methods which employ image-level representations fail to consistently outperform random selection. In contrast, OFDS consistently achieves state-of-the-art performance with substantial improvements over all baselines in scenarios with imbalanced class distributions. Moreover, we demonstrate that pre-training with autolabels on the full datasets before fine-tuning on human-labeled subsets selected by OFDS further enhances the final performance.

Title: Timealign: A multi-modal object detection method for time misalignment fusing in autonomous driving

Authors: Zhihang Song, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10033
Pdf URL: https://arxiv.org/pdf/2412.10033
Copy Paste: [[2412.10033]] Timealign: A multi-modal object detection method for time misalignment fusing in autonomous driving(https://arxiv.org/abs/2412.10033)
Keywords: robust
Abstract: The multi-modal perception methods are thriving in the autonomous driving field due to their better usage of complementary data from different sensors. Such methods depend on calibration and synchronization between sensors to get accurate environmental information. There have already been studies about space-alignment robustness in autonomous driving object detection process, however, the research for time-alignment is relatively few. As in reality experiments, LiDAR point clouds are more challenging for real-time data transfer, our study used historical frames of LiDAR to better align features when the LiDAR data lags exist. We designed a Timealign module to predict and combine LiDAR features with observation to tackle such time misalignment based on SOTA GraphBEV framework.

Title: SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution

Authors: Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10049
Pdf URL: https://arxiv.org/pdf/2412.10049
Copy Paste: [[2412.10049]] SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution(https://arxiv.org/abs/2412.10049)
Keywords: protect, attack, robust, extraction, watermark, diffusion
Abstract: In today's digital landscape, the blending of AI-generated and authentic content has underscored the need for copyright protection and content authentication. Watermarking has become a vital tool to address these challenges, safeguarding both generated and real content. Effective watermarking methods must withstand various distortions and attacks. Current deep watermarking techniques often use an encoder-noise layer-decoder architecture and include distortions to enhance robustness. However, they struggle to balance robustness and fidelity and remain vulnerable to adaptive attacks, despite extensive training. To overcome these limitations, we propose SuperMark, a robust, training-free watermarking framework. Inspired by the parallels between watermark embedding/extraction in watermarking and the denoising/noising processes in diffusion models, SuperMark embeds the watermark into initial Gaussian noise using existing techniques. It then applies pre-trained Super-Resolution (SR) models to denoise the watermarked noise, producing the final watermarked image. For extraction, the process is reversed: the watermarked image is inverted back to the initial watermarked noise via DDIM Inversion, from which the embedded watermark is extracted. This flexible framework supports various noise injection methods and diffusion-based SR models, enabling enhanced customization. The robustness of the DDIM Inversion process against perturbations allows SuperMark to achieve strong resilience to distortions while maintaining high fidelity. Experiments demonstrate that SuperMark achieves fidelity comparable to existing methods while significantly improving robustness. Under standard distortions, it achieves an average watermark extraction accuracy of 99.46%, and 89.29% under adaptive attacks. Moreover, SuperMark shows strong transferability across datasets, SR models, embedding methods, and resolutions.

Title: TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views

Authors: Liang Zhao, Zehan Bao, Yi Xie, Hong Chen, Yaohui Chen, Weifu Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10051
Pdf URL: https://arxiv.org/pdf/2412.10051
Copy Paste: [[2412.10051]] TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views(https://arxiv.org/abs/2412.10051)
Keywords: segmentation
Abstract: Recent advances in Gaussian Splatting have significantly advanced the field, achieving both panoptic and interactive segmentation of 3D scenes. However, existing methodologies often overlook the critical need for reconstructing specified targets with complex structures from sparse views. To address this issue, we introduce TSGaussian, a novel framework that combines semantic constraints with depth priors to avoid geometry degradation in challenging novel view synthesis tasks. Our approach prioritizes computational resources on designated targets while minimizing background allocation. Bounding boxes from YOLOv9 serve as prompts for Segment Anything Model to generate 2D mask predictions, ensuring semantic accuracy and cost efficiency. TSGaussian effectively clusters 3D gaussians by introducing a compact identity encoding for each Gaussian ellipsoid and incorporating 3D spatial consistency regularization. Leveraging these modules, we propose a pruning strategy to effectively reduce redundancy in 3D gaussians. Extensive experiments demonstrate that TSGaussian outperforms state-of-the-art methods on three standard datasets and a new challenging dataset we collected, achieving superior results in novel view synthesis of specific objects. Code is available at: this https URL.

Title: Unsupervised Named Entity Disambiguation for Low Resource Domains

Authors: Debarghya Datta, Soumajit Pramanik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10054
Pdf URL: https://arxiv.org/pdf/2412.10054
Copy Paste: [[2412.10054]] Unsupervised Named Entity Disambiguation for Low Resource Domains(https://arxiv.org/abs/2412.10054)
Keywords: robust
Abstract: In the ever-evolving landscape of natural language processing and information retrieval, the need for robust and domain-specific entity linking algorithms has become increasingly apparent. It is crucial in a considerable number of fields such as humanities, technical writing and biomedical sciences to enrich texts with semantics and discover more knowledge. The use of Named Entity Disambiguation (NED) in such domains requires handling noisy texts, low resource settings and domain-specific KBs. Existing approaches are mostly inappropriate for such scenarios, as they either depend on training data or are not flexible enough to work with domain-specific KBs. Thus in this work, we present an unsupervised approach leveraging the concept of Group Steiner Trees (GST), which can identify the most relevant candidates for entity disambiguation using the contextual similarities across candidate entities for all the mentions present in a document. We outperform the state-of-the-art unsupervised methods by more than 40\% (in avg.) in terms of Precision@1 across various domain-specific datasets.

Title: GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

Authors: Zhikai Lei, Tianyi Liang, Hanglei Hu, Jin Zhang, Yunhua Zhou, Yunfan Shao, Linyang Li, Chenchui Li, Changbo Wang, Hang Yan, Qipeng Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10056
Pdf URL: https://arxiv.org/pdf/2412.10056
Copy Paste: [[2412.10056]] GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?(https://arxiv.org/abs/2412.10056)
Keywords: large language model
Abstract: Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may ``game" these benchmarks due to data leakage, achieving high scores while struggling with tasks simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct ``closed-book" evaluations for representative models released prior to Gaokao. Contrary to prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, We introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalous consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, We identified inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns. we find that the phenomenons are well-grounded in the motivations behind OpenAI o1, and o1's reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.

Title: Quaffure: Real-Time Quasi-Static Neural Hair Simulation

Authors: Tuur Stuyck, Gene Wei-Chin Lin, Egor Larionov, Hsiao-yu Chen, Aljaz Bozic, Nikolaos Sarafianos, Doug Roble
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10061
Pdf URL: https://arxiv.org/pdf/2412.10061
Copy Paste: [[2412.10061]] Quaffure: Real-Time Quasi-Static Neural Hair Simulation(https://arxiv.org/abs/2412.10061)
Keywords: robust
Abstract: Realistic hair motion is crucial for high-quality avatars, but it is often limited by the computational resources available for real-time applications. To address this challenge, we propose a novel neural approach to predict physically plausible hair deformations that generalizes to various body poses, shapes, and hairstyles. Our model is trained using a self-supervised loss, eliminating the need for expensive data generation and storage. We demonstrate our method's effectiveness through numerous results across a wide range of pose and shape variations, showcasing its robust generalization capabilities and temporally smooth results. Our approach is highly suitable for real-time applications with an inference time of only a few milliseconds on consumer hardware and its ability to scale to predicting the drape of 1000 grooms in 0.3 seconds.

Title: Text2Cypher: Bridging Natural Language and Graph Databases

Authors: Makbule Gulcin Ozsoy, Leila Messallem, Jon Besga, Gianandrea Minneci
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10064
Pdf URL: https://arxiv.org/pdf/2412.10064
Copy Paste: [[2412.10064]] Text2Cypher: Bridging Natural Language and Graph Databases(https://arxiv.org/abs/2412.10064)
Keywords: large language model
Abstract: Knowledge graphs use nodes, relationships, and properties to represent arbitrarily complex data. When stored in a graph database, the Cypher query language enables efficient modeling and querying of knowledge graphs. However, using Cypher requires specialized knowledge, which can present a challenge for non-expert users. Our work Text2Cypher aims to bridge this gap by translating natural language queries into Cypher query language and extending the utility of knowledge graphs to non-technical expert users. While large language models (LLMs) can be used for this purpose, they often struggle to capture complex nuances, resulting in incomplete or incorrect outputs. Fine-tuning LLMs on domain-specific datasets has proven to be a more promising approach, but the limited availability of high-quality, publicly available Text2Cypher datasets makes this challenging. In this work, we show how we combined, cleaned and organized several publicly available datasets into a total of 44,387 instances, enabling effective fine-tuning and evaluation. Models fine-tuned on this dataset showed significant performance gains, with improvements in Google-BLEU and Exact Match scores over baseline models, highlighting the importance of high-quality datasets and fine-tuning in improving Text2Cypher performance.

Title: Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA

Authors: George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E Hunter, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10079
Pdf URL: https://arxiv.org/pdf/2412.10079
Copy Paste: [[2412.10079]] Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA(https://arxiv.org/abs/2412.10079)
Keywords: extraction
Abstract: Previous work finds that recent long-context language models fail to make equal use of information in the middle of their inputs, preferring pieces of information located at the tail ends which creates an undue bias in situations where we would like models to be equally capable of using different parts of the input. Thus far, the problem has mainly only been considered in settings with single pieces of critical information, leading us to question what happens when multiple necessary pieces of information are spread out over the inputs. Here, we demonstrate the effects of the "lost in the middle" problem in the multi-hop question answering setting -- in which multiple reasoning "hops" over disconnected documents are required -- and show that performance degrades not only with respect to the distance of information from the edges of the context, but also between pieces of information. Additionally, we experiment with means of alleviating the problem by reducing superfluous document contents through knowledge graph triple extraction and summarization, and prompting models to reason more thoroughly using chain-of-thought prompting.

Title: RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector

Authors: Zhensheng Wang, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10104
Pdf URL: https://arxiv.org/pdf/2412.10104
Copy Paste: [[2412.10104]] RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector(https://arxiv.org/abs/2412.10104)
Keywords: large language model
Abstract: The real estate market relies heavily on structured data, such as property details, market trends, and price fluctuations. However, the lack of specialized Tabular Question Answering datasets in this domain limits the development of automated question-answering systems. To fill this gap, we introduce RETQA, the first large-scale open-domain Chinese Tabular Question Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762 question-answer pairs across 16 sub-fields within three major domains: property information, real estate company finance information and land auction information. Compared with existing tabular question answering datasets, RETQA poses greater challenges due to three key factors: long-table structures, open-domain retrieval, and multi-domain queries. To tackle these challenges, we propose the SLUTQA framework, which integrates large language models with spoken language understanding tasks to enhance retrieval and answering accuracy. Extensive experiments demonstrate that SLUTQA significantly improves the performance of large language models on RETQA by in-context learning. RETQA and SLUTQA provide essential resources for advancing tabular question answering research in the real estate domain, addressing critical challenges in open-domain and long-table question-answering. The dataset and code are publicly available at \url{this https URL}.

Title: The Art of Deception: Color Visual Illusions and Diffusion Models

Authors: Alex Gomez-Villa, Kai Wang, Alejandro C. Parraga, Bartlomiej Twardowski, Jesus Malo, Javier Vazquez-Corral, Joost van de Weijer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10122
Pdf URL: https://arxiv.org/pdf/2412.10122
Copy Paste: [[2412.10122]] The Art of Deception: Color Visual Illusions and Diffusion Models(https://arxiv.org/abs/2412.10122)
Keywords: diffusion
Abstract: Visual illusions in humans arise when interpreting out-of-distribution stimuli: if the observer is adapted to certain statistics, perception of outliers deviates from reality. Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions. This revelation raises profound questions about the nature of visual information. Why are two independent systems, both human brains and ANNs, susceptible to the same illusions? Should any ANN be capable of perceiving visual illusions? Are these perceptions a feature or a flaw? In this work, we study how visual illusions are encoded in diffusion models. Remarkably, we show that they present human-like brightness/color shifts in their latent space. We use this fact to demonstrate that diffusion models can predict visual illusions. Furthermore, we also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models. We validate this ability through psychophysical experiments that show how our model-generated illusions also fool humans.

Title: Feature Selection for Latent Factor Models

Authors: Rittwika Kansabanik, Adrian Barbu
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2412.10128
Pdf URL: https://arxiv.org/pdf/2412.10128
Copy Paste: [[2412.10128]] Feature Selection for Latent Factor Models(https://arxiv.org/abs/2412.10128)
Keywords: generative
Abstract: Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach has theoretical true feature recovery guarantees under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.

Title: ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers

Authors: Junyan Hu, Xue Xiao, Mengqi Zhang, Xiao Chen, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10135
Pdf URL: https://arxiv.org/pdf/2412.10135
Copy Paste: [[2412.10135]] ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers(https://arxiv.org/abs/2412.10135)
Keywords: large language model
Abstract: As large language models (LLMs) grow in size, traditional full fine-tuning becomes increasingly impractical due to its high computational and storage costs. Although popular parameter-efficient fine-tuning methods, such as LoRA, have significantly reduced the number of tunable parameters, there is still room for further optimization. In this work, we propose ASLoRA, a cross-layer parameter-sharing strategy combining global sharing with partial adaptive sharing. Specifically, we share the low-rank matrix A across all layers and adaptively merge matrix B during training. This sharing mechanism not only mitigates overfitting effectively but also captures inter-layer dependencies, significantly enhancing the model's representational capability. We conduct extensive experiments on various NLP tasks, showing that ASLoRA outperforms LoRA while using less than 25% of the parameters, highlighting its flexibility and superior parameter efficiency. Furthermore, in-depth analyses of the adaptive sharing strategy confirm its significant advantages in enhancing both model flexibility and task adaptability.

Title: Can LLMs Convert Graphs to Text-Attributed Graphs?

Authors: Zehong Wang, Sidney Liu, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10136
Pdf URL: https://arxiv.org/pdf/2412.10136
Copy Paste: [[2412.10136]] Can LLMs Convert Graphs to Text-Attributed Graphs?(https://arxiv.org/abs/2412.10136)
Keywords: large language model
Abstract: Graphs are ubiquitous data structures found in numerous real-world applications, such as drug discovery, recommender systems, and social network analysis. Graph neural networks (GNNs) have become a popular tool to learn node embeddings through message passing on these structures. However, a significant challenge arises when applying GNNs to multiple graphs with different feature spaces, as existing GNN architectures are not designed for cross-graph feature alignment. To address this, recent approaches introduce text-attributed graphs, where each node is associated with a textual description, enabling the use of a shared textual encoder to project nodes from different graphs into a unified feature space. While promising, this method relies heavily on the availability of text-attributed data, which can be difficult to obtain in practice. To bridge this gap, we propose a novel method named Topology-Aware Node description Synthesis (TANS), which leverages large language models (LLMs) to automatically convert existing graphs into text-attributed graphs. The key idea is to integrate topological information with each node's properties, enhancing the LLMs' ability to explain how graph topology influences node semantics. We evaluate our TANS on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. Notably, on text-free graphs, our method significantly outperforms existing approaches that manually design node features, showcasing the potential of LLMs for preprocessing graph-structured data, even in the absence of textual information. The code and data are available at this https URL.

Title: ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10138
Pdf URL: https://arxiv.org/pdf/2412.10138
Copy Paste: [[2412.10138]] ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL(https://arxiv.org/abs/2412.10138)
Keywords: robust, large language model
Abstract: Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model's understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

Title: UN-DETR: Promoting Objectness Learning via Joint Supervision for Unknown Object Detection

Authors: Haomiao Liu, Hao Xu, Chuhuai Yue, Bo Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10176
Pdf URL: https://arxiv.org/pdf/2412.10176
Copy Paste: [[2412.10176]] UN-DETR: Promoting Objectness Learning via Joint Supervision for Unknown Object Detection(https://arxiv.org/abs/2412.10176)
Keywords: transformer
Abstract: Unknown Object Detection (UOD) aims to identify objects of unseen categories, differing from the traditional detection paradigm limited by the closed-world assumption. A key component of UOD is learning a generalized representation, i.e. objectness for both known and unknown categories to distinguish and localize objects from the background in a class-agnostic manner. However, previous methods obtain supervision signals for learning objectness in isolation from either localization or classification information, leading to poor performance for UOD. To address this issue, we propose a transformer-based UOD framework, UN-DETR. Based on this, we craft Instance Presence Score (IPS) to represent the probability of an object's presence. For the purpose of information complementarity, IPS employs a strategy of joint supervised learning, integrating attributes representing general objectness from the positional and the categorical latent space as supervision signals. To enhance IPS learning, we introduce a one-to-many assignment strategy to incorporate more supervision. Then, we propose Unbiased Query Selection to provide premium initial query vectors for the decoder. Additionally, we propose an IPS-guided post-process strategy to filter redundant boxes and correct classification predictions for known and unknown objects. Finally, we pretrain the entire UN-DETR in an unsupervised manner, in order to obtain objectness prior. Our UN-DETR is comprehensively evaluated on multiple UOD and known detection benchmarks, demonstrating its effectiveness and achieving state-of-the-art performance.

Title: SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

Authors: Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10178
Pdf URL: https://arxiv.org/pdf/2412.10178
Copy Paste: [[2412.10178]] SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models(https://arxiv.org/abs/2412.10178)
Keywords: diffusion
Abstract: Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. While significant advances have been made in image-based virtual try-ons, extending these successes to video often results in frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequence. To address these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we introduce ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the \dataname~dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments show that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. Data and code are available at this https URL

Title: Ultra-High Resolution Segmentation via Boundary-Enhanced Patch-Merging Transformer

Authors: Haopeng Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10181
Pdf URL: https://arxiv.org/pdf/2412.10181
Copy Paste: [[2412.10181]] Ultra-High Resolution Segmentation via Boundary-Enhanced Patch-Merging Transformer(https://arxiv.org/abs/2412.10181)
Keywords: transformer, segmentation
Abstract: Segmentation of ultra-high resolution (UHR) images is a critical task with numerous applications, yet it poses significant challenges due to high spatial resolution and rich fine details. Recent approaches adopt a dual-branch architecture, where a global branch learns long-range contextual information and a local branch captures fine details. However, they struggle to handle the conflict between global and local information while adding significant extra computational cost. Inspired by the human visual system's ability to rapidly orient attention to important areas with fine details and filter out irrelevant information, we propose a novel UHR segmentation method called Boundary-enhanced Patch-merging Transformer (BPT). BPT consists of two key components: (1) Patch-Merging Transformer (PMT) for dynamically allocating tokens to informative regions to acquire global and local representations, and (2) Boundary-Enhanced Module (BEM) that leverages boundary information to enrich fine details. Extensive experiments on multiple UHR image segmentation benchmarks demonstrate that our BPT outperforms previous state-of-the-art methods without introducing extra computational overhead. Codes will be released to facilitate research.

Title: BiCert: A Bilinear Mixed Integer Programming Formulation for Precise Certified Bounds Against Data Poisoning Attacks

Authors: Tobias Lorenz, Marta Kwiatkowska, Mario Fritz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10186
Pdf URL: https://arxiv.org/pdf/2412.10186
Copy Paste: [[2412.10186]] BiCert: A Bilinear Mixed Integer Programming Formulation for Precise Certified Bounds Against Data Poisoning Attacks(https://arxiv.org/abs/2412.10186)
Keywords: defense, attack, robust
Abstract: Data poisoning attacks pose one of the biggest threats to modern AI systems, necessitating robust defenses. While extensive efforts have been made to develop empirical defenses, attackers continue to evolve, creating sophisticated methods to circumvent these measures. To address this, we must move beyond empirical defenses and establish provable certification methods that guarantee robustness. This paper introduces a novel certification approach, BiCert, using Bilinear Mixed Integer Programming (BMIP) to compute sound deterministic bounds that provide such provable robustness. Using BMIP, we compute the reachable set of parameters that could result from training with potentially manipulated data. A key element to make this computation feasible is to relax the reachable parameter set to a convex set between training iterations. At test time, this parameter set allows us to predict all possible outcomes, guaranteeing robustness. BiCert is more precise than previous methods, which rely solely on interval and polyhedral bounds. Crucially, our approach overcomes the fundamental limitation of prior approaches where parameter bounds could only grow, often uncontrollably. We show that BiCert's tighter bounds eliminate a key source of divergence issues, resulting in more stable training and higher certified accuracy.

Title: Simple Guidance Mechanisms for Discrete Diffusion Models

Authors: Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, Volodymyr Kuleshov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10193
Pdf URL: https://arxiv.org/pdf/2412.10193
Copy Paste: [[2412.10193]] Simple Guidance Mechanisms for Discrete Diffusion Models(https://arxiv.org/abs/2412.10193)
Keywords: diffusion
Abstract: Diffusion models for continuous data gained widespread adoption owing to their high quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.

Title: From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection

Authors: Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10198
Pdf URL: https://arxiv.org/pdf/2412.10198
Copy Paste: [[2412.10198]] From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection(https://arxiv.org/abs/2412.10198)
Keywords: secure, security, privacy, attack, robust, large language model
Abstract: Tool-calling has changed Large Language Model (LLM) applications by integrating external tools, significantly enhancing their functionality across diverse tasks. However, this integration also introduces new security vulnerabilities, particularly in the tool scheduling mechanisms of LLM, which have not been extensively studied. To fill this gap, we present ToolCommander, a novel framework designed to exploit vulnerabilities in LLM tool-calling systems through adversarial tool injection. Our framework employs a well-designed two-stage attack strategy. Firstly, it injects malicious tools to collect user queries, then dynamically updates the injected tools based on the stolen information to enhance subsequent attacks. These stages enable ToolCommander to execute privacy theft, launch denial-of-service attacks, and even manipulate business competition by triggering unscheduled tool-calling. Notably, the ASR reaches 91.67% for privacy theft and hits 100% for denial-of-service and unscheduled tool calling in certain cases. Our work demonstrates that these vulnerabilities can lead to severe consequences beyond simple misuse of tool-calling systems, underscoring the urgent need for robust defensive strategies to secure LLM Tool-calling systems.

Title: Integrative Analysis of Financial Market Sentiment Using CNN and GRU for Risk Prediction and Alert Systems

Authors: You Wu, Mengfang Sun, Hongye Zheng, Jinxin Hu, Yingbin Liang, Zhenghao Lin
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.10199
Pdf URL: https://arxiv.org/pdf/2412.10199
Copy Paste: [[2412.10199]] Integrative Analysis of Financial Market Sentiment Using CNN and GRU for Risk Prediction and Alert Systems(https://arxiv.org/abs/2412.10199)
Keywords: robust, extraction
Abstract: This document presents an in-depth examination of stock market sentiment through the integration of Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU), enabling precise risk alerts. The robust feature extraction capability of CNN is utilized to preprocess and analyze extensive network text data, identifying local features and patterns. The extracted feature sequences are then input into the GRU model to understand the progression of emotional states over time and their potential impact on future market sentiment and risk. This approach addresses the order dependence and long-term dependencies inherent in time series data, resulting in a detailed analysis of stock market sentiment and effective early warnings of future risks.

Title: Retrieval-Augmented Semantic Parsing: Using Large Language Models to Improve Generalization

Authors: Xiao Zhang, Qianru Meng, Johan Bos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10207
Pdf URL: https://arxiv.org/pdf/2412.10207
Copy Paste: [[2412.10207]] Retrieval-Augmented Semantic Parsing: Using Large Language Models to Improve Generalization(https://arxiv.org/abs/2412.10207)
Keywords: robust, large language model
Abstract: Open-domain semantic parsing remains a challenging task, as models often rely on heuristics and struggle to handle unseen concepts. In this paper, we investigate the potential of large language models (LLMs) for this task and introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external lexical knowledge into the parsing process. Our experiments not only show that LLMs outperform previous encoder-decoder baselines for semantic parsing, but that RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts. These findings highlight the promise of leveraging large language models and retrieval mechanisms for robust and open-domain semantic parsing.

Title: Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10208
Pdf URL: https://arxiv.org/pdf/2412.10208
Copy Paste: [[2412.10208]] Efficient Generative Modeling with Residual Vector Quantization-Based Tokens(https://arxiv.org/abs/2412.10208)
Keywords: diffusion, generative
Abstract: We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation} on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at this https URL

Title: GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion

Authors: Jiapeng Tang, Davide Davoli, Tobias Kirschstein, Liam Schoneveld, Matthias Niessner
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10209
Pdf URL: https://arxiv.org/pdf/2412.10209
Copy Paste: [[2412.10209]] GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion(https://arxiv.org/abs/2412.10209)
Keywords: diffusion
Abstract: We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms the previous state-of-the-art methods in novel view synthesis by a 5.34\% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.

Title: Learning Complex Non-Rigid Image Edits from Multimodal Conditioning

Authors: Nikolai Warner, Jack Kolb, Meera Hahn, Vighnesh Birodkar, Jonathan Huang, Irfan Essa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10219
Pdf URL: https://arxiv.org/pdf/2412.10219
Copy Paste: [[2412.10219]] Learning Complex Non-Rigid Image Edits from Multimodal Conditioning(https://arxiv.org/abs/2412.10219)
Keywords: robust, diffusion
Abstract: In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural looking images while being highly controllable with text and pose. To accomplish this we need to train on pairs of images, the first a reference image with the person, the second a "target image" showing the same person (with a different pose and possibly in a different background). Additionally we require a text caption describing the new pose relative to that in the reference image. In this paper we present a novel dataset following this criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially scenes where there is an interaction between persons and objects. Combining the weak supervision from noisy captions, with robust 2D pose improves the quality of person-object interactions.

Title: SPT: Sequence Prompt Transformer for Interactive Image Segmentation

Authors: Senlin Cheng, Haopeng Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10224
Pdf URL: https://arxiv.org/pdf/2412.10224
Copy Paste: [[2412.10224]] SPT: Sequence Prompt Transformer for Interactive Image Segmentation(https://arxiv.org/abs/2412.10224)
Keywords: transformer, segmentation
Abstract: Interactive segmentation aims to extract objects of interest from an image based on user-provided clicks. In real-world applications, there is often a need to segment a series of images featuring the same target object. However, existing methods typically process one image at a time, failing to consider the sequential nature of the images. To overcome this limitation, we propose a novel method called Sequence Prompt Transformer (SPT), the first to utilize sequential image information for interactive segmentation. Our model comprises two key components: (1) Sequence Prompt Transformer (SPT) for acquiring information from sequence of images, clicks and masks to improve accurate. (2) Top-k Prompt Selection (TPS) selects precise prompts for SPT to further enhance the segmentation effect. Additionally, we create the ADE20K-Seq benchmark to better evaluate model performance. We evaluate our approach on multiple benchmark datasets and show that our model surpasses state-of-the-art methods across all datasets.

Title: SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians

Authors: Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10231
Pdf URL: https://arxiv.org/pdf/2412.10231
Copy Paste: [[2412.10231]] SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians(https://arxiv.org/abs/2412.10231)
Keywords: segmentation
Abstract: 3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla Gaussian Splatting representation is mainly designed for view synthesis, more recent works investigated how to extend it with scene understanding and language features. However, existing methods lack a detailed comprehension of scenes, limiting their ability to segment and interpret complex structures. To this end, We introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of what we call Super-Gaussians. Super-Gaussians facilitate the distillation of 2D language features into 3D space. Through Super-Gaussians, our method enables high-dimensional language feature rendering without extreme increases in GPU memory. Extensive experiments demonstrate that SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.

Title: Efficient Continual Pre-training of LLMs for Low-resource Languages

Authors: Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, Niloy Ganguly
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10244
Pdf URL: https://arxiv.org/pdf/2412.10244
Copy Paste: [[2412.10244]] Efficient Continual Pre-training of LLMs for Low-resource Languages(https://arxiv.org/abs/2412.10244)
Keywords: large language model
Abstract: Open-source Large Language models (OsLLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extent of resource availability. For evaluation, we use IndicGenBench, a generation task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary size and offer insights across language families.

Title: Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Authors: Hazel Kim, Adel Bibi, Philip Torr, Yarin Gal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10246
Pdf URL: https://arxiv.org/pdf/2412.10246
Copy Paste: [[2412.10246]] Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts(https://arxiv.org/abs/2412.10246)
Keywords: robust, large language model
Abstract: Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel approach to detecting model hallucination through systematic analysis of information flow across model layers when processing inputs with insufficient or ambiguous context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I improves model reliability by immediately integrating with universal LLMs without additional training or architectural modifications.

Title: Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models

Authors: Harry J. Davies, Giorgos Iacovides, Danilo P. Mandic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10257
Pdf URL: https://arxiv.org/pdf/2412.10257
Copy Paste: [[2412.10257]] Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models(https://arxiv.org/abs/2412.10257)
Keywords: security, large language model
Abstract: The sheer scale of data required to train modern large language models (LLMs) poses significant risks, as models are likely to gain knowledge of sensitive topics such as bio-security, as well the ability to replicate copyrighted works. Methods designed to remove such knowledge must do so from all prompt directions, in a multi-lingual capacity and without degrading general model performance. To this end, we introduce the targeted angular reversal (TARS) method of knowledge removal from LLMs. The TARS method firstly leverages the LLM in combination with a detailed prompt to aggregate information about a selected concept in the internal representation space of the LLM. It then refines this approximate concept vector to trigger the concept token with high probability, by perturbing the approximate concept vector with noise and transforming it into token scores with the language model head. The feedforward weight vectors in the LLM which operate directly on the internal representation space, and have the highest cosine similarity with this targeting vector, are then replaced by a reversed targeting vector, thus limiting the ability of the concept to propagate through the model. The modularity of the TARS method allows for a sequential removal of concepts from Llama 3.1 8B, such as the famous literary detective Sherlock Holmes, and the planet Saturn. It is demonstrated that the probability of triggering target concepts can be reduced to 0.00 with as few as 1 TARS edit, whilst simultaneously removing the knowledge bi-directionally. Moreover, knowledge is shown to be removed across all languages despite only being targeted in English. Importantly, TARS has minimal impact on the general model capabilities, as after removing 5 diverse concepts in a modular fashion, there is minimal KL divergence in the next token probabilities of the LLM on large corpora of Wikipedia text (median of 0.002).

Title: MVQ:Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization

Authors: Shuaiting Li, Chengxuan Wang, Juncan Deng, Zeyu Wang, Zewen Ye, Zongsheng Wang, Haibin Shen, Kejie Huang
Subjects: cs.CV, cs.AR
Abstract URL: https://arxiv.org/abs/2412.10261
Pdf URL: https://arxiv.org/pdf/2412.10261
Copy Paste: [[2412.10261]] MVQ:Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization(https://arxiv.org/abs/2412.10261)
Keywords: segmentation
Abstract: Vector quantization(VQ) is a hardware-friendly DNN compression method that can reduce the storage cost and weight-loading datawidth of hardware accelerators. However, conventional VQ techniques lead to significant accuracy loss because the important weights are not well preserved. To tackle this problem, a novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords. At the algorithm level, our approach removes the less important weights through N:M pruning and then minimizes the vector clustering error between the remaining weights and codewords by the masked k-means algorithm. Only distances between the unpruned weights and the codewords are computed, which are then used to update the codewords. At the architecture level, our accelerator implements vector quantization on an EWS (Enhanced weight stationary) CNN accelerator and proposes a sparse systolic array design to maximize the benefits brought by masked vector quantization.\\ Our algorithm is validated on various models for image classification, object detection, and segmentation tasks. Experimental results demonstrate that MVQ not only outperforms conventional vector quantization methods at comparable compression ratios but also reduces FLOPs. Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3$\times$ and reduces the size of the systolic array by 55\% when compared with the base EWS accelerator. Compared to the previous sparse accelerators, MVQ achieves 1.73$\times$ higher energy efficiency.

Title: Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication

Authors: Alireza Furutanpey, Pantelis A. Frangoudis, Patrik Szabo, Schahram Dustdar
Subjects: cs.LG, cs.DC, cs.NI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.10265
Pdf URL: https://arxiv.org/pdf/2412.10265
Copy Paste: [[2412.10265]] Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication(https://arxiv.org/abs/2412.10265)
Keywords: security, attack, robust, generative
Abstract: This paper investigates the adversarial robustness of Deep Neural Networks (DNNs) using Information Bottleneck (IB) objectives for task-oriented communication systems. We empirically demonstrate that while IB-based approaches provide baseline resilience against attacks targeting downstream tasks, the reliance on generative models for task-oriented communication introduces new vulnerabilities. Through extensive experiments on several datasets, we analyze how bottleneck depth and task complexity influence adversarial robustness. Our key findings show that Shallow Variational Bottleneck Injection (SVBI) provides less adversarial robustness compared to Deep Variational Information Bottleneck (DVIB) approaches, with the gap widening for more complex tasks. Additionally, we reveal that IB-based objectives exhibit stronger robustness against attacks focusing on salient pixels with high intensity compared to those perturbing many pixels with lower intensity. Lastly, we demonstrate that task-oriented communication systems that rely on generative models to extract and recover salient information have an increased attack surface. The results highlight important security considerations for next-generation communication systems that leverage neural networks for goal-oriented compression.

Title: Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media

Authors: Jiaqing Yuan, Ruijie Xi, Munindar P. Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10266
Pdf URL: https://arxiv.org/pdf/2412.10266
Copy Paste: [[2412.10266]] Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media(https://arxiv.org/abs/2412.10266)
Keywords: robust, interpretability, generative, large language model
Abstract: Stance detection is crucial for fostering a human-centric Web by analyzing user-generated content to identify biases and harmful narratives that undermine trust. With the development of Large Language Models (LLMs), existing approaches treat stance detection as a classification problem, providing robust methodologies for modeling complex group interactions and advancing capabilities in natural language tasks. However, these methods often lack interpretability, limiting their ability to offer transparent and understandable justifications for predictions. This study adopts a generative approach, where stance predictions include explicit, interpretable rationales, and integrates them into smaller language models through single-task and multitask learning. We find that incorporating reasoning into stance detection enables the smaller model (FlanT5) to outperform GPT-3.5's zero-shot performance, achieving an improvement of up to 9.57%. Moreover, our results show that reasoning capabilities enhance multitask learning performance but may reduce effectiveness in single-task settings. Crucially, we demonstrate that faithful rationales improve rationale distillation into SLMs, advancing efforts to build interpretable, trustworthy systems for addressing discrimination, fostering trust, and promoting equitable engagement on social media.

Title: Benchmarking Linguistic Diversity of Large Language Models

Authors: Yanzhu Guo, Guokan Shang, Chloé Clavel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10271
Pdf URL: https://arxiv.org/pdf/2412.10271
Copy Paste: [[2412.10271]] Benchmarking Linguistic Diversity of Large Language Models(https://arxiv.org/abs/2412.10271)
Keywords: large language model
Abstract: The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We propose a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth case study for syntactic diversity. Finally, we analyze how different development and deployment choices impact the linguistic diversity of LLM outputs.

Title: Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry

Authors: Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10273
Pdf URL: https://arxiv.org/pdf/2412.10273
Copy Paste: [[2412.10273]] Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry(https://arxiv.org/abs/2412.10273)
Keywords: diffusion
Abstract: We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" models the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation in a multiview image format to coordinate the generation of multiple target views simultaneously. We facilitate correspondence between views by assuming fixed target camera poses relative to the source camera, and constructing a predictable distribution of geometric features per target. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats SoTA baselines such as CAT3D and One-2-3-45 on held-out objects from ObjaverseXL, as well as real-world objects ranging from Google Scanned Objects, Amazon Berkeley Objects, to the Digital Twin Catalog.

Title: TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Authors: Xingrui Wang, Xin Li, Yaosi Hu, Hanxin Zhu, Chen Hou, Cuiling Lan, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10275
Pdf URL: https://arxiv.org/pdf/2412.10275
Copy Paste: [[2412.10275]] TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation(https://arxiv.org/abs/2412.10275)
Keywords: diffusion
Abstract: Text-driven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure the consistency between the movement trajectory and the textual description. (ii) how to improve the subjective quality of generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise control and high-quality video generation based on textual-described motion for different objects. Concretely, we enable our TIV-Diffuion model to perceive the textual-described objects and their motion trajectory by incorporating the fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate the problems of object disappearance and misaligned objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning textual features with each object individually. Based on the above innovations, our TIV-Diffusion achieves state-of-the-art high-quality video generation compared with existing TI2V methods.

Title: One world, one opinion? The superstar effect in LLM responses

Authors: Sofie Goethals, Lauren Rhue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10281
Pdf URL: https://arxiv.org/pdf/2412.10281
Copy Paste: [[2412.10281]] One world, one opinion? The superstar effect in LLM responses(https://arxiv.org/abs/2412.10281)
Keywords: large language model
Abstract: As large language models (LLMs) are shaping the way information is shared and accessed online, their opinions have the potential to influence a wide audience. This study examines who the LLMs view as the most prominent figures across various fields, using prompts in ten different languages to explore the influence of linguistic diversity. Our findings reveal low diversity in responses, with a small number of figures dominating recognition across languages (also known as the "superstar effect"). These results highlight the risk of narrowing global knowledge representation when LLMs retrieve subjective information.

Title: Still "Talking About Large Language Models": Some Clarifications

Authors: Murray Shanahan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10291
Pdf URL: https://arxiv.org/pdf/2412.10291
Copy Paste: [[2412.10291]] Still "Talking About Large Language Models": Some Clarifications(https://arxiv.org/abs/2412.10291)
Keywords: large language model
Abstract: My paper "Talking About Large Language Models" has more than once been interpreted as advocating a reductionist stance towards large language models. But the paper was not intended that way, and I do not endorse such positions. This short note situates the paper in the context of a larger philosophical project that is concerned with the (mis)use of words rather than metaphysics, in the spirit of Wittgenstein's later writing.

Title: Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation

Authors: Yu-Jhe Li, Xinyang Zhang, Kun Wan, Lantao Yu, Ajinkya Kale, Xin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10292
Pdf URL: https://arxiv.org/pdf/2412.10292
Copy Paste: [[2412.10292]] Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation(https://arxiv.org/abs/2412.10292)
Keywords: segmentation
Abstract: We tackle the challenge of open-vocabulary segmentation, where we need to identify objects from a wide range of categories in different environments, using text prompts as our input. To overcome this challenge, existing methods often use multi-modal models like CLIP, which combine image and text features in a shared embedding space to bridge the gap between limited and extensive vocabulary recognition, resulting in a two-stage approach: In the first stage, a mask generator takes an input image to generate mask proposals, and the in the second stage the target mask is picked based on the query. However, the expected target mask may not exist in the generated mask proposals, which leads to an unexpected output mask. In our work, we propose a novel approach named Prompt-guided Mask Proposal (PMP) where the mask generator takes the input text prompts and generates masks guided by these prompts. Compared with mask proposals generated without input prompts, masks generated by PMP are better aligned with the input prompts. To realize PMP, we designed a cross-attention mechanism between text tokens and query tokens which is capable of generating prompt-guided mask proposals after each decoding. We combined our PMP with several existing works employing a query-based segmentation backbone and the experiments on five benchmark datasets demonstrate the effectiveness of this approach, showcasing significant improvements over the current two-stage models (1% ~ 3% absolute performance gain in terms of mIOU). The steady improvement in performance across these benchmarks indicates the effective generalization of our proposed lightweight prompt-aware method.

Title: Coherent 3D Scene Diffusion From a Single RGB Image

Authors: Manuel Dahnert, Angela Dai, Norman Müller, Matthias Nießner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10294
Pdf URL: https://arxiv.org/pdf/2412.10294
Copy Paste: [[2412.10294]] Coherent 3D Scene Diffusion From a Single RGB Image(https://arxiv.org/abs/2412.10294)
Keywords: diffusion, generative
Abstract: We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.

Title: BrushEdit: All-In-One Image Inpainting and Editing

Authors: Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Qiang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10316
Pdf URL: https://arxiv.org/pdf/2412.10316
Copy Paste: [[2412.10316]] BrushEdit: All-In-One Image Inpainting and Editing(https://arxiv.org/abs/2412.10316)
Keywords: diffusion, large language model
Abstract: Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.

Title: SCBench: A KV Cache-Centric Analysis of Long-Context Methods

Authors: Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10319
Pdf URL: https://arxiv.org/pdf/2412.10319
Copy Paste: [[2412.10319]] SCBench: A KV Cache-Centric Analysis of Long-Context Methods(https://arxiv.org/abs/2412.10319)
Keywords: robust
Abstract: Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate in single-request, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLMs inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench(SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cachecentric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache loading. Specifically, SCBench uses test examples with shared context, ranging 12 tasks with two shared context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation perform robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. this https URL.

Title: AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Authors: Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.10321
Pdf URL: https://arxiv.org/pdf/2412.10321
Copy Paste: [[2412.10321]] AdvPrefix: An Objective for Nuanced LLM Jailbreaks(https://arxiv.org/abs/2412.10321)
Keywords: attack, large language model
Abstract: Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, often resulting in incomplete or unrealistic responses, and a rigid format that hinders optimization. To address these limitations, we introduce AdvPrefix, a new prefix-forcing objective that enables more nuanced control over model behavior while being easy to optimize. Our objective leverages model-dependent prefixes, automatically selected based on two criteria: high prefilling attack success rates and low negative log-likelihood. It can further simplify optimization by using multiple prefixes for a single user request. AdvPrefix can integrate seamlessly into existing jailbreak attacks to improve their performance for free. For example, simply replacing GCG attack's target prefixes with ours on Llama-3 improves nuanced attack success rates from 14% to 80%, suggesting that current alignment struggles to generalize to unseen prefixes. Our work demonstrates the importance of jailbreak objectives in achieving nuanced jailbreaks.

Title: Generative AI in Medicine

Authors: Divya Shanmugam, Monica Agrawal, Rajiv Movva, Irene Y. Chen, Marzyeh Ghassemi, Emma Pierson
Subjects: cs.LG, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2412.10337
Pdf URL: https://arxiv.org/pdf/2412.10337
Copy Paste: [[2412.10337]] Generative AI in Medicine(https://arxiv.org/abs/2412.10337)
Keywords: security, privacy, interpretability, generative
Abstract: The increased capabilities of generative AI have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges -- including maintaining privacy and security, improving transparency and interpretability, upholding equity, and rigorously evaluating models -- which must be overcome to realize this potential, and the open research directions they give rise to.

Title: XYScanNet: An Interpretable State Space Model for Perceptual Image Deblurring

Authors: Hanzhou Liu, Chengkai Liu, Jiacong Xu, Peng Jiang, Mi Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10338
Pdf URL: https://arxiv.org/pdf/2412.10338
Copy Paste: [[2412.10338]] XYScanNet: An Interpretable State Space Model for Perceptual Image Deblurring(https://arxiv.org/abs/2412.10338)
Keywords: transformer
Abstract: Deep state-space models (SSMs), like recent Mamba architectures, are emerging as a promising alternative to CNN and Transformer networks. Existing Mamba-based restoration methods process the visual data by leveraging a flatten-and-scan strategy that converts image patches into a 1D sequence before scanning. However, this scanning paradigm ignores local pixel dependencies and introduces spatial misalignment by positioning distant pixels incorrectly adjacent, which reduces local noise-awareness and degrades image sharpness in low-level vision tasks. To overcome these issues, we propose a novel slice-and-scan strategy that alternates scanning along intra- and inter-slices. We further design a new Vision State Space Module (VSSM) for image deblurring, and tackle the inefficiency challenges of the current Mamba-based vision module. Building upon this, we develop XYScanNet, an SSM architecture integrated with a lightweight feature fusion module for enhanced image deblurring. XYScanNet, maintains competitive distortion metrics and significantly improves perceptual performance. Experimental results show that XYScanNet enhances KID by $17\%$ compared to the nearest competitor. Our code will be released soon.

Title: A Universal Degradation-based Bridging Technique for Domain Adaptive Semantic Segmentation

Authors: Wangkai Li, Rui Sun, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10339
Pdf URL: https://arxiv.org/pdf/2412.10339
Copy Paste: [[2412.10339]] A Universal Degradation-based Bridging Technique for Domain Adaptive Semantic Segmentation(https://arxiv.org/abs/2412.10339)
Keywords: diffusion, segmentation
Abstract: Semantic segmentation often suffers from significant performance degradation when the trained network is applied to a different domain. To address this issue, unsupervised domain adaptation (UDA) has been extensively studied. Existing methods introduce the domain bridging techniques to mitigate substantial domain gap, which construct intermediate domains to facilitate the gradual transfer of knowledge across different domains. However, these strategies often require dataset-specific designs and may generate unnatural intermediate distributions that lead to semantic shift. In this paper, we propose DiDA, a universal degradation-based bridging technique formalized as a diffusion forward process. DiDA consists of two key modules: (1) Degradation-based Intermediate Domain Construction, which creates continuous intermediate domains through simple image degradation operations to encourage learning domain-invariant features as domain differences gradually diminish; (2) Semantic Shift Compensation, which leverages a diffusion encoder to encode and compensate for semantic shift information with degraded time-steps, preserving discriminative representations in the intermediate domains. As a plug-and-play solution, DiDA supports various degradation operations and seamlessly integrates with existing UDA methods. Extensive experiments on prevalent synthetic-to-real semantic segmentation benchmarks demonstrate that DiDA consistently improves performance across different settings and achieves new state-of-the-art results when combined with existing methods.

Title: Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

Authors: Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10342
Pdf URL: https://arxiv.org/pdf/2412.10342
Copy Paste: [[2412.10342]] Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining(https://arxiv.org/abs/2412.10342)
Keywords: large language model
Abstract: Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). ISC dynamically identifies and prioritizes visually dense regions using a edge detection algorithm, enabling efficient processing by allocating more computational resources to areas with higher information density. SRDL enhances the agent's ability to handle complex tasks by leveraging a dual-learning loop, where improvements in referring (describing UI elements) reinforce grounding (locating elements) and vice versa, all without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using 10x more training data. These improvements further translate to significant gains in both web and OS agent downstream tasks.

Title: VibrantVS: A high-resolution multi-task transformer for forest canopy height estimation

Authors: Tony Chang, Kiarie Ndegwa, Andreas Gros, Vincent A. Landau, Luke J. Zachmann, Bogdan State, Mitchell A. Gritts, Colton W. Miller, Nathan E. Rutenbeck, Scott Conway, Guy Bayes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10351
Pdf URL: https://arxiv.org/pdf/2412.10351
Copy Paste: [[2412.10351]] VibrantVS: A high-resolution multi-task transformer for forest canopy height estimation(https://arxiv.org/abs/2412.10351)
Keywords: transformer
Abstract: This paper explores the application of a novel multi-task vision transformer (ViT) model for the estimation of canopy height models (CHMs) using 4-band National Agriculture Imagery Program (NAIP) imagery across the western United States. We compare the effectiveness of this model in terms of accuracy and precision aggregated across ecoregions and class heights versus three other benchmark peer-reviewed models. Key findings suggest that, while other benchmark models can provide high precision in localized areas, the VibrantVS model has substantial advantages across a broad reach of ecoregions in the western United States with higher accuracy, higher precision, the ability to generate updated inference at a cadence of three years or less, and high spatial resolution. The VibrantVS model provides significant value for ecological monitoring and land management decisions for wildfire mitigation.

Title: Robust image classification with multi-modal large language models

Authors: Francesco Villani, Igor Maljkovic, Dario Lazzaro, Angelo Sotgiu, Antonio Emanuele Cinà, Fabio Roli
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10353
Pdf URL: https://arxiv.org/pdf/2412.10353
Copy Paste: [[2412.10353]] Robust image classification with multi-modal large language models(https://arxiv.org/abs/2412.10353)
Keywords: defense, robust, large language model
Abstract: Deep Neural Networks are vulnerable to adversarial examples, i.e., carefully crafted input samples that can cause models to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. However, most of these approaches focus on a single data modality, overlooking the relationships between visual patterns and textual descriptions of the input. In this paper, we propose a novel defense, Multi-Shield, designed to combine and complement these defenses with multi-modal information to further enhance their robustness. Multi-Shield leverages multi-modal large language models to detect adversarial examples and abstain from uncertain classifications when there is no alignment between textual and visual representations of the input. Extensive evaluations on CIFAR-10 and ImageNet datasets, using robust and non-robust image classification models, demonstrate that Multi-Shield can be easily integrated to detect and reject adversarial examples, outperforming the original defenses.

Title: UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

Authors: Muhammad Uzair Khattak, Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10372
Pdf URL: https://arxiv.org/pdf/2412.10372
Copy Paste: [[2412.10372]] UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities(https://arxiv.org/abs/2412.10372)
Keywords: large language model
Abstract: Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs either train on closed-source proprietary or relatively small open-source datasets that do not generalize well. Similarly, most models remain specific to a single or limited number of medical imaging domains, again restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is developed using a data-collection framework that leverages Large Language Models (LLMs) to transform modality-specific classification datasets into image-text formats while incorporating existing image-text data from the medical domain, facilitating scalable VLM pretraining. Using UniMed, we trained UniMed-CLIP, a unified VLM for six modalities that significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, achieving notable gains in zero-shot evaluations. For instance, UniMed-CLIP improves over BiomedCLIP (trained on proprietary data) by an absolute gain of +12.61, averaged over 21 datasets, while using 3x less training data. To facilitate future research, we release UniMed dataset, training codes, and models at this https URL.