2025-01-03

Title: A Breadth-First Catalog of Text Processing, Speech Processing and Multimodal Research in South Asian Languages

Authors: Pranav Gupta
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00029
Pdf URL: https://arxiv.org/pdf/2501.00029
Copy Paste: [[2501.00029]] A Breadth-First Catalog of Text Processing, Speech Processing and Multimodal Research in South Asian Languages(https://arxiv.org/abs/2501.00029)
Keywords: large language model
Abstract: We review the recent literature (January 2022 to October 2024) in South Asian languages on text-based language processing, multimodal models, and speech processing, and provide a spotlight analysis focused on 21 low-resource South Asian languages, namely Saraiki, Assamese, Balochi, Bhojpuri, Bodo, Burmese, Chhattisgarhi, Dhivehi, Gujarati, Kannada, Kashmiri, Konkani, Khasi, Malayalam, Meitei, Nepali, Odia, Pashto, Rajasthani, Sindhi, and Telugu. We identify trends, challenges, and future research directions, using a step-wise approach that incorporates relevance classification and clustering based on large language models (LLMs). Our goal is to provide a breadth-first overview of the recent developments in South Asian language technologies to NLP researchers interested in working with South Asian languages.

Title: Distilling Large Language Models for Efficient Clinical Information Extraction

Authors: Karthik S. Vedula, Annika Gupta, Akshay Swaminathan, Ivan Lopez, Suhana Bedi, Nigam H. Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00031
Pdf URL: https://arxiv.org/pdf/2501.00031
Copy Paste: [[2501.00031]] Distilling Large Language Models for Efficient Clinical Information Extraction(https://arxiv.org/abs/2501.00031)
Keywords: extraction, large language model
Abstract: Large language models (LLMs) excel at clinical information extraction but their computational demands limit practical deployment. Knowledge distillation--the process of transferring knowledge from larger to smaller models--offers a potential solution. We evaluate the performance of distilled BERT models, which are approximately 1,000 times smaller than modern LLMs, for clinical named entity recognition (NER) tasks. We leveraged state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers for medication, disease, and symptom extraction. We applied our approach to over 3,300 clinical notes spanning five publicly available datasets, comparing distilled BERT models against both their teacher labelers and BERT models fine-tuned on human labels. External validation was conducted using clinical notes from the MedAlign dataset. For disease extraction, F1 scores were 0.82 (teacher model), 0.89 (BioBERT trained on human labels), and 0.84 (BioBERT-distilled). For medication, F1 scores were 0.84 (teacher model), 0.91 (BioBERT-human), and 0.87 (BioBERT-distilled). For symptoms: F1 score of 0.73 (teacher model) and 0.68 (BioBERT-distilled). Distilled BERT models had faster inference (12x, 4x, 8x faster than GPT-4o, o1-mini, and Gemini Flash respectively) and lower costs (85x, 101x, 2x cheaper than GPT-4o, o1-mini, and Gemini Flash respectively). On the external validation dataset, the distilled BERT model achieved F1 scores of 0.883 (medication), 0.726 (disease), and 0.699 (symptom). Distilled BERT models were up to 101x cheaper and 12x faster than state-of-the-art LLMs while achieving similar performance on NER tasks. Distillation offers a computationally efficient and scalable alternative to large LLMs for clinical information extraction.

Title: Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs

Authors: Dibakar Gope, David Mansell, Danny Loh, Ian Bratt
Subjects: cs.LG, cs.AI, cs.AR, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00032
Pdf URL: https://arxiv.org/pdf/2501.00032
Copy Paste: [[2501.00032]] Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs(https://arxiv.org/abs/2501.00032)
Keywords: large language model
Abstract: Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, which makes these formats unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. Together with an optimized interleaved group data layout for weights and decompression path optimizations that reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, this significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to this http URL-based solution. The optimized kernels are available at this https URL.

Title: Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications

Authors: Krisvarish V, Priyadarshini T, K P Abhishek Sri Saai, Vaidehi Vijayakumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00042
Pdf URL: https://arxiv.org/pdf/2501.00042
Copy Paste: [[2501.00042]] Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications(https://arxiv.org/abs/2501.00042)
Keywords: transformer
Abstract: This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model. Recently, new transformer architectures have been presented that focus on parameter efficiency and computational optimization; however, such models usually require considerable hardware resources when deployed in real-world applications on edge devices. Our approach addresses this concern by halving the embedding size and applying targeted techniques such as parameter pruning and quantization to optimize the memory footprint with minimal loss of accuracy. Experimental results include a 52% reduction in memory usage and a 33% decrease in execution time, resulting in better efficiency than state-of-the-art models. We compared our model with existing compelling architectures, such as MobileBERT and DistilBERT, and demonstrated its feasibility in the domain of resource-friendly deep learning architectures, mainly for real-time and resource-constrained applications.
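
The three techniques named in the abstract (a halved embedding size, parameter pruning, and quantization) can be illustrated on a single weight table. The following is a minimal sketch, not the authors' implementation: the table shape, the 50% sparsity level, the use of unstructured magnitude pruning, and symmetric per-tensor int8 quantization are all assumptions chosen for illustration, and slicing the table only stands in for training a model with a smaller embedding size. It does not reproduce the 52% figure reported above.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries (unstructured pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns integer codes and the scale."""
    scale = np.max(np.abs(weights)) / 127.0
    codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
full = rng.standard_normal((1000, 512)).astype(np.float32)  # hypothetical embedding table
halved = full[:, :256]                    # stand-in for a halved embedding size
pruned = magnitude_prune(halved, sparsity=0.5)
codes, scale = quantize_int8(pruned)
restored = codes.astype(np.float32) * scale  # dequantized weights

print("float32 bytes (full table):", full.nbytes)
print("int8 bytes (halved table):", codes.nbytes)
print("max abs quantization error:", float(np.max(np.abs(restored - pruned))))
```

In this toy setting the storage saving comes from two independent factors: half as many columns and one byte per value instead of four. Pruning by itself saves memory only if the zeros are stored in a sparse format, which the sketch does not do.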

Title: Learning in Multiple Spaces: Few-Shot Network Attack Detection with Metric-Fused Prototypical Networks

Authors: Fernando Martinez-Lopez, Lesther Santana, Mohamed Rahouti
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00050
Pdf URL: https://arxiv.org/pdf/2501.00050
Copy Paste: [[2501.00050]] Learning in Multiple Spaces: Few-Shot Network Attack Detection with Metric-Fused Prototypical Networks(https://arxiv.org/abs/2501.00050)
Keywords: attack, robust
Abstract: Network intrusion detection systems face significant challenges in identifying emerging attack patterns, especially when limited data samples are available. To address this, we propose a novel Multi-Space Prototypical Learning (MSPL) framework tailored for few-shot attack detection. The framework operates across multiple metric spaces -- Euclidean, Cosine, Chebyshev, and Wasserstein distances -- integrated through a constrained weighting scheme to enhance embedding robustness and improve pattern recognition. By leveraging Polyak-averaged prototype generation, the framework stabilizes the learning process and effectively adapts to rare and zero-day attacks. Additionally, an episodic training paradigm ensures balanced representation across diverse attack classes, enabling robust generalization. Experimental results on benchmark datasets demonstrate that MSPL outperforms traditional approaches in detecting low-profile and novel attack types, establishing it as a robust solution for zero-day attack detection.
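
To make the idea of fusing several distances around class prototypes concrete, here is a minimal sketch under stated assumptions; it is not the MSPL implementation. The softmax used to constrain the weights, the coordinate-wise 1-D form of the Wasserstein distance, the averaging coefficient `beta`, and the toy data are all hypothetical choices.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def chebyshev(a, b):
    return float(np.max(np.abs(a - b)))

def wasserstein_1d(a, b):
    # 1-Wasserstein distance between the empirical distributions of the
    # coordinates of a and b (equal length), computed from sorted values.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

METRICS = (euclidean, cosine_distance, chebyshev, wasserstein_1d)

def fusion_weights(logits):
    # Softmax keeps the weights non-negative and summing to 1 (assumed constraint).
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def fused_distance(query, prototype, weights):
    return sum(w * m(query, prototype) for w, m in zip(weights, METRICS))

def polyak_update(old_prototype, support_embeddings, beta=0.9):
    # Exponential moving average of the per-episode class mean.
    return beta * old_prototype + (1.0 - beta) * support_embeddings.mean(axis=0)

def classify(query, prototypes, weights):
    return int(np.argmin([fused_distance(query, p, weights) for p in prototypes]))

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 1.0, 3.0], [4.0, 4.0, 0.0, -2.0]])  # toy class centers
prototypes = centers.copy()
for _ in range(5):                      # five toy episodes, 3 support samples per class
    for c in range(2):
        support = centers[c] + 0.1 * rng.standard_normal((3, 4))
        prototypes[c] = polyak_update(prototypes[c], support)

weights = fusion_weights(np.zeros(len(METRICS)))  # equal weights before any training
query = centers[1] + 0.1 * rng.standard_normal(4)
print("predicted class:", classify(query, prototypes, weights))
```

In a trained system the weight logits would presumably be learned together with the embedding network; here they are fixed so that the example stays self-contained.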

Title: DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework

Authors: Yu-Zheng Lin, Qinxuan Shi, Zhanglong Yang, Banafsheh Saber Latibari, Sicong Shao, Soheil Salehi, Pratik Satam
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2501.00051
Pdf URL: https://arxiv.org/pdf/2501.00051
Copy Paste: [[2501.00051]] DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework(https://arxiv.org/abs/2501.00051)
Keywords: generative
Abstract: Digital twin (DT) technology has emerged as a transformative approach to simulate, predict, and optimize the behavior of physical systems, with applications that span manufacturing, healthcare, climate science, and more. However, the development of DT models often faces challenges such as high data requirements, integration complexity, and limited adaptability to dynamic changes in physical systems. This paper presents a new method inspired by dynamic data-driven applications systems (DDDAS), called the dynamic data-driven generative digital twin framework (DDD-GenDT), which couples the physical system with an LLM, allowing the LLM to act as a DT that interacts with the physical system's operating status and generates the corresponding physical behaviors. We apply DDD-GenDT to the computer numerical control (CNC) machining process, and we use the spindle current measurement data in the NASA milling wear data set as an example to enable LLMs to forecast the physical behavior from historical data and interact with current observations. Experimental results show that in the zero-shot prediction setting, the LLM-based DT can adapt to the change in the system, and the average RMSE of the GPT-4 prediction is 0.479 A, which is 4.79% of the maximum spindle motor current measurement of 10 A, with little training data and instructions required. Furthermore, we analyze the performance of DDD-GenDT in this specific application and its potential to construct digital twins. We also discuss the limitations and challenges that may arise in practical implementations.
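
The relative error quoted in the abstract is the RMSE divided by the full-scale current. The short sketch below checks that arithmetic and shows how an RMSE of this kind is computed; the toy readings are made up for illustration and do not come from the NASA milling data set.

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

def as_percent_of_range(error, full_scale):
    return 100.0 * error / full_scale

reported_rmse_amps = 0.479   # from the abstract
max_current_amps = 10.0      # from the abstract
relative_error = round(as_percent_of_range(reported_rmse_amps, max_current_amps), 2)
print("relative error (%):", relative_error)

# Toy usage with invented spindle-current readings in amperes.
toy_rmse = rmse([5.0, 5.5, 6.0], [5.2, 5.4, 6.3])
print("toy RMSE (A):", round(toy_rmse, 3))
```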

Title: AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors

Authors: Mengnan Zhao, Lihe Zhang, Xingyi Yang, Tianhang Zheng, Baocai Yin
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00054
Pdf URL: https://arxiv.org/pdf/2501.00054
Copy Paste: [[2501.00054]] AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors(https://arxiv.org/abs/2501.00054)
Keywords: security, diffusion
Abstract: Security concerns surrounding text-to-image diffusion models have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods typically align the prediction distributions of unsafe prompts with those of predefined text anchors. However, these techniques exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. In this paper, we systematically analyze the impact of diverse text anchors on unlearning performance. Guided by this analysis, we propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue. These adversarial anchors are crafted to closely resemble the embeddings of undesirable concepts to maintain overall model performance, while selectively excluding defining attributes of these concepts for effective erasure. Extensive experiments demonstrate that AdvAnchor outperforms state-of-the-art methods. Our code is publicly available at this https URL.

Title: LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models

Authors: Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00055
Pdf URL: https://arxiv.org/pdf/2501.00055
Copy Paste: [[2501.00055]] LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models(https://arxiv.org/abs/2501.00055)
Keywords: attack, large language model
Abstract: While safety-aligned large language models (LLMs) are increasingly used as the cornerstone for powerful systems such as multi-agent frameworks to solve complex real-world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLMs and make trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g., token-level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.

Title: VisTabNet: Adapting Vision Transformers for Tabular Data

Authors: Witold Wydmański, Ulvi Movsum-zada, Jacek Tabor, Marek Śmieja
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00057
Pdf URL: https://arxiv.org/pdf/2501.00057
Copy Paste: [[2501.00057]] VisTabNet: Adapting Vision Transformers for Tabular Data(https://arxiv.org/abs/2501.00057)
Keywords: transformer
Abstract: Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet -- a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings accepted by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (fewer than 1k samples) demonstrate VisTabNet's superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning.

Title: Large Language Models for Mathematical Analysis

Authors: Ziye Chen, Hao Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00059
Pdf URL: https://arxiv.org/pdf/2501.00059
Copy Paste: [[2501.00059]] Large Language Models for Mathematical Analysis(https://arxiv.org/abs/2501.00059)
Keywords: large language model
Abstract: Mathematical problem-solving is a key field in artificial intelligence (AI) and a critical benchmark for evaluating the capabilities of large language models (LLMs). While extensive research has focused on mathematical problem-solving, most existing work and datasets concentrate on computational tasks, leaving gaps in areas like mathematical analysis, which demands rigorous proofs and formal reasoning. We developed the DEMI-MathAnalysis dataset, comprising proof-based problems from mathematical analysis topics such as Sequences and Limits, Infinite Series, and Convex Functions. We also designed a guiding framework to rigorously enhance LLMs' ability to solve these problems. Through fine-tuning LLMs on this dataset and employing our framework, we observed significant improvements in their capability to generate logical, complete, and elegant proofs. This work addresses critical gaps in mathematical reasoning and contributes to advancing trustworthy AI capable of handling formalized mathematical language. The code is publicly accessible at LLMs for Mathematical Analysis.

Title: ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis

Authors: James P. Beno
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00062
Pdf URL: https://arxiv.org/pdf/2501.00062
Copy Paste: [[2501.00062]] ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis(https://arxiv.org/abs/2501.00062)
Keywords: transformer, large language model
Abstract: Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLMs) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.74 macro F1 vs. 79.29 ELECTRA Base FT, 79.52 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.77) at much less cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
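
The "76% less cost" statement follows from the two cost-per-F1-point figures given in the abstract. A small check of that arithmetic, using only numbers quoted above (the dictionary keys are just labels for this sketch):

```python
# USD per macro F1 point and macro F1 scores, as quoted in the abstract.
cost_per_f1 = {"GPT-4o FT-M": 1.59, "GPT-4o-mini FT": 0.38}
macro_f1 = {"GPT-4o FT-M": 86.99, "GPT-4o-mini FT": 86.77}

saving = 1.0 - cost_per_f1["GPT-4o-mini FT"] / cost_per_f1["GPT-4o FT-M"]
f1_gap = macro_f1["GPT-4o FT-M"] - macro_f1["GPT-4o-mini FT"]

print(f"cost reduction: {saving:.0%}")
print(f"macro F1 gap: {f1_gap:.2f}")
```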

Title: Generative Models for Financial Time Series Data: Enhancing Signal-to-Noise Ratio and Addressing Data Scarcity in A-Share Market

Authors: Guangming Che
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00063
Pdf URL: https://arxiv.org/pdf/2501.00063
Copy Paste: [[2501.00063]] Generative Models for Financial Time Series Data: Enhancing Signal-to-Noise Ratio and Addressing Data Scarcity in A-Share Market(https://arxiv.org/abs/2501.00063)
Keywords: robust, diffusion, generative
Abstract: The financial industry is increasingly seeking robust methods to address the challenges posed by data scarcity and low signal-to-noise ratios, which limit the application of deep learning techniques in stock market analysis. This paper presents two innovative generative model-based approaches to synthesize stock data, specifically tailored for different scenarios within the A-share market in China. The first method, a sector-based synthesis approach, enhances the signal-to-noise ratio of stock data by classifying the characteristics of stocks from various sectors in China's A-share market. This method employs an Approximate Non-Local Total Variation algorithm to smooth the generated data, a bandpass filtering method based on Fourier Transform to eliminate noise, and Denoising Diffusion Implicit Models to accelerate sampling speed. The second method, a recursive stock data synthesis approach based on pattern recognition, is designed to synthesize data for stocks with short listing periods and limited comparable companies. It leverages pattern recognition techniques and Markov models to learn and generate variable-length stock sequences, while introducing a sub-time-level data augmentation method to alleviate data scarcity. We validate the effectiveness of these methods through extensive experiments on various datasets, including those from the main board, STAR Market, Growth Enterprise Market Board, Beijing Stock Exchange, NASDAQ, NYSE, and AMEX. The results demonstrate that our synthesized data not only improve the performance of predictive models but also enhance the signal-to-noise ratio of individual stock signals in price trading strategies. Furthermore, the introduction of sub-time-level data significantly improves the quality of synthesized data.

Title: On Adversarial Robustness of Language Models in Transfer Learning

Authors: Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, Mykola Pechenizkiy
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00066
Pdf URL: https://arxiv.org/pdf/2501.00066
Copy Paste: [[2501.00066]] On Adversarial Robustness of Language Models in Transfer Learning(https://arxiv.org/abs/2501.00066)
Keywords: security, attack, robust
Abstract: We investigate the adversarial robustness of LLMs in transfer learning scenarios. Through comprehensive experiments on multiple datasets (MBIB Hate Speech, MBIB Political Bias, MBIB Gender Bias) and various model architectures (BERT, RoBERTa, GPT-2, Gemma, Phi), we reveal that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation methods. Our work highlights the crucial need for considering adversarial robustness in transfer learning scenarios and provides insights into maintaining model security without compromising performance. These findings have significant implications for the development and deployment of LLMs in real-world applications where both performance and robustness are paramount.

Title: Adversarial Negotiation Dynamics in Generative Language Models

Authors: Arinbjörn Kolbeinsson, Benedikt Kolbeinsson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00069
Pdf URL: https://arxiv.org/pdf/2501.00069
Copy Paste: [[2501.00069]] Adversarial Negotiation Dynamics in Generative Language Models(https://arxiv.org/abs/2501.00069)
Keywords: secure, security, robust, generative
Abstract: Generative language models are increasingly used for contract drafting and enhancement, creating a scenario where competing parties deploy different language models against each other. This introduces not only a game-theory challenge but also significant concerns related to AI safety and security, as the language model employed by the opposing party can be unknown. These competitive interactions can be seen as adversarial testing grounds, where models are effectively red-teamed to expose vulnerabilities such as generating biased, harmful or legally problematic text. Despite the importance of these challenges, the competitive robustness and safety of these models in adversarial settings remain poorly understood. In this small study, we approach this problem by evaluating the performance and vulnerabilities of major open-source language models in head-to-head competitions, simulating real-world contract negotiations. We further explore how these adversarial interactions can reveal potential risks, informing the development of more secure and reliable models. Our findings contribute to the growing body of research on AI safety, offering insights into model selection and optimisation in competitive legal contexts and providing actionable strategies for mitigating risks.

Title: ICLR: In-Context Learning of Representations

Authors: Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, Hidenori Tanaka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00070
Pdf URL: https://arxiv.org/pdf/2501.00070
Copy Paste: [[2501.00070]] ICLR: In-Context Learning of Representations(https://arxiv.org/abs/2501.00070)
Keywords: large language model
Abstract: Recent work has demonstrated that semantics specified by pretraining data influence how representations of different concepts are organized in a large language model (LLM). However, given the open-ended nature of LLMs, e.g., their ability to in-context learn, we can ask whether models alter these pretraining semantics to adopt alternative, context-specified ones. Specifically, if we provide in-context exemplars wherein a concept plays a different role than what the pretraining data suggests, do models reorganize their representations in accordance with these novel semantics? To answer this question, we take inspiration from the theory of conceptual role semantics and define a toy "graph tracing" task wherein the nodes of the graph are referenced via concepts seen during training (e.g., apple, bird, etc.) and the connectivity of the graph is defined via some predefined structure (e.g., a square grid). Given exemplars that indicate traces of random walks on the graph, we analyze intermediate representations of the model and find that as the amount of context is scaled, there is a sudden re-organization from pretrained semantic representations to in-context representations aligned with the graph structure. Further, we find that when reference concepts have correlations in their semantics (e.g., Monday, Tuesday, etc.), the context-specified graph structure is still present in the representations, but is unable to dominate the pretrained structure. To explain these results, we analogize our task to energy minimization for a predefined graph topology, providing evidence towards an implicit optimization process to infer context-specified semantics. Overall, our findings indicate scaling context-size can flexibly re-organize model representations, possibly unlocking novel capabilities.
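
The "graph tracing" setup described above can be sketched as a data generator: place concept words on the nodes of a square grid and emit random walks as token sequences. This is a hedged illustration only; the 3 x 3 grid size, the word list, the walk length, and the space-separated prompt format are assumptions, not details taken from the paper.

```python
import random

def grid_neighbors(rows, cols):
    """Adjacency lists of a rows x cols square grid, nodes indexed 0..rows*cols-1."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            adj[node] = [
                nr * cols + nc
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return adj

def random_walk(adj, length, rng):
    """Uniform random walk that starts at a random node and follows grid edges."""
    node = rng.choice(sorted(adj))
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

# Hypothetical concept vocabulary. Words are assigned to nodes arbitrarily, so
# adjacency in the grid is unrelated to the words' usual meanings.
concepts = ["apple", "bird", "car", "door", "egg", "fish", "gold", "hat", "ink"]
rng = random.Random(0)
adj = grid_neighbors(3, 3)
walk = random_walk(adj, length=20, rng=rng)
context = " ".join(concepts[n] for n in walk)
print(context)
```

Longer contexts are obtained by increasing `length`, which corresponds to the amount of in-context evidence about the graph structure that the abstract describes scaling.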

Title: Open-Book Neural Algorithmic Reasoning

Authors: Hefei Li, Chao Peng, Chenyang Xu, Zhengfeng Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00072
Pdf URL: https://arxiv.org/pdf/2501.00072
Copy Paste: [[2501.00072]] Open-Book Neural Algorithmic Reasoning(https://arxiv.org/abs/2501.00072)
Keywords: robust
Abstract: Neural algorithmic reasoning is an emerging area of machine learning that focuses on building neural networks capable of solving complex algorithmic tasks. Recent advancements predominantly follow the standard supervised learning paradigm -- feeding an individual problem instance into the network each time and training it to approximate the execution steps of a classical algorithm. We challenge this mode and propose a novel open-book learning framework. In this framework, whether during training or testing, the network can access and utilize all instances in the training dataset when reasoning for a given instance. Empirical evaluation is conducted on the challenging CLRS Algorithmic Reasoning Benchmark, which consists of 30 diverse algorithmic tasks. Our open-book learning framework exhibits a significant enhancement in neural reasoning capabilities. Further, we notice that there is recent literature suggesting that multi-task training on CLRS can improve the reasoning accuracy of certain tasks, implying intrinsic connections between different algorithmic tasks. We delve into this direction via the open-book framework. When the network reasons for a specific task, we enable it to aggregate information from training instances of other tasks in an attention-based manner. We show that this open-book attention mechanism offers insights into the inherent relationships among various tasks in the benchmark and provides a robust tool for interpretable multi-task training.

Title: Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings

Authors: Chunsheng Zuo, Pavel Guerzhoy, Michael Guerzhoy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00073
Pdf URL: https://arxiv.org/pdf/2501.00073
Copy Paste: [[2501.00073]] Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings(https://arxiv.org/abs/2501.00073)
Keywords: transformer
Abstract: Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
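
One simple way to see why nearby positions can end up with similar embeddings is to replace causal attention with a uniform average over the prefix. This is a toy illustration under strong assumptions, not the authors' analysis: real attention weights are not uniform, and the sequence length and dimension below are arbitrary. Position t then holds the mean of tokens 0..t, so adjacent positions share almost all of their terms; for independent zero-mean tokens, the cosine similarity between prefixes of s and t tokens (s < t) is approximately sqrt(s / t).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 64, 256
tokens = rng.standard_normal((seq_len, dim))   # random token embeddings, no positional signal

# Uniform causal attention: position t averages tokens 0..t.
# Assumption: this stands in for attention with near-constant logits.
prefix_means = np.cumsum(tokens, axis=0) / np.arange(1, seq_len + 1)[:, None]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

near = cosine(prefix_means[40], prefix_means[41])   # adjacent positions
far = cosine(prefix_means[40], prefix_means[5])     # distant positions
print(f"adjacent positions: {near:.3f}, distant positions: {far:.3f}")
```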

Title: A Novel Framework for Learning Stochastic Representations for Sequence Generation and Recognition

Authors: Jungsik Hwang, Ahmadreza Ahmadi
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2501.00076
Pdf URL: https://arxiv.org/pdf/2501.00076
Copy Paste: [[2501.00076]] A Novel Framework for Learning Stochastic Representations for Sequence Generation and Recognition(https://arxiv.org/abs/2501.00076)
Keywords: robust
Abstract: The ability to generate and recognize sequential data is fundamental for autonomous systems operating in dynamic environments. Inspired by the key principles of the brain -- predictive coding and the Bayesian brain -- we propose a novel stochastic Recurrent Neural Network with Parametric Biases (RNNPB). The proposed model incorporates stochasticity into the latent space using the reparameterization trick used in variational autoencoders. This approach enables the model to learn probabilistic representations of multidimensional sequences, capturing uncertainty and enhancing robustness against overfitting. We tested the proposed model on a robotic motion dataset to assess its performance in generating and recognizing temporal patterns. The experimental results showed that the stochastic RNNPB model outperformed its deterministic counterpart in generating and recognizing motion sequences. The results highlighted the proposed model's capability to quantify and adjust uncertainty during both learning and inference. The stochasticity resulted in a continuous latent space representation, facilitating stable motion generation and enhanced generalization when recognizing novel sequences. Our approach provides a biologically inspired framework for modeling temporal patterns and advances the development of robust and adaptable systems in artificial intelligence and robotics.
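
The reparameterization trick mentioned above draws a latent sample as a deterministic function of a mean, a variance, and independent noise, which keeps the sample differentiable with respect to the mean and variance. A minimal NumPy sketch follows; the three-dimensional latent vector and its parameter values are hypothetical and are not taken from the paper's RNNPB model.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])          # hypothetical mean of a parametric-bias vector
log_var = np.array([-2.0, -2.0, -2.0])   # hypothetical log-variance

samples = np.stack([sample_latent(mu, log_var, rng) for _ in range(10000)])
print("empirical mean:", samples.mean(axis=0))
print("empirical std:", samples.std(axis=0))
```

The empirical mean and standard deviation should be close to `mu` and `exp(0.5 * log_var)`, which is what makes the learned parameters interpretable as an uncertainty estimate.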

Title: Machine Learning-Based Security Policy Analysis

Authors: Krish Jain, Joann Sum, Pranav Kapoor, Amir Eaman
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2501.00085
Pdf URL: https://arxiv.org/pdf/2501.00085
Copy Paste: [[2501.00085]] Machine Learning-Based Security Policy Analysis(https://arxiv.org/abs/2501.00085)
Keywords: security, robust
Abstract: Security-Enhanced Linux (SELinux) is a robust security mechanism that enforces mandatory access controls (MAC), but its policy language's complexity creates challenges for policy analysis and management. This research investigates the automation of SELinux policy analysis using graph-based techniques combined with machine learning approaches to detect policy anomalies. The study addresses two key questions: Can SELinux policy analysis be automated through graph analysis, and how do different anomaly detection models compare in analyzing SELinux policies? We compare different machine learning models by evaluating their effectiveness in detecting policy violations and anomalies. Our approach utilizes Neo4j for graph representation of policies, with Node2vec transforming these graph structures into meaningful vector embeddings that can be processed by our machine learning models. In our results, the MLP Neural Network consistently demonstrated superior performance across different dataset sizes, achieving 95% accuracy with balanced precision and recall metrics, while both Random Forest and SVM models showed competitive but slightly lower performance in detecting policy violations. This combination of graph-based modeling and machine learning provides a more sophisticated and automated approach to understanding and analyzing complex SELinux policies compared to traditional manual analysis methods.

Title: LTX-Video: Realtime Video Latent Diffusion

Authors: Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, Ofir Bibi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00103
Pdf URL: https://arxiv.org/pdf/2501.00103
Copy Paste: [[2501.00103]] LTX-Video: Realtime Video Latent Diffusion(https://arxiv.org/abs/2501.00103)
Keywords: diffusion, transformer
Abstract: We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
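
Note: the figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only a worked calculation; the helper names are hypothetical, and the last step assumes RGB input (3 channels) and that the 1:192 ratio counts scalar pixel values against scalar latent values, neither of which the abstract states.

```python
def realtime_factor(video_seconds, generation_seconds):
    """Seconds of video produced per second of compute."""
    return video_seconds / generation_seconds


def values_per_token(patch_h, patch_w, patch_t, channels=3):
    """Number of input pixel values covered by one latent token."""
    return patch_h * patch_w * patch_t * channels


frames = 5 * 24                              # frames in a 5 s clip at 24 fps
fps_generated = frames / 2                   # frames produced per second of compute
speedup = realtime_factor(5, 2)              # ratio of playback time to compute time

# Assumption: RGB frames, ratio measured in scalar values.
pixel_values = values_per_token(32, 32, 8)   # pixel values per token
latent_values = pixel_values // 192          # implied latent values per token
```

Under these assumptions the clip has 120 frames, generation runs at 60 frames per second (2.5 times playback speed), and each token covers 24576 pixel values, which a 1:192 ratio would reduce to 128 latent values.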

Title: Text-to-Image GAN with Pretrained Representations

Authors: Xiaozhou You, Jian Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00116
Pdf URL: https://arxiv.org/pdf/2501.00116
Copy Paste: [[2501.00116]] Text-to-Image GAN with Pretrained Representations(https://arxiv.org/abs/2501.00116)
Keywords: diffusion
Abstract: Generating desired images conditioned on given text descriptions has received lots of attention. Recently, diffusion models and autoregressive models have demonstrated their outstanding expressivity and gradually replaced GAN as the favored architectures for text-to-image synthesis. However, they still face some obstacles: slow inference speed and expensive training costs. To achieve more powerful and faster text-to-image synthesis under complex scenes, we propose TIGER, a text-to-image GAN with pretrained representations. To be specific, we propose a vision-empowered discriminator and a high-capacity generator. (i) The vision-empowered discriminator absorbs the complex scene understanding ability and the domain generalization ability from pretrained vision models to enhance model performance. Unlike previous works, we explore stacking multiple pretrained models in our discriminator to collect multiple different representations. (ii) The high-capacity generator aims to achieve effective text-image fusion while increasing the model capacity. The high-capacity generator consists of multiple novel high-capacity fusion blocks (HFBlock). Each HFBlock contains several deep fusion modules and a global fusion module, which play different roles to benefit our model. Extensive experiments demonstrate the outstanding performance of our proposed TIGER both on standard and zero-shot text-to-image synthesis tasks. On the standard text-to-image synthesis task, TIGER achieves state-of-the-art performance on two challenging datasets, obtaining new FID scores of 5.48 (COCO) and 9.38 (CUB). On the zero-shot text-to-image synthesis task, we achieve comparable performance with fewer model parameters, smaller training data size and faster inference speed. Additionally, more experiments and analyses are conducted in the Supplementary Material.

Title: PQD: Post-training Quantization for Efficient Diffusion Models

Authors: Jiaojiao Ye, Zhen Wang, Linnan Jiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00124
Pdf URL: https://arxiv.org/pdf/2501.00124
Copy Paste: [[2501.00124]] PQD: Post-training Quantization for Efficient Diffusion Models(https://arxiv.org/abs/2501.00124)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have demonstrated remarkable achievements in synthesizing images of high fidelity and diversity. However, the extensive computational requirements and slow generative speed of diffusion models have limited their widespread adoption. In this paper, we propose a novel post-training quantization for diffusion models (PQD), which is a time-aware optimization framework for diffusion models based on post-training quantization. The proposed framework optimizes the inference process by selecting representative samples and conducting time-aware calibration. Experimental results show that our proposed method is able to directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, with a change of only a few FID points on ImageNet for unconditional image generation. Our approach demonstrates compatibility and can also be applied to 512x512 text-guided image generation for the first time.
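
Note: the sketch below illustrates generic post-training quantization with calibration done separately per group of denoising steps. It is not the PQD algorithm: the abstract does not describe how samples are selected or how calibration is made time-aware, so the min/max affine scheme, the "early"/"late" grouping, and the activation values are all assumptions chosen for illustration.

```python
def calibrate(values, num_bits=8):
    """Derive an affine (scale, zero_point) pair from calibration samples."""
    lo, hi = min(values), max(values)
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to an unsigned integer code, clipping to the valid range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical activation samples recorded at two groups of denoising steps.
calibration_by_step = {
    "early": [-4.0, -1.5, 0.3, 3.9],
    "late": [-0.5, -0.1, 0.2, 0.45],
}
params = {name: calibrate(vals) for name, vals in calibration_by_step.items()}


def roundtrip(x, step_group):
    """Quantize then dequantize x using the parameters of one step group."""
    scale, zero_point = params[step_group]
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)
```

Because the "late" group spans a narrower range, its step size is smaller, so the same value is reconstructed more precisely there than with the "early" parameters. That difference is the motivation for calibrating per time step rather than once globally.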

Title: Detection-Fusion for Knowledge Graph Extraction from Videos

Authors: Taniya Das, Louis Mahon, Thomas Lukasiewicz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00136
Pdf URL: https://arxiv.org/pdf/2501.00136
Copy Paste: [[2501.00136]] Detection-Fusion for Knowledge Graph Extraction from Videos(https://arxiv.org/abs/2501.00136)
Keywords: extraction
Abstract: One of the challenging tasks in the field of video understanding is extracting semantic content from video inputs. Most existing systems use language models to describe videos in natural language sentences, but this has several major shortcomings. Such systems can rely too heavily on the language model component and base their output on statistical regularities in natural language text rather than on the visual contents of the video. Additionally, natural language annotations cannot be readily processed by a computer, are difficult to evaluate with performance metrics and cannot be easily translated into a different natural language. In this paper, we propose a method to annotate videos with knowledge graphs, and so avoid these problems. Specifically, we propose a deep-learning-based model for this task that first predicts pairs of individuals and then the relations between them. Additionally, we propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.

Title: Minimalist Vision with Freeform Pixels

Authors: Jeremy Klotz, Shree K. Nayar
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.00142
Pdf URL: https://arxiv.org/pdf/2501.00142
Copy Paste: [[2501.00142]] Minimalist Vision with Freeform Pixels(https://arxiv.org/abs/2501.00142)
Keywords: privacy
Abstract: A minimalist vision system uses the smallest number of pixels needed to solve a vision task. While traditional cameras use a large grid of square pixels, a minimalist camera uses freeform pixels that can take on arbitrary shapes to increase their information content. We show that the hardware of a minimalist camera can be modeled as the first layer of a neural network, where the subsequent layers are used for inference. Training the network for any given task yields the shapes of the camera's freeform pixels, each of which is implemented using a photodetector and an optical mask. We have designed minimalist cameras for monitoring indoor spaces (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels). The performance demonstrated by these systems is on par with a traditional camera with orders of magnitude more pixels. Minimalist vision has two major advantages. First, it naturally tends to preserve the privacy of individuals in the scene since the captured information is inadequate for extracting visual details. Second, since the number of measurements made by a minimalist camera is very small, we show that it can be fully self-powered, i.e., function without an external power supply or a battery.

Title: Temporal reasoning for timeline summarisation in social media

Authors: Jiayu Song, Mahmud Akhter, Dana Atzil Slonim, Maria Liakata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00152
Pdf URL: https://arxiv.org/pdf/2501.00152
Copy Paste: [[2501.00152]] Temporal reasoning for timeline summarisation in social media(https://arxiv.org/abs/2501.00152)
Keywords: large language model
Abstract: This paper explores whether enhancing temporal reasoning capabilities in Large Language Models (LLMs) can improve the quality of timeline summarization, the task of summarising long texts containing sequences of events, particularly social media threads. We introduce NarrativeReason, a novel dataset focused on temporal relationships among sequential events within narratives, distinguishing it from existing temporal reasoning datasets that primarily address pair-wise event relationships. Our approach then combines temporal reasoning with timeline summarization through a knowledge distillation framework, where we first fine-tune a teacher model on temporal reasoning tasks and then distill this knowledge into a student model while simultaneously training it for the task of timeline summarization. Experimental results demonstrate that our model achieves superior performance on mental health-related timeline summarization tasks, which involve long social media threads with repetitions of events and a mix of emotions, highlighting the importance of leveraging temporal reasoning to improve timeline summarisation.

Title: Measuring Large Language Models Capacity to Annotate Journalistic Sourcing

Authors: Subramaniam Vincent, Phoebe Wang, Zhan Shi, Sahas Koka, Yi Fang
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2501.00164
Pdf URL: https://arxiv.org/pdf/2501.00164
Copy Paste: [[2501.00164]] Measuring Large Language Models Capacity to Annotate Journalistic Sourcing(https://arxiv.org/abs/2501.00164)
Keywords: large language model
Abstract: Since the launch of ChatGPT in late 2022, the capacities of Large Language Models and their evaluation have been in constant discussion and evaluation both in academic research and in the industry. Scenarios and benchmarks have been developed in several areas such as law, medicine and math (Bommasani et al., 2023) and there is continuous evaluation of model variants. One area that has not received sufficient scenario development attention is journalism, and in particular journalistic sourcing and ethics. Journalism is a crucial truth-determination function in democracy (Vincent, 2023), and sourcing is a crucial pillar to all original journalistic output. Evaluating the capacities of LLMs to annotate stories for the different signals of sourcing and how reporters justify them is a crucial scenario that warrants a benchmark approach. It offers potential to build automated systems to contrast more transparent and ethically rigorous forms of journalism with everyday fare. In this paper we lay out a scenario to evaluate LLM performance on identifying and annotating sourcing in news stories on a five-category schema inspired by journalism studies (Gans, 2004). We offer the use case, our dataset, and metrics as the first step towards systematic benchmarking. Our accuracy findings indicate LLM-based approaches have more catching up to do in identifying all the sourced statements in a story, and equally, in matching the type of sources. An even harder task is spotting source justifications.

Title: Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection

Authors: Hongrui Shi, Valentin Radu, Po Yang
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2501.00170
Pdf URL: https://arxiv.org/pdf/2501.00170
Copy Paste: [[2501.00170]] Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection(https://arxiv.org/abs/2501.00170)
Keywords: privacy, federate
Abstract: With the rapid expansion of edge devices, such as IoT devices, where crucial data needed for machine learning applications is generated, it becomes essential to promote their participation in privacy-preserving Federated Learning (FL) systems. The best way to achieve this desideratum is by reducing their training workload to match their constrained computational resources. While prior FL research has addressed the workload constraints by introducing lightweight models on the edge, limited attention has been given to optimizing on-device training efficiency through reducing the amount of data needed during training. In this work, we propose FedFT-EDS, a novel approach that combines Fine-Tuning of partial client models with Entropy-based Data Selection to reduce training workloads on edge devices. By actively selecting the most informative local instances for learning, FedFT-EDS reduces training data significantly in FL and demonstrates that not all user data is equally beneficial for FL on all rounds. Our experiments on CIFAR-10 and CIFAR-100 show that FedFT-EDS uses only 50% user data while improving the global model performance compared to baseline methods, FedAvg and FedProx. Importantly, FedFT-EDS improves client learning efficiency by up to 3 times, using one third of training time on clients to achieve an equivalent performance to the baselines. This work highlights the importance of data selection in FL and presents a promising pathway to scalable and efficient Federated Learning.
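
Note: entropy-based data selection, as named in this abstract, could look roughly like the sketch below. The abstract does not give the selection rule, so treating the highest-entropy predictions as "most informative", the helper names, and the toy probabilities are all assumptions. Only the 50% fraction comes from the abstract.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_entropy(pred_probs, fraction=0.5):
    """Return indices of the highest-entropy samples, in original order.

    pred_probs holds one class-probability list per local sample, as
    produced by the current model on the client's data.
    """
    keep = max(1, int(len(pred_probs) * fraction))
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]),
                    reverse=True)
    return sorted(ranked[:keep])


# Toy predictions for four local samples over two classes.
predictions = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
selected = select_by_entropy(predictions, fraction=0.5)
```

In this toy case the two samples the model is least sure about (indices 1 and 3) are kept for training, and the confidently predicted ones are skipped for the round.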

Title: The Text Classification Pipeline: Starting Shallow going Deeper

Authors: Marco Siino, Ilenia Tinnirello, Marco La Cascia
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.00174
Pdf URL: https://arxiv.org/pdf/2501.00174
Copy Paste: [[2501.00174]] The Text Classification Pipeline: Starting Shallow going Deeper(https://arxiv.org/abs/2501.00174)
Keywords: extraction
Abstract: Text Classification (TC) stands as a cornerstone within the realm of Natural Language Processing (NLP), particularly when viewed through the lens of computer science and engineering. The past decade has seen deep learning revolutionize TC, propelling advancements in text retrieval, categorization, information extraction, and summarization. The scholarly literature is rich with datasets, models, and evaluation criteria, with English being the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others. The efficacy of TC models relies heavily on their ability to capture intricate textual relationships and nonlinear correlations, necessitating a comprehensive examination of the entire TC pipeline. This monograph provides an in-depth exploration of the TC pipeline, with a particular emphasis on evaluating the impact of each component on the overall performance of TC models. The pipeline includes state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, current results and future trends. Each chapter meticulously examines these stages, presenting technical innovations and significant recent findings. The work critically assesses various classification strategies, offering comparative analyses, examples, case studies, and experimental evaluations. These contributions extend beyond a typical survey, providing a detailed and insightful exploration of TC.

Title: TrajLearn: Trajectory Prediction Learning using Deep Generative Models

Authors: Amirhossein Nadiri, Jing Li, Ali Faraji, Ghadeer Abuoda, Manos Papagelis
Subjects: cs.LG, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.00184
Pdf URL: https://arxiv.org/pdf/2501.00184
Copy Paste: [[2501.00184]] TrajLearn: Trajectory Prediction Learning using Deep Generative Models(https://arxiv.org/abs/2501.00184)
Keywords: generative
Abstract: Trajectory prediction aims to estimate an entity's future path using its current position and historical movement data, benefiting fields like autonomous navigation, robotics, and human movement analytics. Deep learning approaches have become key in this area, utilizing large-scale trajectory datasets to model movement patterns, but face challenges in managing complex spatial dependencies and adapting to dynamic environments. To address these challenges, we introduce TrajLearn, a novel model for trajectory prediction that leverages generative modeling of higher-order mobility flows based on hexagonal spatial representation. TrajLearn predicts the next $k$ steps by integrating a customized beam search for exploring multiple potential paths while maintaining spatial continuity. We conducted a rigorous evaluation of TrajLearn, benchmarking it against leading state-of-the-art approaches and meaningful baselines. The results indicate that TrajLearn achieves significant performance gains, with improvements of up to ~40% across multiple real-world trajectory datasets. In addition, we evaluated different prediction horizons (i.e., various values of $k$), conducted resolution sensitivity analysis, and performed ablation studies to assess the impact of key model components. Furthermore, we developed a novel algorithm to generate mixed-resolution maps by hierarchically subdividing hexagonal regions into finer segments within a specified observation area. This approach supports selective detailing, applying finer resolution to areas of interest or high activity (e.g., urban centers) while using coarser resolution for less significant regions (e.g., rural areas), effectively reducing data storage requirements and computational overhead. We promote reproducibility and adaptability by offering complete code, data, and detailed documentation with flexible configuration options for various applications.
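
Note: a beam search that keeps predicted paths spatially continuous on a hexagonal grid can be sketched as follows. This is not TrajLearn's customized beam search. The axial hex coordinates, the restriction of each step to the six adjacent cells, and the toy scoring function standing in for a trained generative model are assumptions for illustration.

```python
import math

# The six neighbor offsets of a hexagon in axial (q, r) coordinates.
HEX_DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def hex_neighbors(cell):
    """Cells that share an edge with the given cell."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in HEX_DIRECTIONS]


def beam_search(start, k, beam_width, log_prob):
    """Predict the next k cells, keeping the beam_width best partial paths.

    log_prob(path, next_cell) is a stand-in for a model's conditional
    log-probability. Only adjacent cells are expanded, so every path
    remains spatially continuous.
    """
    beams = [([start], 0.0)]
    for _ in range(k):
        candidates = []
        for path, score in beams:
            for nxt in hex_neighbors(path[-1]):
                candidates.append((path + [nxt], score + log_prob(path, nxt)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def eastward_log_prob(path, nxt):
    """Toy scorer: 0.7 probability of moving east, 0.06 for each other move."""
    step = (nxt[0] - path[-1][0], nxt[1] - path[-1][1])
    return math.log(0.7) if step == (1, 0) else math.log(0.06)


best_path, best_score = beam_search((0, 0), k=3, beam_width=2,
                                    log_prob=eastward_log_prob)[0]
```

With the toy scorer the top beam simply heads east for three steps. A wider beam keeps more alternative paths alive at each step at a proportional cost in model evaluations.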

Title: Interactive cybersecurity training system based on simulation environments

Authors: Dmytro Tymoshchuk, Vasyl Yatskiv, Vitaliy Tymoshchuk, Nataliya Yatskiv
Subjects: cs.CR, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2501.00186
Pdf URL: https://arxiv.org/pdf/2501.00186
Copy Paste: [[2501.00186]] Interactive cybersecurity training system based on simulation environments(https://arxiv.org/abs/2501.00186)
Keywords: security
Abstract: Rapid progress in the development of information technology has led to a significant increase in the number and complexity of cyber threats. Traditional methods of cybersecurity training based on theoretical knowledge do not provide a sufficient level of practical skills to effectively counter real threats. The article explores the possibilities of integrating simulation environments into the cybersecurity training process as an effective approach to improving the quality of training. The article presents the architecture of a simulation environment based on a cluster of KVM hypervisors, which allows creating scalable and flexible platforms at minimal cost. The article describes the implementation of various scenarios using open source software tools such as pfSense, OPNsense, Security Onion, Kali Linux, Parrot Security OS, Ubuntu Linux, Oracle Linux, FreeBSD, and others, which create realistic conditions for practical training.

Title: MLLM-as-a-Judge for Image Safety without Human Labeling

Authors: Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain
Subjects: cs.CV, cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00192
Pdf URL: https://arxiv.org/pdf/2501.00192
Copy Paste: [[2501.00192]] MLLM-as-a-Judge for Image Safety without Human Labeling(https://arxiv.org/abs/2501.00192)
Keywords: large language model
Abstract: Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experiment results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.

Title: A Pseudo-random Number Generator for Multi-Sequence Generation with Programmable Statistics

Authors: Jianan Wu, Ahmet Yusuf Salim, Eslam Elmitwalli, Selçuk Köse, Zeljko Ignjatovic
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2501.00193
Pdf URL: https://arxiv.org/pdf/2501.00193
Copy Paste: [[2501.00193]] A Pseudo-random Number Generator for Multi-Sequence Generation with Programmable Statistics(https://arxiv.org/abs/2501.00193)
Keywords: security
Abstract: Pseudo-random number generators (PRNGs) are essential in a wide range of applications, from cryptography to statistical simulations and optimization algorithms. While uniform randomness is crucial for security-critical areas like cryptography, many domains, such as simulated annealing and CMOS-based Ising Machines, benefit from controlled or non-uniform randomness to enhance solution exploration and optimize performance. This paper presents a hardware PRNG that can simultaneously generate multiple uncorrelated sequences with programmable statistics tailored to specific application needs. Designed in a 65nm process, the PRNG occupies an area of approximately 0.0013mm^2 and has an energy consumption of 0.57pJ/bit. Simulations confirm the PRNG's effectiveness in modulating the statistical distribution while demonstrating high-quality randomness properties.

Title: Towards Unraveling and Improving Generalization in World Models

Authors: Qiaoyi Fang, Weiyu Du, Hang Wang, Junshan Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00195
Pdf URL: https://arxiv.org/pdf/2501.00195
Copy Paste: [[2501.00195]] Towards Unraveling and Improving Generalization in World Models(https://arxiv.org/abs/2501.00195)
Keywords: robust
Abstract: World models have recently emerged as a promising approach to reinforcement learning (RL), achieving state-of-the-art performance across a wide range of visual control tasks. This work aims to obtain a deep understanding of the robustness and generalization capabilities of world models. Thus motivated, we develop a stochastic differential equation formulation by treating the world model learning as a stochastic dynamical system, and characterize the impact of latent representation errors on robustness and generalization, for both cases with zero-drift representation errors and with non-zero-drift representation errors. Our somewhat surprising findings, based on both theoretic and experimental studies, reveal that for the case with zero drift, modest latent representation errors can in fact function as implicit regularization and hence result in improved robustness. We further propose a Jacobian regularization scheme to mitigate the compounding error propagation effects of non-zero drift, thereby enhancing training stability and robustness. Our experimental studies corroborate that this regularization approach not only stabilizes training but also accelerates convergence and improves accuracy of long-horizon prediction.

Title: GPT-4 on Clinic Depression Assessment: An LLM-Based Pilot Study

Authors: Giuliano Lorenzoni, Pedro Elkind Velmovitsky, Paulo Alencar, Donald Cowan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00199
Pdf URL: https://arxiv.org/pdf/2501.00199
Copy Paste: [[2501.00199]] GPT-4 on Clinic Depression Assessment: An LLM-Based Pilot Study(https://arxiv.org/abs/2501.00199)
Keywords: large language model
Abstract: Depression has impacted millions of people worldwide and has become one of the most prevalent mental disorders. Early mental disorder detection can lead to cost savings for public health agencies and avoid the onset of other major comorbidities. Additionally, the shortage of specialized personnel is a critical issue because clinical depression diagnosis is highly dependent on expert professionals and is time consuming. In this study, we explore the use of GPT-4 for clinical depression assessment based on transcript analysis. We examine the model's ability to classify patient interviews into binary categories: depressed and not depressed. A comparative analysis is conducted considering prompt complexity (e.g., using both simple and complex prompts) as well as varied temperature settings to assess the impact of prompt complexity and randomness on the model's performance. Results indicate that GPT-4 exhibits considerable variability in accuracy and F1-Score across configurations, with optimal performance observed at lower temperature values (0.0-0.2) for complex prompts. However, beyond a certain threshold (temperature >= 0.3), the relationship between randomness and performance becomes unpredictable, diminishing the gains from prompt complexity. These findings suggest that, while GPT-4 shows promise for clinical assessment, the configuration of the prompts and model parameters requires careful calibration to ensure consistent results. This preliminary study contributes to understanding the dynamics between prompt engineering and large language models, offering insights for future development of AI-powered tools in clinical settings.
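
Note: the role of the temperature setting discussed in this abstract can be illustrated with standard temperature-scaled softmax. This is a generic sketch, not the study's setup: how GPT-4 applies temperature internally is not described in the abstract, and the two logits for the labels "depressed" and "not depressed" are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0.0:
        # Temperature 0 is treated as greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the labels ("depressed", "not depressed").
logits = [1.0, 0.5]
low_temp = softmax_with_temperature(logits, 0.2)
high_temp = softmax_with_temperature(logits, 1.0)
```

At the lower temperature the distribution concentrates on the higher-scoring label, so repeated runs agree more often. At higher temperatures the probabilities flatten and sampled labels vary more between runs.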

Title: An Empirical Evaluation of Large Language Models on Consumer Health Questions

Authors: Moaiz Abrar, Yusuf Sermet, Ibrahim Demir
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00208
Pdf URL: https://arxiv.org/pdf/2501.00208
Copy Paste: [[2501.00208]] An Empirical Evaluation of Large Language Models on Consumer Health Questions(https://arxiv.org/abs/2501.00208)
Keywords: large language model
Abstract: This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer-based medical questions and answers by verified experts extracted from the AskDocs subreddit. While LLMs have shown proficiency in clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-based, medical questions remains less understood. MedRedQA presents unique challenges, such as informal language and the need for precise responses suited to non-specialist queries. To assess model performance, responses were generated using five LLMs: GPT-4o mini, Llama 3.1: 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. A cross-evaluation method was used, where each model evaluated its responses as well as those of others to minimize bias. The results indicated that GPT-4o mini achieved the highest alignment with expert responses according to four out of the five models' judges, while Mistral-7B scored lowest according to three out of five models' judges. This study highlights the potential and limitations of current LLMs for consumer health medical question answering, indicating avenues for further development.

Title: OciorMVBA: Near-Optimal Error-Free Asynchronous MVBA

Authors: Jinyuan Chen
Subjects: cs.CR, cs.DC, cs.IT
Abstract URL: https://arxiv.org/abs/2501.00214
Pdf URL: https://arxiv.org/pdf/2501.00214
Copy Paste: [[2501.00214]] OciorMVBA: Near-Optimal Error-Free Asynchronous MVBA(https://arxiv.org/abs/2501.00214)
Keywords: secure
Abstract: In this work, we propose an error-free, information-theoretically secure, asynchronous multi-valued validated Byzantine agreement (MVBA) protocol, called OciorMVBA. This protocol achieves MVBA consensus on a message $\boldsymbol{w}$ with expected $O(n |\boldsymbol{w}|\log n + n^2 \log q)$ communication bits, expected $O(n^2)$ messages, expected $O(\log n)$ rounds, and expected $O(\log n)$ common coins, under optimal resilience $n \geq 3t + 1$ in an $n$-node network, where up to $t$ nodes may be dishonest. Here, $q$ denotes the alphabet size of the error correction code used in the protocol. When error correction codes with a constant alphabet size (e.g., Expander Codes) are used, $q$ becomes a constant. An MVBA protocol that guarantees all required properties without relying on any cryptographic assumptions, such as signatures or hashing, except for the common coin assumption, is said to be information-theoretically secure (IT secure). Under the common coin assumption, an MVBA protocol that guarantees all required properties in all executions is said to be error-free. We also propose another error-free, IT-secure, asynchronous MVBA protocol, called OciorMVBArr. This protocol achieves MVBA consensus with expected $O(n |\boldsymbol{w}| + n^2 \log n)$ communication bits, expected $O(1)$ rounds, and expected $O(1)$ common coins, under a relaxed resilience (RR) of $n \geq 5t + 1$. Additionally, we propose a hash-based asynchronous MVBA protocol, called OciorMVBAh. This protocol achieves MVBA consensus with expected $O(n |\boldsymbol{w}| + n^3)$ bits, expected $O(1)$ rounds, and expected $O(1)$ common coins, under optimal resilience $n \geq 3t + 1$.
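
Note: the two resilience conditions in this abstract translate directly into the number of dishonest nodes a network of a given size can tolerate. The helper below is only a worked calculation of those inequalities; its name and the example network sizes are not from the paper.

```python
def max_faulty_nodes(n, k):
    """Largest t that satisfies the resilience condition n >= k * t + 1."""
    return (n - 1) // k


for n in (4, 16, 31):
    optimal = max_faulty_nodes(n, 3)    # n >= 3t + 1 (OciorMVBA, OciorMVBAh)
    relaxed = max_faulty_nodes(n, 5)    # n >= 5t + 1 (OciorMVBArr)
    print(f"n={n}: t<={optimal} under 3t+1, t<={relaxed} under 5t+1")
```

The relaxed condition tolerates fewer dishonest nodes for the same network size, which is the price OciorMVBArr pays for its expected O(1) rounds and common coins.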

Title: DecoratingFusion: A LiDAR-Camera Fusion Network with the Combination of Point-level and Feature-level Fusion

Authors: Zixuan Yin, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00220
Pdf URL: https://arxiv.org/pdf/2501.00220
Copy Paste: [[2501.00220]] DecoratingFusion: A LiDAR-Camera Fusion Network with the Combination of Point-level and Feature-level Fusion(https://arxiv.org/abs/2501.00220)
Keywords: interpretability
Abstract: Lidars and cameras play essential roles in autonomous driving, offering complementary information for 3D detection. The state-of-the-art fusion methods integrate them at the feature level, but they mostly rely on the learned soft association between point clouds and images, which lacks interpretability and neglects the hard association between them. In this paper, we combine feature-level fusion with point-level fusion, using hard association established by the calibration matrices to guide the generation of object queries. Specifically, in the early fusion stage, we use the 2D CNN features of images to decorate the point cloud data, and employ two independent sparse convolutions to extract the decorated point cloud features. In the mid-level fusion stage, we initialize the queries with a center heatmap and embed the predicted class labels as auxiliary information into the queries, making the initial positions closer to the actual centers of the targets. Extensive experiments conducted on two popular datasets, i.e., KITTI and Waymo, demonstrate the superiority of DecoratingFusion.
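Sketch: the "hard association" mentioned above can be pictured as projecting each LiDAR point into the image with a calibration matrix and appending the image feature found at that pixel. The following is a minimal pure-Python illustration under stated assumptions: a single 3x4 projection matrix `P`, nearest-pixel lookup by flooring, and dropping points that fall outside the image. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def project_point(P, xyz):
    """Project a 3D point to pixel coordinates (u, v) with a 3x4 matrix P."""
    h = [xyz[0], xyz[1], xyz[2], 1.0]  # homogeneous coordinates
    u, v, w = (sum(P[r][c] * h[c] for c in range(4)) for r in range(3))
    if w <= 0:
        return None  # point is behind the camera plane
    return u / w, v / w


def decorate(points, P, feat_map):
    """Append the image feature at each point's pixel to the point itself.

    feat_map[row][col] is a list of feature channels (assumed layout).
    Points that project outside the image are dropped (an assumption).
    """
    height, width = len(feat_map), len(feat_map[0])
    decorated = []
    for p in points:
        uv = project_point(P, p)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= row < height and 0 <= col < width:
            decorated.append(list(p) + list(feat_map[row][col]))
    return decorated
```

Because the pixel for each point is fixed by calibration rather than learned, the association is easy to inspect: every decorated point carries exactly the feature vector of one known pixel.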

Title: Extracting effective solutions hidden in large language models via generated comprehensive specialists: case studies in developing electronic devices

Authors: Hikari Tomita, Nobuhiro Nakamura, Shoichi Ishida, Toshio Kamiya, Kei Terayama
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00224
Pdf URL: https://arxiv.org/pdf/2501.00224
Copy Paste: [[2501.00224]] Extracting effective solutions hidden in large language models via generated comprehensive specialists: case studies in developing electronic devices(https://arxiv.org/abs/2501.00224)
Keywords: extraction, large language model
Abstract: Recently, many studies have increasingly explored the use of large language models (LLMs) to generate research ideas and scientific hypotheses. However, real-world research and development often require solving complex, interdisciplinary challenges where solutions may not be readily found through existing knowledge related to the problem. Therefore, it is desirable to leverage the vast, comprehensive knowledge of LLMs to generate effective, breakthrough solutions by integrating various perspectives from other disciplines. Here, we propose SELLM (Solution Enumeration via comprehensive List and LLM), a framework leveraging LLMs and structured guidance using MECE (Mutually Exclusive, Collectively Exhaustive) principles, such as International Patent Classification (IPC) and the periodic table of elements. SELLM systematically constructs comprehensive expert agents from the list to generate cross-disciplinary and effective solutions. To evaluate SELLM's practicality, we applied it to two challenges: improving light extraction in organic light-emitting diode (OLED) lighting and developing electrodes for next-generation memory materials. The results demonstrate that SELLM significantly facilitates the generation of effective solutions compared to cases without specific customization or effort, showcasing the potential of SELLM to enable LLMs to generate effective solutions even for challenging problems.

Title: Federated Deep Subspace Clustering

Authors: Yupei Zhang, Ruojia Feng, Yifei Wang, Xuequn Shang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2501.00230
Pdf URL: https://arxiv.org/pdf/2501.00230
Copy Paste: [[2501.00230]] Federated Deep Subspace Clustering(https://arxiv.org/abs/2501.00230)
Keywords: protect, federate
Abstract: This paper introduces FDSC, a privacy-protected subspace clustering (SC) approach with a federated learning (FL) schema. In each client, there is a deep subspace clustering network accounting for grouping the isolated data, composed of an encoder network, a self-expressive layer, and a decoder network. FDSC is achieved by uploading the encoder network to communicate with other clients through the server. Besides, FDSC is also enhanced by preserving the local neighborhood relationship in each client. With the effects of federated learning and locality preservation, the learned data features from the encoder are boosted so as to enhance the self-expressiveness learning and result in better clustering performance. Experiments on public datasets compare FDSC with other clustering methods, demonstrating the effectiveness of FDSC.

Title: Zero-Shot Strategies for Length-Controllable Summarization

Authors: Fabian Retkowski, Alexander Waibel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00233
Pdf URL: https://arxiv.org/pdf/2501.00233
Copy Paste: [[2501.00233]] Zero-Shot Strategies for Length-Controllable Summarization(https://arxiv.org/abs/2501.00233)
Keywords: large language model
Abstract: Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs' length control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model. To address these challenges, we introduce a set of methods: length approximation, target adjustment, sample filtering, and automated revisions. By combining these methods, we demonstrate substantial improvements in length compliance while maintaining or enhancing summary quality, providing highly effective zero-shot strategies for precise length control without the need for model fine-tuning or architectural changes. With our work, we not only advance our understanding of LLM behavior in controlled text generation but also pave the way for more reliable and adaptable summarization systems in real-world applications.
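Sketch: of the methods listed above, sample filtering is the simplest to picture: generate several candidate summaries and keep the one closest to the requested length. The snippet below assumes length is measured in whitespace-separated words and that candidates are already generated; the function name `pick_closest` is hypothetical, and the paper's exact measures and selection rule may differ.

```python
def pick_closest(candidates, target_words):
    """Return the candidate whose word count is closest to target_words.

    Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate summary")
    return min(candidates, key=lambda s: abs(len(s.split()) - target_words))


if __name__ == "__main__":
    samples = [
        "Short summary here.",
        "A somewhat longer summary of the text.",
        "A much longer summary that keeps going well past the limit.",
    ]
    print(pick_closest(samples, target_words=7))
```

No fine-tuning is involved here: the filter only selects among outputs the model already produced, which matches the zero-shot setting the abstract describes.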

Title: Make Domain Shift a Catastrophic Forgetting Alleviator in Class-Incremental Learning

Authors: Wei Chen, Yi Zhou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00237
Pdf URL: https://arxiv.org/pdf/2501.00237
Copy Paste: [[2501.00237]] Make Domain Shift a Catastrophic Forgetting Alleviator in Class-Incremental Learning(https://arxiv.org/abs/2501.00237)
Keywords: robust
Abstract: In the realm of class-incremental learning (CIL), alleviating the catastrophic forgetting problem is a pivotal challenge. This paper reports a counter-intuitive observation: by incorporating domain shift into CIL tasks, the forgetting rate is significantly reduced. Our comprehensive studies demonstrate that incorporating domain shift leads to a clearer separation in the feature distribution across tasks and helps reduce parameter interference during the learning process. Inspired by this observation, we propose a simple yet effective method named DisCo to deal with CIL tasks. DisCo introduces a lightweight prototype pool that utilizes contrastive learning to promote distinct feature distributions for the current task relative to previous ones, effectively mitigating interference across tasks. DisCo can be easily integrated into existing state-of-the-art class-incremental learning methods. Experimental results show that incorporating our method into various CIL methods achieves substantial performance improvements, validating the benefits of our approach in enhancing class-incremental learning by separating feature representation and reducing interference. These findings illustrate that DisCo can serve as a robust approach for future research in class-incremental learning.

Title: Exploring Variability in Fine-Tuned Models for Text Classification with DistilBERT

Authors: Giuliano Lorenzoni, Ivens Portugal, Paulo Alencar, Donald Cowan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00241
Pdf URL: https://arxiv.org/pdf/2501.00241
Copy Paste: [[2501.00241]] Exploring Variability in Fine-Tuned Models for Text Classification with DistilBERT(https://arxiv.org/abs/2501.00241)
Keywords: large language model
Abstract: This study evaluates fine-tuning strategies for text classification using the DistilBERT model, specifically the distilbert-base-uncased-finetuned-sst-2-english variant. Through structured experiments, we examine the influence of hyperparameters such as learning rate, batch size, and epochs on accuracy, F1-score, and loss. Polynomial regression analyses capture foundational and incremental impacts of these hyperparameters, focusing on fine-tuning adjustments relative to a baseline model. Results reveal variability in metrics due to hyperparameter configurations, showing trade-offs among performance metrics. For example, a higher learning rate reduces loss in relative analysis (p=0.027) but challenges accuracy improvements. Meanwhile, batch size significantly impacts accuracy and F1-score in absolute regression (p=0.028 and p=0.005) but has limited influence on loss optimization (p=0.170). The interaction between epochs and batch size maximizes F1-score (p=0.001), underscoring the importance of hyperparameter interplay. These findings highlight the need for fine-tuning strategies addressing non-linear hyperparameter interactions to balance performance across metrics. Such variability and metric trade-offs are relevant for tasks beyond text classification, including NLP and computer vision. This analysis informs fine-tuning strategies for large language models and promotes adaptive designs for broader model applicability.

Title: Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

Authors: Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00243
Pdf URL: https://arxiv.org/pdf/2501.00243
Copy Paste: [[2501.00243]] Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition(https://arxiv.org/abs/2501.00243)
Keywords: transformer
Abstract: Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes beyond by classifying sub-categories within a species such as cultivars of a plant. In recent times the usage of Vision Transformer-based backbones has allowed methods to obtain outstanding recognition performances in this task, but this comes at a significant cost in terms of computation, especially since this task significantly benefits from incorporating higher resolution images. Therefore, techniques such as token reduction have emerged to reduce the computational cost. However, dropping tokens leads to loss of essential information for fine-grained categories, especially as the token keep rate is reduced. Therefore, to counteract the loss of information brought by the usage of token reduction we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs cost for UFGIR by reducing the kept tokens to extremely low ratios of up to 10% while maintaining a competitive accuracy to state-of-the-art models. Code is available at: this https URL
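Sketch: the "keep rate" above is the fraction of tokens that survive a reduction step. A generic score-based variant can be illustrated as follows, assuming each token already has an importance score (for example, an attention weight). The function name `keep_top_tokens` and the rounding rule are assumptions for illustration; the abstract covers several token reduction methods and does not specify this one.

```python
def keep_top_tokens(scores, keep_rate):
    """Return the indices of the highest-scoring tokens, in original order.

    The number kept is keep_rate * len(scores), rounded to the nearest
    integer and never less than one token.
    """
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    k = max(1, round(keep_rate * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    scores = [0.2, 0.9, 0.1, 0.7, 0.4, 0.3, 0.8, 0.05, 0.6, 0.5]
    print(keep_top_tokens(scores, 0.5))
    print(keep_top_tokens(scores, 0.1))
```

At a 10% keep rate only one of the ten tokens remains, which makes concrete why the abstract argues that information from earlier layers needs to be recovered by other means.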

Title: Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking

Authors: Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Shaokai Chen, Mengshu Sun, Binbin Hu, Zhiqiang Zhang, Lei Liang, Wen Zhang, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00244
Pdf URL: https://arxiv.org/pdf/2501.00244
Copy Paste: [[2501.00244]] Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking(https://arxiv.org/abs/2501.00244)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated exceptional performance in text generation within current NLP research. However, the lack of factual accuracy is still a dark cloud hanging over the LLM skyscraper. Structural knowledge prompting (SKP) is a prominent paradigm to integrate external knowledge into LLMs by incorporating structural representations, achieving state-of-the-art results in many knowledge-intensive tasks. However, existing methods often focus on specific problems, lacking a comprehensive exploration of the generalization and capability boundaries of SKP. This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives including Granularity, Transferability, Scalability, and Universality. To provide a thorough evaluation, we introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty.

Title: EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta

Authors: Raymond Bernard, Shaina Raza (PhD), Subhabrata Das (PhD), Rahul Murugan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00257
Pdf URL: https://arxiv.org/pdf/2501.00257
Copy Paste: [[2501.00257]] EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta(https://arxiv.org/abs/2501.00257)
Keywords: robust, large language model
Abstract: Despite the remarkable coherence of Large Language Models (LLMs), existing evaluation methods often suffer from fluency bias and rely heavily on multiple-choice formats, making it difficult to assess factual accuracy and complex reasoning effectively. LLMs thus frequently generate factually inaccurate responses, especially in complex reasoning tasks, highlighting two prominent challenges: (1) the inadequacy of existing methods to evaluate reasoning and factual accuracy effectively, and (2) the reliance on human evaluators for nuanced judgment, as illustrated by Williams and Huckle (2024)[1], who found manual grading indispensable despite automated grading advancements. To address evaluation gaps in open-ended reasoning tasks, we introduce the EQUATOR Evaluator (Evaluation of Question Answering Thoroughness in Open-ended Reasoning). This framework combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment. Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations. In practice, EQUATOR significantly reduces reliance on human evaluators for scoring and improves scalability compared to Williams and Huckle's (2024)[1] methods. Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards. Additionally, we introduce an automated evaluation process leveraging smaller, locally hosted LLMs. We used LLaMA 3.2B, running on the Ollama binaries to streamline our assessments. This work establishes a new paradigm for evaluating LLM performance, emphasizing factual accuracy and reasoning ability, and provides a robust methodological foundation for future research.

Title: Detection and Prevention of Smishing Attacks

Authors: Diksha Goel
Subjects: cs.CR, cs.SI
Abstract URL: https://arxiv.org/abs/2501.00260
Pdf URL: https://arxiv.org/pdf/2501.00260
Copy Paste: [[2501.00260]] Detection and Prevention of Smishing Attacks(https://arxiv.org/abs/2501.00260)
Keywords: attack, steal
Abstract: Phishing is an online identity theft technique where attackers steal users' personal information, leading to financial losses for individuals and organizations. With the increasing adoption of smartphones, which provide functionalities similar to desktop computers, attackers are targeting mobile users. Smishing, a phishing attack carried out through Short Messaging Service (SMS), has become prevalent due to the widespread use of SMS-based services. It involves deceptive messages designed to extract sensitive information. Despite the growing number of smishing attacks, limited research focuses on detecting these threats. This work presents a smishing detection model using a content-based analysis approach. To address the challenge posed by slang, abbreviations, and short forms in text communication, the model normalizes these into standard forms. A machine learning classifier is employed to classify messages as smishing or ham. Experimental results demonstrate the model's effectiveness, achieving classification accuracies of 97.14% for smishing and 96.12% for ham messages, with an overall accuracy of 96.20%.
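Sketch: the normalization step described above, which maps slang and abbreviations to standard forms before classification, could look roughly like the following. The lookup table `SLANG` is a tiny hypothetical example, not the dictionary used in the paper, and the classifier itself is omitted.

```python
import re

# Hypothetical abbreviation table; a real system would use a larger lexicon.
SLANG = {
    "u": "you",
    "ur": "your",
    "pls": "please",
    "acct": "account",
    "msg": "message",
}


def normalize(text):
    """Lowercase the message, split it into word tokens, and expand slang."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(SLANG.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(normalize("Pls verify ur acct now!"))
```

Normalizing first means the downstream classifier sees one spelling per word, so features learned from "please" also apply to messages that wrote "pls".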

Title: Collaborative Approaches to Enhancing Smart Vehicle Cybersecurity by AI-Driven Threat Detection

Authors: Syed Atif Ali, Salwa Din
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00261
Pdf URL: https://arxiv.org/pdf/2501.00261
Copy Paste: [[2501.00261]] Collaborative Approaches to Enhancing Smart Vehicle Cybersecurity by AI-Driven Threat Detection(https://arxiv.org/abs/2501.00261)
Keywords: secure, security, robust
Abstract: The introduction sets the stage for exploring collaborative approaches to bolstering smart vehicle cybersecurity through AI-driven threat detection. As the automotive industry increasingly adopts connected and automated vehicles (CAVs), the need for robust cybersecurity measures becomes paramount. With the emergence of new vulnerabilities and security requirements, the integration of advanced technologies such as 5G networks, blockchain, and quantum computing presents promising avenues for enhancing CAV cybersecurity. Additionally, the roadmap for cybersecurity in autonomous vehicles emphasizes the importance of efficient intrusion detection systems and AI-based techniques, along with the integration of secure hardware, software stacks, and advanced threat intelligence to address cybersecurity challenges in future autonomous vehicles.

Title: Enhancing Wireless Sensor Network Security through Integration with the ServiceNow Cloud Platform

Authors: Syed Atif Ali, Salwa Din
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00264
Pdf URL: https://arxiv.org/pdf/2501.00264
Copy Paste: [[2501.00264]] Enhancing Wireless Sensor Network Security through Integration with the ServiceNow Cloud Platform(https://arxiv.org/abs/2501.00264)
Keywords: secure, security, attack
Abstract: Wireless Sensor Networks (WSNs) continue to experience rapid developments and integration into modern-day applications. Overall, WSNs collect and process relevant data through sensors or nodes and communicate with different networks for superior information management. Nevertheless, a primary concern relative to WSNs is security. Considering the high constraints on throughput, battery, processing power, and memory, typical security procedures present limitations for application in WSNs. This research focuses on the integration of WSNs with the cloud platform, specifically to address these security risks. The cloud platform also adopts a security-driven approach and has attracted many applications across various sectors globally. This research specifically explores how cloud computing could be exploited to impede Denial of Service attacks from endangering WSNs. WSNs are now deployed in various low-powered applications, including disaster management, homeland security, battlefield surveillance, agriculture, and the healthcare industry. WSNs are distinguished from traditional networks by the numerous wireless connected sensors being deployed to conduct an assigned task. In testing scenarios, the size of WSNs ranges from a few to several thousand. The overarching requirements of WSNs include rapid processing of collected data, low-cost installation and maintenance, and low latency in network operations. Given that a substantial amount of WSN applications are used in high-risk and volatile environments, they must effectively address security concerns. This includes the secure movement, storage, and communication of data through networks, an environment in which WSNs are notably vulnerable. The limitations of WSNs have meant that they are predominantly used in unsecured applications despite positive advancements. This study explores methods for integrating the WSN with the cloud.

Title: Outlier-Robust Training of Machine Learning Models

Authors: Rajat Talak, Charis Georgiou, Jingnan Shi, Luca Carlone
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00265
Pdf URL: https://arxiv.org/pdf/2501.00265
Copy Paste: [[2501.00265]] Outlier-Robust Training of Machine Learning Models(https://arxiv.org/abs/2501.00265)
Keywords: robust
Abstract: Robust training of machine learning models in the presence of outliers has garnered attention across various domains. The use of robust losses is a popular approach and is known to mitigate the impact of outliers. We bring to light two literatures that have diverged in their ways of designing robust losses: one using M-estimation, which is popular in robotics and computer vision, and another using a risk-minimization framework, which is popular in deep learning. We first show that a simple modification of the Black-Rangarajan duality provides a unifying view. The modified duality brings out a definition of a robust loss kernel $\sigma$ that is satisfied by robust losses in both literatures. Secondly, using the modified duality, we propose an Adaptive Alternation Algorithm (AAA) for training machine learning models with outliers. The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each iteration. The algorithm is augmented with a novel parameter update rule by interpreting the weights as inlier probabilities, and obviates the need for complex parameter tuning. Thirdly, we investigate convergence of the adaptive alternation algorithm to outlier-free optima. Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels $\sigma$ increases the region of convergence. We experimentally show the efficacy of our algorithm on regression, classification, and neural scene reconstruction problems. We release our implementation code: this https URL.
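Sketch: the alternation described above (minimize a weighted non-robust loss, then update the weights) can be illustrated on the simplest possible model, a scalar mean under squared error. The weight rule below is the classical Geman-McClure weight with a fixed scale `c`, swapped in for illustration; it is not the paper's adaptive parameter update rule, and `robust_mean` is a hypothetical name.

```python
def robust_mean(xs, c=1.0, iters=50):
    """Estimate a location parameter by alternating two steps.

    Step 1: update weights from residuals (Geman-McClure, fixed scale c).
    Step 2: solve the weighted least-squares problem, here a weighted mean.
    """
    if not xs:
        raise ValueError("need at least one sample")
    mu = sum(xs) / len(xs)  # start from the non-robust estimate
    for _ in range(iters):
        # Weights lie in (0, 1]; large residuals get weights near zero.
        w = [(c * c / (c * c + (x - mu) ** 2)) ** 2 for x in xs]
        mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
    return mu


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]  # last value is an outlier
    print(sum(data) / len(data), robust_mean(data))
```

The weights behave like soft inlier indicators: after a few iterations the outlier at 50 contributes almost nothing, and the estimate settles near the inlier cluster.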

Title: A review of faithfulness metrics for hallucination assessment in Large Language Models

Authors: Ben Malin, Tatiana Kalganova, Nikoloas Boulgouris
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00269
Pdf URL: https://arxiv.org/pdf/2501.00269
Copy Paste: [[2501.00269]] A review of faithfulness metrics for hallucination assessment in Large Language Models(https://arxiv.org/abs/2501.00269)
Keywords: large language model
Abstract: This review examines the means by which faithfulness has been evaluated across open-ended summarization, question-answering and machine translation tasks. We find that the use of LLMs as a faithfulness evaluator is commonly the metric that is most highly correlated with human judgement. The means by which other studies have mitigated hallucinations are discussed, with both retrieval augmented generation (RAG) and prompting framework approaches having been linked with superior faithfulness, whilst other recommendations for mitigation are provided. Research into faithfulness is integral to the continued widespread use of LLMs, as unfaithful responses can pose major risks to many areas where LLMs would otherwise be suitable. Furthermore, evaluating open-ended generation provides a more comprehensive measure of LLM performance than commonly used multiple-choice benchmarking, which can help in advancing the trust that can be placed within LLMs.

Title: Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs

Authors: Weijia Xu, Nebojsa Jojic, Sudha Rao, Chris Brockett, Bill Dolan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00273
Pdf URL: https://arxiv.org/pdf/2501.00273
Copy Paste: [[2501.00273]] Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs(https://arxiv.org/abs/2501.00273)
Keywords: large language model
Abstract: With rapid advances in large language models (LLMs), there has been an increasing application of LLMs in creative content ideation and generation. A critical question emerges: can current LLMs provide ideas that are diverse enough to truly bolster the collective creativity? We examine two state-of-the-art LLMs, GPT-4 and LLaMA-3, on story generation and discover that LLM-generated stories often consist of plot elements that are echoed across a number of generations. To quantify this phenomenon, we introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines generated by the same LLM. Evaluating on 100 short stories, we find that LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations, while the original human-written stories are rarely recreated or even echoed in pieces. Moreover, our human evaluation shows that the ranking of Sui Generis scores among story segments correlates moderately with human judgment of surprise level, even though score computation is completely automatic without relying on human judgment.

Title: LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Authors: Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00274
Pdf URL: https://arxiv.org/pdf/2501.00274
Copy Paste: [[2501.00274]] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts(https://arxiv.org/abs/2501.00274)
Keywords: large language model
Abstract: This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.

Title: ReFormer: Generating Radio Fakes for Data Augmentation

Authors: Yagna Kaasaragadda, Silvija Kokalj-Filipovic
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2501.00282
Pdf URL: https://arxiv.org/pdf/2501.00282
Copy Paste: [[2501.00282]] ReFormer: Generating Radio Fakes for Data Augmentation(https://arxiv.org/abs/2501.00282)
Keywords: transformer, generative
Abstract: We present ReFormer, a generative AI (GAI) model that can efficiently generate synthetic radio-frequency (RF) data, or RF fakes, statistically similar to the data it was trained on, or with modified statistics, in order to augment datasets collected in real-world experiments. For applications like this, adaptability and scalability are important issues. This is why ReFormer leverages transformer-based autoregressive generation, trained on learned discrete representations of RF signals. By using prompts, such GAI can be made to generate the data which complies with specific constraints or conditions, particularly useful for training channel estimation and modeling. It may also leverage the data from a source system to generate training data for a target system. We show how different transformer architectures and other design choices affect the quality of generated RF fakes, evaluated using metrics such as precision and recall, classification accuracy and signal constellation diagrams.

Title: Dual Diffusion for Unified Image Generation and Understanding

Authors: Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00289
Pdf URL: https://arxiv.org/pdf/2501.00289
Copy Paste: [[2501.00289]] Dual Diffusion for Unified Image Generation and Understanding(https://arxiv.org/abs/2501.00289)
Keywords: diffusion, transformer
Abstract: Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind with visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation that significantly improves on existing diffusion-based multimodal models, and is the first of its kind to support the full suite of vision-language modeling capabilities. Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable of a wide range of tasks including image generation, captioning, and visual question answering. Our model attained competitive performance compared to recent unified image understanding and generation models, demonstrating the potential of multimodal diffusion modeling as a promising alternative to autoregressive next-token prediction models.

Title: Research on vehicle detection based on improved YOLOv8 network

Authors: Haocheng Guo, Yaqiong Zhang, Lieyang Chen, Arfat Ahmad Khan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00300
Pdf URL: https://arxiv.org/pdf/2501.00300
Copy Paste: [[2501.00300]] Research on vehicle detection based on improved YOLOv8 network(https://arxiv.org/abs/2501.00300)
Keywords: segmentation
Abstract: The key to ensuring the safe obstacle avoidance function of autonomous driving systems lies in the use of extremely accurate vehicle recognition techniques. However, the variability of the actual road environment and the diverse characteristics of vehicles and pedestrians together constitute a huge obstacle to improving detection accuracy, posing a serious challenge to the realization of this goal. To address the above issues, this paper proposes an improved YOLOv8 vehicle detection method. Specifically, taking the YOLOv8n-seg model as the base model, firstly, the FasterNet network is used to replace the backbone network to achieve the purpose of reducing the computational complexity and memory while improving the detection accuracy and speed; secondly, the feature enhancement is achieved by adding the attention mechanism CBAM to the Neck; and lastly, the loss function CIoU is modified to WIoU, which optimizes the detection box localization while improving the segmentation accuracy. The results show that the improved model achieves 98.3%, 89.1% and 88.4% detection accuracy for car, person, and motorcycle. Compared with the pre-improvement and YOLOv9 models in six metrics such as Precision.
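Sketch: both CIoU and WIoU, the box-regression losses named above, build on the plain intersection-over-union (IoU) between a predicted and a ground-truth box. The snippet below computes only that base quantity for axis-aligned boxes given as `(x1, y1, x2, y2)`; the extra penalty and weighting terms that distinguish CIoU and WIoU are deliberately not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A loss of the form `1 - iou(pred, target)` is zero for a perfect match and one for disjoint boxes; the variants named in the abstract add further terms on top of this quantity.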

Title: SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation

Authors: Shi-Feng Peng, Guolei Sun, Yong Li, Hongsong Wang, Guo-Sen Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00303
Pdf URL: https://arxiv.org/pdf/2501.00303
Copy Paste: [[2501.00303]] SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation(https://arxiv.org/abs/2501.00303)
Keywords: segmentation
Abstract: The primary challenge of cross-domain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, this may lead to visual prompts representing the same semantic object having inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parameter adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results. Code: this https URL.

Title: diffIRM: A Diffusion-Augmented Invariant Risk Minimization Framework for Spatiotemporal Prediction over Graphs

Authors: Zhaobin Mo, Haotian Xiang, Xuan Di
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00305
Pdf URL: https://arxiv.org/pdf/2501.00305
Copy Paste: [[2501.00305]] diffIRM: A Diffusion-Augmented Invariant Risk Minimization Framework for Spatiotemporal Prediction over Graphs(https://arxiv.org/abs/2501.00305)
Keywords: diffusion
Abstract: Spatiotemporal prediction over graphs (STPG) is challenging, because real-world data suffers from the Out-of-Distribution (OOD) generalization problem, where test data follow different distributions from training ones. To address this issue, Invariant Risk Minimization (IRM) has emerged as a promising approach for learning invariant representations across different environments. However, IRM and its variants are originally designed for Euclidean data like images, and may not generalize well to graph-structure data such as spatiotemporal graphs due to spatial correlations in graphs. To overcome the challenge posed by graph-structure data, the existing graph OOD methods adhere to the principles of invariance existence, or environment diversity. However, there is little research that combines both principles in the STPG problem. A combination of the two is crucial for efficiently distinguishing between invariant features and spurious ones. In this study, we fill in this research gap and propose a diffusion-augmented invariant risk minimization (diffIRM) framework that combines these two principles for the STPG problem. Our diffIRM contains two processes: i) data augmentation and ii) invariant learning. In the data augmentation process, a causal mask generator identifies causal features and a graph-based diffusion model acts as an environment augmentor to generate augmented spatiotemporal graph data. In the invariant learning process, an invariance penalty is designed using the augmented data, and then serves as a regularizer for training the spatiotemporal prediction model. The real-world experiment uses three human mobility datasets, i.e. SafeGraph, PeMS04, and PeMS08. Our proposed diffIRM outperforms baselines.

Title: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Authors: Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00321
Pdf URL: https://arxiv.org/pdf/2501.00321
Copy Paste: [[2501.00321]] OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning(https://arxiv.org/abs/2501.00321)
Keywords: extraction
Abstract: Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at this https URL.

Title: OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies

Authors: Runnan Chen, Xiangyu Sun, Zhaoqing Wang, Youquan Liu, Jiepeng Wang, Lingdong Kong, Jiankang Deng, Mingming Gong, Liang Pan, Wenping Wang, Tongliang Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00326
Pdf URL: https://arxiv.org/pdf/2501.00326
Copy Paste: [[2501.00326]] OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies(https://arxiv.org/abs/2501.00326)
Keywords: robust, segmentation
Abstract: Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting open-vocabulary querying to their training scenes and leaving them unable to generalize to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train a 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (this https URL).

Title: Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion

Authors: Hebin Wang, Yangning Li, Yinghui Li, Hai-Tao Zheng, Wenhao Jiang, Hong-Gee Kim
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.00330
Pdf URL: https://arxiv.org/pdf/2501.00330
Copy Paste: [[2501.00330]] Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion(https://arxiv.org/abs/2501.00330)
Keywords: generative, large language model
Abstract: The rapid development of multimodal large language models (MLLMs) has brought significant improvements to a wide range of tasks in real-world applications. However, LLMs still exhibit certain limitations in extracting implicit semantic information. In this paper, we apply MLLMs to the Multi-modal Entity Set Expansion (MESE) task, which aims to expand a handful of seed entities with new entities belonging to the same semantic class, and multi-modal information is provided with each entity. We explore the capabilities of MLLMs to understand implicit semantic information at the entity-level granularity through the MESE task, introducing a listwise ranking method LUSAR that maps local scores to global rankings. Our LUSAR demonstrates significant improvements in MLLM's performance on the MESE task, marking the first use of generative MLLM for ESE tasks and extending the applicability of listwise ranking.

Title: MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation

Authors: Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, Na Zou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.00332
Pdf URL: https://arxiv.org/pdf/2501.00332
Copy Paste: [[2501.00332]] MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation(https://arxiv.org/abs/2501.00332)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, the existing RAG systems frequently struggle with the quality of retrieval documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2-11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
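
As a rough illustration of the adaptive filtering idea in this abstract, the sketch below derives a relevance threshold from the distribution of per-document scores instead of fixing it in advance. The specific rule (mean minus a multiple of the standard deviation), the function name `adaptive_filter`, and the parameter `n_std` are assumptions made for illustration; the abstract does not state the exact rule.

```python
from statistics import mean, pstdev

def adaptive_filter(scored_docs, n_std=0.5):
    """Keep documents whose score clears a threshold set by the score distribution.

    scored_docs: list of (doc_id, relevance_score) pairs (must be non-empty).
    n_std: hypothetical knob; larger values lower the bar and keep more documents.
    """
    scores = [s for _, s in scored_docs]
    threshold = mean(scores) - n_std * pstdev(scores)
    kept = [(d, s) for d, s in scored_docs if s >= threshold]
    # Highest-scoring documents first, so the generator sees the best evidence early.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept, threshold

# Toy scores, e.g. averaged judgments from several agents (made-up values).
docs = [("d1", 0.9), ("d2", 0.8), ("d3", 0.1), ("d4", -0.6)]
kept, threshold = adaptive_filter(docs)
```

With these toy scores the threshold lands just below zero, so the clearly negative document is dropped while the borderline one is kept; a tighter score distribution would raise the bar automatically.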

Title: Rethinking Layer Removal: Preserving Critical Components with Task-Aware Singular Value Decomposition

Authors: Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang, Jing Xiao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00339
Pdf URL: https://arxiv.org/pdf/2501.00339
Copy Paste: [[2501.00339]] Rethinking Layer Removal: Preserving Critical Components with Task-Aware Singular Value Decomposition(https://arxiv.org/abs/2501.00339)
Keywords: large language model
Abstract: Layer removal has emerged as a promising approach for compressing large language models (LLMs) by leveraging redundancy within layers to reduce model size and accelerate inference. However, this technique often compromises internal consistency, leading to performance degradation and instability, with varying impacts across different model architectures. In this work, we propose Taco-SVD, a task-aware framework that retains task-critical singular value directions, preserving internal consistency while enabling efficient compression. Unlike direct layer removal, Taco-SVD preserves task-critical transformations to mitigate performance degradation. By leveraging gradient-based attribution methods, Taco-SVD aligns singular values with downstream task objectives. Extensive evaluations demonstrate that Taco-SVD outperforms existing methods in perplexity and task performance across different architectures while ensuring minimal computational overhead.

Title: Chunk-Distilled Language Modeling

Authors: Yanhong Li, Karen Livescu, Jiawei Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00343
Pdf URL: https://arxiv.org/pdf/2501.00343
Copy Paste: [[2501.00343]] Chunk-Distilled Language Modeling(https://arxiv.org/abs/2501.00343)
Keywords: large language model
Abstract: We introduce Chunk-Distilled Language Modeling (CD-LM), an approach to text generation that addresses two challenges in current large language models (LLMs): the inefficiency of token-level generation, and the difficulty of adapting to new data and knowledge. Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step. Our retrieval framework enables flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of existing models, or incorporating expert insights from human-annotated corpora. This adaptability allows for enhanced control over the language model's distribution without necessitating additional training. We present the CD-LM formulation along with performance metrics demonstrating its ability to improve language model performance and efficiency across a diverse set of downstream tasks. Code and data will be made publicly available.

Title: PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM

Authors: Runnan Chen, Zhaoqing Wang, Jiepeng Wang, Yuexin Ma, Mingming Gong, Wenping Wang, Tongliang Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.00352
Pdf URL: https://arxiv.org/pdf/2501.00352
Copy Paste: [[2501.00352]] PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM(https://arxiv.org/abs/2501.00352)
Keywords: segmentation
Abstract: Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video. (this https URL)

Title: RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Authors: Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00353
Pdf URL: https://arxiv.org/pdf/2501.00353
Copy Paste: [[2501.00353]] RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions(https://arxiv.org/abs/2501.00353)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only limited RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at this https URL.

Title: A New Dataset and Methodology for Malicious URL Classification

Authors: Ilan Schvartzman, Roei Sarussi, Maor Ashkenazi, Ido kringel, Yaniv Tocker, Tal Furman Shohet
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2501.00356
Pdf URL: https://arxiv.org/pdf/2501.00356
Copy Paste: [[2501.00356]] A New Dataset and Methodology for Malicious URL Classification(https://arxiv.org/abs/2501.00356)
Keywords: security, defense
Abstract: Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce a novel, multi-class dataset for malicious URL classification, distinguishing between benign, phishing and malicious URLs, named DeepURLBench. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models, as compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers, applying these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.

Title: A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images

Authors: Dawen Yu, Shunping Ji
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00360
Pdf URL: https://arxiv.org/pdf/2501.00360
Copy Paste: [[2501.00360]] A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images(https://arxiv.org/abs/2501.00360)
Keywords: transformer, segmentation
Abstract: Instance segmentation performance in remote sensing images (RSIs) is significantly affected by two issues: how to extract accurate boundaries of objects from remote imaging through the dynamic atmosphere, and how to integrate the mutual information of related object instances scattered over a vast spatial region. In this study, we propose a novel Shape Guided Transformer Network (SGTN) to accurately extract objects at the instance level. Inspired by the global contextual modeling capacity of the self-attention mechanism, we propose an effective transformer encoder termed LSwin, which incorporates vertical and horizontal 1D global self-attention mechanisms to obtain better global-perception capacity for RSIs than the popular local-shifted-window based Swin Transformer. To achieve accurate instance mask segmentation, we introduce a shape guidance module (SGM) to emphasize the object boundary and shape information. The combination of SGM, which emphasizes local detail information, and LSwin, which focuses on global context relationships, achieves excellent RSI instance segmentation. Their effectiveness was validated through comprehensive ablation experiments. In particular, LSwin proves better than the popular ResNet and Swin Transformer encoders at the same level of efficiency. Compared to other instance segmentation methods, our SGTN achieves the highest average precision (AP) scores on two single-class public datasets (WHU dataset and BITCC dataset) and a multi-class public dataset (NWPU VHR-10 dataset). Code will be available at this http URL.

Title: SPDZCoder: Teaching LLMs to Synthesize Privacy Computing Code without Massive Training Data

Authors: Xiaoning Dong, Peilin Xin, Wei Xu
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2501.00363
Pdf URL: https://arxiv.org/pdf/2501.00363
Copy Paste: [[2501.00363]] SPDZCoder: Teaching LLMs to Synthesize Privacy Computing Code without Massive Training Data(https://arxiv.org/abs/2501.00363)
Keywords: privacy, large language model
Abstract: Privacy computing is receiving increasing attention, but writing privacy computing code remains challenging for developers, both because limited library functions necessitate extensive function implementation from scratch and because the data-oblivious requirement contradicts the intuitive thinking and usual practices of programmers. Large language models (LLMs) have demonstrated surprising capabilities in coding tasks and achieved state-of-the-art performance across many benchmarks. However, even with extensive prompting, existing LLMs struggle with code translation tasks for privacy computing, such as translating Python to MP-SPDZ, due to the scarcity of MP-SPDZ data required for effective pre-training or fine-tuning. To address this limitation, this paper proposes SPDZCoder, a rule-based framework that teaches LLMs to synthesize privacy computing code without asking experts to write large amounts of code, by leveraging the instruction-following and in-context learning ability of LLMs. Specifically, SPDZCoder decouples the translation task into a refactoring stage and a generation stage, which can mitigate the semantic-expressing differences at different levels. In addition, SPDZCoder can further improve its performance through a feedback stage. SPDZCoder does not require fine-tuning since it adopts an in-context learning paradigm of LLMs. To evaluate SPDZCoder, we manually created a benchmark dataset, named SPDZEval, containing six classes of tasks that are difficult to implement in MP-SPDZ. We conduct experiments on SPDZEval, and the experimental results show that SPDZCoder achieves state-of-the-art performance in pass@1 and pass@2 across six data splits. Specifically, SPDZCoder achieves an overall correctness of 85.94% and 92.01% in pass@1 and pass@2, respectively, significantly surpassing baselines (at most 30.35% and 49.84% in pass@1 and pass@2, respectively).
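
For readers unfamiliar with the pass@1 and pass@2 figures quoted in this abstract, the sketch below shows the unbiased pass@k estimator that is widely used in the code-generation literature. Whether the SPDZEval evaluation uses exactly this estimator is an assumption; the abstract does not say, and the sample counts below are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations (c of them correct), passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any draw of k includes a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 10 generations, 3 of which pass the tests.
p1 = pass_at_k(10, 3, 1)
p2 = pass_at_k(10, 3, 2)
```

A benchmark-level score would then be the mean of these per-task values over all tasks.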

Title: Low-Rank Adaptation for Foundation Models: A Comprehensive Review

Authors: Menglin Yang, Jialin Chen, Yifei Zhang, Jiahong Liu, Jiasheng Zhang, Qiyao Ma, Harshit Verma, Qianru Zhang, Min Zhou, Irwin King, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00365
Pdf URL: https://arxiv.org/pdf/2501.00365
Copy Paste: [[2501.00365]] Low-Rank Adaptation for Foundation Models: A Comprehensive Review(https://arxiv.org/abs/2501.00365)
Keywords: robust, large language model
Abstract: The rapid advancement of foundation models, large-scale neural networks trained on diverse, extensive datasets, has revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large language models to general foundation models, including recent techniques foundations, emerging frontiers and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.

Title: Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

Authors: Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00375
Pdf URL: https://arxiv.org/pdf/2501.00375
Copy Paste: [[2501.00375]] Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free(https://arxiv.org/abs/2501.00375)
Keywords: diffusion, generative
Abstract: Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamic tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivered a 9× speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observed a 7× acceleration coupled with a notable FID reduction of 2.17.
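
A minimal sketch of the selection step this abstract describes, assuming a token's "dynamics" is measured as the L2 change in its feature vector between two adjacent timesteps. That metric, the `keep_ratio` parameter, and the function names are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt, ceil

def token_dynamics(prev_feats, curr_feats):
    """L2 distance between each token's features at two adjacent timesteps."""
    return [
        sqrt(sum((c - p) ** 2 for p, c in zip(prev_tok, curr_tok)))
        for prev_tok, curr_tok in zip(prev_feats, curr_feats)
    ]

def select_high_dynamic_tokens(prev_feats, curr_feats, keep_ratio=0.5):
    """Return sorted indices of the tokens that changed the most.

    Only these tokens would be recomputed (e.g. passed to self-attention);
    the remaining tokens would reuse cached features.
    """
    dyn = token_dynamics(prev_feats, curr_feats)
    n_keep = max(1, ceil(len(dyn) * keep_ratio))
    ranked = sorted(range(len(dyn)), key=lambda i: dyn[i], reverse=True)
    return sorted(ranked[:n_keep])

# Four toy tokens with 2-dimensional features (made-up values).
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr = [[0.0, 0.1], [1.0, 4.0], [2.5, 2.0], [6.0, 7.0]]
kept = select_high_dynamic_tokens(prev, curr, keep_ratio=0.5)
```

Here tokens 1 and 3 move the most between timesteps, so they are the ones selected for recomputation.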

Title: Federated Dropout: Convergence Analysis and Resource Allocation

Authors: Sijing Xie, Dingzhu Wen, Xiaonan Liu, Changsheng You, Tharmalingam Ratnarajah, Kaibin Huang
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2501.00379
Pdf URL: https://arxiv.org/pdf/2501.00379
Copy Paste: [[2501.00379]] Federated Dropout: Convergence Analysis and Resource Allocation(https://arxiv.org/abs/2501.00379)
Keywords: federate
Abstract: Federated Dropout is an efficient technique to overcome both communication and computation bottlenecks for deploying federated learning at the network edge. In each training round, an edge device only needs to update and transmit a sub-model, which is generated by the typical method of dropout in deep learning, and thus effectively reduces the per-round latency. However, the theoretical convergence analysis for Federated Dropout is still lacking in the literature, particularly regarding the quantitative influence of dropout rate on convergence. To address this issue, by using the Taylor expansion method, we mathematically show that the gradient variance increases with a scaling factor of $\gamma/(1-\gamma)$, with $\gamma \in [0, \theta)$ denoting the dropout rate and $\theta$ being the maximum dropout rate ensuring the loss function reduction. Based on the above approximation, we provide the convergence analysis for Federated Dropout. Specifically, it is shown that a larger dropout rate of each device leads to a slower convergence rate. This provides a theoretical foundation for reducing the convergence latency by making a tradeoff between the per-round latency and the overall rounds till convergence. Moreover, a low-complexity algorithm is proposed to jointly optimize the dropout rate and the bandwidth allocation for minimizing the loss function in all rounds under a given per-round latency and limited network resources. Finally, numerical results are provided to verify the effectiveness of the proposed algorithm.
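
The scaling factor $\gamma/(1-\gamma)$ can be checked on a single gradient coordinate under one common assumption: inverted dropout, where a coordinate is zeroed with probability $\gamma$ and otherwise rescaled by $1/(1-\gamma)$ so that its expectation is unchanged. The sketch below is a simplified illustration of why that factor appears, not the paper's Taylor-expansion argument; the function name is hypothetical.

```python
from fractions import Fraction

def dropout_variance_factor(gamma):
    """Variance of m / (1 - gamma), where m ~ Bernoulli(1 - gamma).

    Multiplying a fixed gradient coordinate g by this random factor gives an
    unbiased estimate of g with variance g**2 * gamma / (1 - gamma).
    """
    gamma = Fraction(gamma)
    keep = 1 - gamma
    # Possible values of the rescaled mask, mapped to their probabilities.
    scaled_values = {Fraction(0): gamma, 1 / keep: keep}
    mean = sum(v * p for v, p in scaled_values.items())
    return sum(p * (v - mean) ** 2 for v, p in scaled_values.items())

factor = dropout_variance_factor(Fraction(1, 4))
```

Exact rational arithmetic keeps the check free of floating-point noise: for a dropout rate of 1/4 the factor is exactly 1/3, and it grows without bound as the rate approaches 1, which matches the abstract's point that larger dropout rates slow convergence.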

Title: Efficient Relational Context Perception for Knowledge Graph Completion

Authors: Wenkai Tu, Guojia Wan, Zhengchun Shang, Bo Du
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00397
Pdf URL: https://arxiv.org/pdf/2501.00397
Copy Paste: [[2501.00397]] Efficient Relational Context Perception for Knowledge Graph Completion(https://arxiv.org/abs/2501.00397)
Keywords: robust, transformer
Abstract: Knowledge Graphs (KGs) provide a structured representation of knowledge but often suffer from incompleteness. To address this, link prediction or knowledge graph completion (KGC) aims to infer missing facts based on existing facts in KGs. Previous knowledge graph embedding models are limited in their ability to capture expressive features, especially when compared to deeper, multi-layer models. These approaches also assign a single static embedding to each entity and relation, disregarding the fact that entities and relations can exhibit different behaviors in varying graph contexts. Because of the complex context around a fact triple in a KG, existing methods have to rely on complex non-linear context encoders, such as transformers, to project entities and relations into low-dimensional representations, resulting in high computation cost. To overcome these limitations, we propose the Triple Receptance Perception (TRP) architecture to model sequential information, enabling the learning of dynamic context of entities and relations. Then we use tensor decomposition to calculate triple scores, providing robust relational decoding capabilities. This integration allows for more expressive representations. Experiments on benchmark datasets such as YAGO3-10, UMLS, FB15k, and FB13 in link prediction and triple classification tasks demonstrate that our method performs better than several state-of-the-art models, proving the effectiveness of the integration.
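
To make "tensor decomposition to calculate triple scores" concrete, the sketch below uses a DistMult-style trilinear product, one of the simplest decomposition-based scoring functions for (head, relation, tail) triples. The choice of DistMult and the toy embeddings are assumptions; the abstract does not name the decomposition that TRP uses.

```python
def trilinear_score(head, relation, tail):
    """DistMult-style score: sum over i of head[i] * relation[i] * tail[i]."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))

def rank_tails(head, relation, candidates):
    """Order candidate tail entities from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: trilinear_score(head, relation, candidates[name]),
        reverse=True,
    )

# Toy 3-dimensional embeddings (made-up values).
head = [1.0, 0.0, 2.0]
relation = [0.5, 1.0, 1.0]
candidates = {
    "a": [1.0, 1.0, 1.0],
    "b": [0.0, 5.0, 0.0],
    "c": [2.0, 0.0, -1.0],
}
ranking = rank_tails(head, relation, candidates)
```

In link prediction, a ranking like this over all candidate entities is what metrics such as mean reciprocal rank and Hits@k are computed from.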

Title: Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Authors: Martin Pawelczyk, Lillian Sun, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00418
Pdf URL: https://arxiv.org/pdf/2501.00418
Copy Paste: [[2501.00418]] Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models(https://arxiv.org/abs/2501.00418)
Keywords: privacy, robust, fair, generative, large language model
Abstract: The rapid proliferation of generative AI, especially large language models, has led to their integration into a variety of applications. A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model's outputs surpasses the weak model in task performance - has gained significant attention. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. In this work, we study this question by examining if a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs, a process we term weak-to-strong trustworthiness generalization. To address this, we introduce two foundational training strategies: 1) Weak Trustworthiness Finetuning (Weak TFT), which leverages trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends regularization to both weak and strong models. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial, and OOD robustness, show significant improvement in transfer when both models were regularized, others like privacy do not exhibit signs of weak-to-strong trustworthiness. As the first study to explore trustworthiness generalization via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of weak-to-strong generalization.

Title: KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning

Authors: Fangchen Yu, Ruilizhen Hu, Yidong Lin, Yuqi Ma, Zhenghao Huang, Wenye Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00420
Pdf URL: https://arxiv.org/pdf/2501.00420
Copy Paste: [[2501.00420]] KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning(https://arxiv.org/abs/2501.00420)
Keywords: interpretability
Abstract: The Kolmogorov-Arnold Network (KAN) has recently gained attention as an alternative to traditional multi-layer perceptrons (MLPs), offering improved accuracy and interpretability by employing learnable activation functions on edges. In this paper, we introduce the Kolmogorov-Arnold Auto-Encoder (KAE), which integrates KAN with autoencoders (AEs) to enhance representation learning for retrieval, classification, and denoising tasks. Leveraging the flexible polynomial functions in KAN layers, KAE captures complex data patterns and non-linear relationships. Experiments on benchmark datasets demonstrate that KAE improves latent representation quality, reduces reconstruction errors, and achieves superior performance in downstream tasks such as retrieval, classification, and denoising, compared to standard autoencoders and other KAN variants. These results suggest KAE's potential as a useful tool for representation learning. Our code is available at this https URL.

Title: Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages

Authors: Or Haim Anidjar, Revital Marbel, Roi Yozevitch
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.00425
Pdf URL: https://arxiv.org/pdf/2501.00425
Copy Paste: [[2501.00425]] Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages(https://arxiv.org/abs/2501.00425)
Keywords: robust, transformer
Abstract: Approaching Speech-to-Text and Automatic Speech Recognition problems in low-resource languages is notoriously challenging due to the scarcity of validated datasets and the diversity of dialects. Arabic, Russian, and Portuguese exemplify these difficulties, being low-resource languages due to the many dialects of these languages across different continents worldwide. Moreover, the variety of accents and pronunciations of such languages complicates ASR models' success. With the increasing popularity of Deep Learning and Transformers, acoustic models like the renowned Wav2Vec2 have achieved superior performance in the Speech Recognition field compared to state-of-the-art approaches. However, despite Wav2Vec2's improved efficiency over traditional methods, its performance significantly declines for under-represented languages, even though it requires significantly less labeled data. This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. To validate our framework's effectiveness, we conducted a detailed experimental evaluation using three datasets from Mozilla's Common Voice project in Arabic, Russian, and Portuguese. Additionally, the framework presented in this paper demonstrates robustness to different diacritics. Ultimately, our approach outperforms two previous baseline models, which are the pre-trained Wav2Vec2 and the well-known Whisper ASR model, resulting in an average relative improvement of 33.9% in Word Error Rate and a 53.2% relative improvement in Character Error Rate.
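
Illustration (not from the paper): the reported gains are relative improvements in Word Error Rate (WER) and Character Error Rate (CER). The sketch below shows how these quantities are commonly computed from an edit distance. The function names and sample sentences are hypothetical, and the character-level variant here counts spaces as characters, which is only one of several conventions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over the raw strings (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_improvement(baseline: float, new: float) -> float:
    """Fraction by which `new` lowers an error rate relative to `baseline`."""
    return (baseline - new) / baseline


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(relative_improvement(0.30, 0.20))
```

A relative improvement of 33.9% therefore means the error rate fell by about a third of the baseline's value, not by 33.9 absolute percentage points.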

Title: Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents

Authors: Chengbo He, Bochao Zou, Xin Li, Jiansheng Chen, Junliang Xing, Huimin Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00430
Pdf URL: https://arxiv.org/pdf/2501.00430
Copy Paste: [[2501.00430]] Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents(https://arxiv.org/abs/2501.00430)
Keywords: large language model
Abstract: Agents have demonstrated their potential in scientific reasoning tasks through large language models. However, they often face challenges such as insufficient accuracy and degeneration of thought when handling complex reasoning tasks, which impede their performance. To overcome these issues, we propose the Reactive and Reflection agents with Multi-Path Reasoning (RR-MP) Framework, aimed at enhancing the reasoning capabilities of LLMs. Our approach improves scientific reasoning accuracy by employing a multi-path reasoning mechanism where each path consists of a reactive agent and a reflection agent that collaborate to prevent degeneration of thought inherent in single-agent reliance. Additionally, the RR-MP framework does not require additional training; it utilizes multiple dialogue instances for each reasoning path and a separate summarizer to consolidate insights from all paths. This design integrates diverse perspectives and strengthens reasoning across each path. We conducted zero-shot and few-shot evaluations on tasks involving moral scenarios, college-level physics, and mathematics. Experimental results demonstrate that our method outperforms baseline approaches, highlighting the effectiveness and advantages of the RR-MP framework in managing complex scientific reasoning tasks.
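
Illustration (not from the paper): the control flow described above, with several independent paths that each pair a reactive agent with a reflection agent, followed by a summarizer, can be sketched with plain callables. Everything here is an assumption for the example. The agents are deterministic stubs rather than LLM calls, the reflection stub is an oracle, and a majority vote stands in for the paper's separate summarizer.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (question, feedback) -> answer; empty feedback means "first attempt".
ReactiveAgent = Callable[[str, str], str]
# (question, answer) -> feedback; empty feedback means "accepted".
ReflectionAgent = Callable[[str, str], str]


def run_path(question: str, reactive: ReactiveAgent,
             reflection: ReflectionAgent, max_rounds: int = 3) -> str:
    """One reasoning path: answer, critique, and revise until accepted."""
    answer = reactive(question, "")
    for _ in range(max_rounds):
        feedback = reflection(question, answer)
        if not feedback:
            break
        answer = reactive(question, feedback)
    return answer


def summarize(answers: List[str]) -> str:
    """Stand-in summarizer: majority vote over the path answers."""
    return Counter(answers).most_common(1)[0][0]


def rr_mp(question: str,
          paths: List[Tuple[ReactiveAgent, ReflectionAgent]]) -> str:
    """Run every path independently, then consolidate the results."""
    return summarize([run_path(question, r, f) for r, f in paths])


# Hypothetical stub agents for the question "What is 2 + 2?".
def careless(question: str, feedback: str) -> str:
    return "4" if feedback else "5"   # corrects itself once criticized


def careful(question: str, feedback: str) -> str:
    return "4"


def stubborn(question: str, feedback: str) -> str:
    return "5"                        # ignores all feedback


def checker(question: str, answer: str) -> str:
    return "" if answer == "4" else "re-check the arithmetic"


final = rr_mp("What is 2 + 2?",
              [(careless, checker), (careful, checker), (stubborn, checker)])
print(final)
```

Running paths independently means one path that fails to recover (the `stubborn` stub) can be outvoted by the others.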

Title: OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models

Authors: Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00432
Pdf URL: https://arxiv.org/pdf/2501.00432
Copy Paste: [[2501.00432]] OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models(https://arxiv.org/abs/2501.00432)
Keywords: security, large language model
Abstract: Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and combining existing public human interaction datasets into a unified benchmark. Extensive experiments demonstrate that our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding, setting the stage for more intelligent and adaptable visual understanding systems in surveillance and beyond.

Title: Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

Authors: Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2501.00437
Pdf URL: https://arxiv.org/pdf/2501.00437
Copy Paste: [[2501.00437]] Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning(https://arxiv.org/abs/2501.00437)
Keywords: diffusion
Abstract: Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at this https URL.
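
Illustration (not from the paper): the abstract does not give the exact form of the CLIP-weighted cross-entropy loss. A minimal sketch of the general idea, a per-sample cross-entropy scaled by a quality weight, is shown below. The weights are assumed to be precomputed, non-negative image-text similarity scores. The normalization by the weight sum and the one-prediction-per-sample setup are simplifying assumptions.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax probability of the target class (stable form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def weighted_cross_entropy(batch_logits: List[Sequence[float]],
                           targets: List[int],
                           weights: List[float]) -> float:
    """Weighted mean of per-sample losses.

    `weights` are assumed to be non-negative quality scores, for example
    an image-text similarity per synthetic pair (an assumption here).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / total


logits = [[2.0, 0.0], [0.0, 0.0]]
targets = [0, 1]
# The second pair is treated as low quality, so it contributes less.
print(weighted_cross_entropy(logits, targets, [0.9, 0.1]))
```

With equal weights the function reduces to the ordinary mean cross-entropy. A weight of zero removes a pair from the loss entirely.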

Title: METANOIA: A Lifelong Intrusion Detection and Investigation System for Mitigating Concept Drift

Authors: Jie Ying, Tiantian Zhu, Aohan Zheng, Tieming Chen, Mingqi Lv, Yan Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.00438
Pdf URL: https://arxiv.org/pdf/2501.00438
Copy Paste: [[2501.00438]] METANOIA: A Lifelong Intrusion Detection and Investigation System for Mitigating Concept Drift(https://arxiv.org/abs/2501.00438)
Keywords: attack
Abstract: As Advanced Persistent Threat (APT) complexity increases, provenance data is increasingly used for detection. Anomaly-based systems are gaining attention due to their attack-knowledge-agnostic nature and ability to counter zero-day vulnerabilities. However, traditional detection paradigms, which train on offline, limited-size data, often overlook concept drift - unpredictable changes in streaming data distribution over time. This leads to high false positive rates. We propose incremental learning as a new paradigm to mitigate this issue. However, we identify FOUR CHALLENGES while integrating incremental learning as a new paradigm. First, the long-running incremental system must combat catastrophic forgetting (C1) and avoid learning malicious behaviors (C2). Then, the system needs to achieve precise alerts (C3) and reconstruct attack scenarios (C4). We present METANOIA, the first lifelong detection system that mitigates the high false positives due to concept drift. It connects pseudo edges to combat catastrophic forgetting, transfers suspicious states to avoid learning malicious behaviors, filters nodes at the path-level to achieve precise alerts, and constructs mini-graphs to reconstruct attack scenarios. Using state-of-the-art benchmarks, we demonstrate that METANOIA improves precision performance at the window-level, graph-level, and node-level by 30%, 54%, and 29%, respectively, compared to previous approaches.

Title: DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Authors: Jie Ying, Tiantian Zhu, Mingqi Lv, Tieming Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.00446
Pdf URL: https://arxiv.org/pdf/2501.00446
Copy Paste: [[2501.00446]] DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation(https://arxiv.org/abs/2501.00446)
Keywords: security, attack
Abstract: As the scope and impact of cyber threats have expanded, analysts utilize audit logs to hunt threats and investigate attacks. The provenance graphs constructed from kernel logs are increasingly considered as an ideal data source due to their powerful semantic expression and attack historic correlation ability. However, storing provenance graphs with traditional databases faces the challenge of high storage overhead, given the high frequency of kernel events and the persistence of attacks. To address this, we propose Dehydrator, an efficient provenance graph storage system. For the logs generated by auditing frameworks, Dehydrator uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally learns a deep neural network to support batch querying. We have conducted evaluations on seven datasets totaling over one billion log entries. Experimental results show that Dehydrator reduces the storage space by 84.55%. Dehydrator is 7.36 times more efficient than PostgreSQL, 7.16 times than Neo4j, and 16.17 times than Leonard (the work most closely related to Dehydrator, published at Usenix Security'23).

Title: Differentiable Prompt Learning for Vision Language Models

Authors: Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00457
Pdf URL: https://arxiv.org/pdf/2501.00457
Copy Paste: [[2501.00457]] Differentiable Prompt Learning for Vision Language Models(https://arxiv.org/abs/2501.00457)
Keywords: large language model
Abstract: Prompt learning is an effective way to exploit the potential of large-scale pre-trained foundational models. Continuous prompts parameterize context tokens in prompts by turning them into differentiable vectors. Deep continuous prompts insert prompts not only in the input but also in the intermediate hidden representations. Manually designed deep continuous prompts exhibit a remarkable improvement compared to the zero-shot pre-trained model on downstream tasks. How to automate the continuous prompt design is an underexplored area, and a fundamental question arises, is manually designed deep prompt strategy optimal? To answer this question, we propose a method dubbed differentiable prompt learning (DPL). The DPL method is formulated as an optimization problem to automatically determine the optimal context length of the prompt to be added to each layer, where the objective is to maximize the performance. We test the DPL method on the pre-trained CLIP. We empirically find that by using only limited data, our DPL method can find deep continuous prompt configuration with high confidence. The performance on the downstream tasks exhibits the superiority of the automatic design: our method boosts the average test accuracy by 2.60% on 11 datasets compared to baseline methods. Besides, our method focuses only on the prompt configuration (i.e. context length for each layer), which means that our method is compatible with the baseline methods that have sophisticated designs to boost the performance. The DPL method can be deployed to large language models or computer vision models at no cost.

Title: SAT-LDM: Provably Generalizable Image Watermarking for Latent Diffusion Models with Self-Augmented Training

Authors: Lu Zhang, Liang Zeng
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00463
Pdf URL: https://arxiv.org/pdf/2501.00463
Copy Paste: [[2501.00463]] SAT-LDM: Provably Generalizable Image Watermarking for Latent Diffusion Models with Self-Augmented Training(https://arxiv.org/abs/2501.00463)
Keywords: protect, robust, watermark, diffusion
Abstract: The proliferation of AI-generated images necessitates effective watermarking to protect intellectual property and identify fake content. While existing training-based watermarking methods show promise, they often struggle with generalization across diverse prompts and tend to produce noticeable artifacts. To this end, we introduce a provably generalizable image watermarking method for Latent Diffusion Models with Self-Augmented Training (SAT-LDM), which aligns the training and testing phases by a free generation distribution to bolster the watermarking module's generalization capabilities. We theoretically consolidate our method by proving that the free generation distribution contributes to its tight generalization bound without the need to collect new data. Extensive experimental results show that SAT-LDM achieves robust watermarking while significantly improving the quality of watermarked images across diverse prompts. Furthermore, we conduct experimental analyses to demonstrate the strong generalization abilities of SAT-LDM. We hope our method offers a practical and convenient solution for securing high-fidelity AI-generated content.

Title: Addressing Challenges in Data Quality and Model Generalization for Malaria Detection

Authors: Kiswendsida Kisito Kabore, Desire Guel
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2501.00464
Pdf URL: https://arxiv.org/pdf/2501.00464
Copy Paste: [[2501.00464]] Addressing Challenges in Data Quality and Model Generalization for Malaria Detection(https://arxiv.org/abs/2501.00464)
Keywords: robust
Abstract: Malaria remains a significant global health burden, particularly in resource-limited regions where timely and accurate diagnosis is critical to effective treatment and control. Deep Learning (DL) has emerged as a transformative tool for automating malaria detection and it offers high accuracy and scalability. However, the effectiveness of these models is constrained by challenges in data quality and model generalization including imbalanced datasets, limited diversity and annotation variability. These issues reduce diagnostic reliability and hinder real-world applicability. This article provides a comprehensive analysis of these challenges and their implications for malaria detection performance. Key findings highlight the impact of data imbalances which can lead to a 20% drop in F1-score and regional biases which significantly hinder model generalization. Proposed solutions, such as GAN-based augmentation, improved accuracy by 15-20% by generating synthetic data to balance classes and enhance dataset diversity. Domain adaptation techniques, including transfer learning, further improved cross-domain robustness by up to 25% in sensitivity. Additionally, the development of diverse global datasets and collaborative data-sharing frameworks is emphasized as a cornerstone for equitable and reliable malaria diagnostics. The role of explainable AI techniques in improving clinical adoption and trustworthiness is also underscored. By addressing these challenges, this work advances the field of AI-driven malaria detection and provides actionable insights for researchers and practitioners. The proposed solutions aim to support the development of accessible and accurate diagnostic tools, particularly for resource-constrained populations.

Title: Dementia Detection using Multi-modal Methods on Audio Data

Authors: Saugat Kannojia, Anirudh Praveen, Danish Vasdev, Saket Nandedkar, Divyansh Mittal, Sarthak Kalankar, Shaurya Johari, Vipul Arora
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00465
Pdf URL: https://arxiv.org/pdf/2501.00465
Copy Paste: [[2501.00465]] Dementia Detection using Multi-modal Methods on Audio Data(https://arxiv.org/abs/2501.00465)
Keywords: generative
Abstract: Dementia is a neurodegenerative disease that causes gradual cognitive impairment; it is common worldwide and is the subject of extensive research into prevention and cures. It severely impairs a patient's ability to remember events and communicate clearly. Most forms have no known cure, but early detection can help alleviate symptoms before they worsen. One of the main symptoms of dementia is difficulty in expressing ideas through speech. This paper describes a model developed to predict the onset of the disease from audio recordings of patients. An ASR-based model was developed that generates transcripts from the audio files using the Whisper model and then applies a RoBERTa regression model to generate an MMSE score for the patient. This score can be used to predict the extent to which the cognitive ability of a patient has been affected. We use the PROCESS_V1 dataset for this task, which is introduced through the PROCESS Grand Challenge 2025. The model achieved an RMSE score of 2.6911, which is around 10 percent lower than the described baseline.

Title: Score-Based Metropolis-Hastings Algorithms

Authors: Ahmed Aloui, Ali Hasan, Juncheng Dong, Zihao Wu, Vahid Tarokh
Subjects: cs.LG, stat.CO
Abstract URL: https://arxiv.org/abs/2501.00467
Pdf URL: https://arxiv.org/pdf/2501.00467
Copy Paste: [[2501.00467]] Score-Based Metropolis-Hastings Algorithms(https://arxiv.org/abs/2501.00467)
Keywords: diffusion
Abstract: In this paper, we introduce a new approach for integrating score-based models with the Metropolis-Hastings algorithm. While traditional score-based diffusion models excel in accurately learning the score function from data points, they lack an energy function, making the Metropolis-Hastings adjustment step inaccessible. Consequently, the unadjusted Langevin algorithm is often used for sampling using estimated score functions. The lack of an energy function then prevents the application of the Metropolis-adjusted Langevin algorithm and other Metropolis-Hastings methods, limiting the wealth of other algorithms developed that use acceptance functions. We address this limitation by introducing a new loss function based on the detailed balance condition, allowing the estimation of the Metropolis-Hastings acceptance probabilities given a learned score function. We demonstrate the effectiveness of the proposed method for various scenarios, including sampling from heavy-tail distributions.
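
Illustration (not from the paper): to show why the acceptance step needs an energy function, the sketch below runs the standard Metropolis-adjusted Langevin algorithm (MALA) on a one-dimensional standard Gaussian, where both the energy and the score are known in closed form. This is textbook MALA, not the paper's learned-acceptance method. The step size, seed, and target are arbitrary choices for the example.

```python
import math
import random
from typing import List, Tuple


def energy(x: float) -> float:
    """U(x) = x^2 / 2, so the target density is proportional to exp(-U)."""
    return 0.5 * x * x


def score(x: float) -> float:
    """Score = gradient of the log-density = -dU/dx."""
    return -x


def log_q(x_to: float, x_from: float, step: float) -> float:
    """Log-density (up to a constant) of the Langevin proposal."""
    mean = x_from + step * score(x_from)
    return -((x_to - mean) ** 2) / (4.0 * step)


def mala(n_samples: int, step: float = 0.5,
         seed: int = 0) -> Tuple[List[float], float]:
    """Return the chain and the fraction of accepted proposals."""
    rng = random.Random(seed)
    x, samples, accepted = 0.0, [], 0
    for _ in range(n_samples):
        noise = rng.gauss(0.0, 1.0)
        proposal = x + step * score(x) + math.sqrt(2.0 * step) * noise
        # The energy difference below is what a score-only model lacks.
        log_alpha = (energy(x) - energy(proposal)
                     + log_q(x, proposal, step)
                     - log_q(proposal, x, step))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x, accepted = proposal, accepted + 1
        samples.append(x)
    return samples, accepted / n_samples


chain, rate = mala(5000)
print(sum(chain) / len(chain), rate)
```

Dropping the accept/reject test gives the unadjusted Langevin algorithm mentioned in the abstract, which needs only the score but is biased for any finite step size.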

Title: Exploring Physics-Informed Neural Networks for Crop Yield Loss Forecasting

Authors: Miro Miranda, Marcela Charfuelan, Andreas Dengel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00502
Pdf URL: https://arxiv.org/pdf/2501.00502
Copy Paste: [[2501.00502]] Exploring Physics-Informed Neural Networks for Crop Yield Loss Forecasting(https://arxiv.org/abs/2501.00502)
Keywords: security, explainability, transformer
Abstract: In response to climate change, assessing crop productivity under extreme weather conditions is essential to enhance food security. Crop simulation models, which align with physical processes, offer explainability but often perform poorly. Conversely, machine learning (ML) models for crop modeling are powerful and scalable yet operate as black boxes and lack adherence to crop growth's physical principles. To bridge this gap, we propose a novel method that combines the strengths of both approaches by estimating the water use and the crop sensitivity to water scarcity at the pixel level. This approach enables yield loss estimation grounded in physical principles by sequentially solving the equation for crop yield response to water scarcity, using an enhanced loss function. Leveraging Sentinel-2 satellite imagery, climate data, simulated water use data, and pixel-level yield data, our model demonstrates high accuracy, achieving an R2 of up to 0.77, matching or surpassing state-of-the-art models like RNNs and Transformers. Additionally, it provides interpretable and physically consistent outputs, supporting industry, policymakers, and farmers in adapting to extreme weather conditions.
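
Illustration (not from the paper): the abstract does not spell out the yield-response equation. One widely used form is the FAO crop water production function, 1 - Ya/Yx = Ky * (1 - ETa/ETx), where Ya and Yx are actual and maximum yield, ETa and ETx are actual and maximum evapotranspiration, and Ky is the crop's sensitivity to water deficit. The sketch below assumes that form. The function names and numeric values are hypothetical.

```python
from typing import List


def relative_yield_loss(ky: float, eta: float, etx: float) -> float:
    """Fractional yield loss, Ky * (1 - ETa / ETx).

    ky  : yield response factor (crop sensitivity to water deficit)
    eta : actual evapotranspiration (water actually used)
    etx : maximum evapotranspiration (water use without stress)
    """
    if etx <= 0:
        raise ValueError("etx must be positive")
    if not 0 <= eta <= etx:
        raise ValueError("eta must lie between 0 and etx")
    return ky * (1.0 - eta / etx)


def actual_yield(max_yield: float, ky: float, eta: float, etx: float) -> float:
    """Actual yield implied by the loss fraction."""
    return max_yield * (1.0 - relative_yield_loss(ky, eta, etx))


# Per-pixel water use for three pixels (illustrative values only).
pixel_eta: List[float] = [400.0, 300.0, 200.0]
losses = [relative_yield_loss(1.25, e, 400.0) for e in pixel_eta]
print(losses)
print(actual_yield(8.0, 1.25, 300.0, 400.0))
```

In a physics-informed setup, quantities such as ETa and Ky would be predicted by the network per pixel and pushed through an equation like this one, so the yield-loss output stays consistent with the physical relation.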

Title: Fine-grained Video-Text Retrieval: A New Benchmark and Method

Authors: Yifan Xu, Xinhao Li, Yichun Yang, Rui Huang, Limin Wang
Subjects: cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00513
Pdf URL: https://arxiv.org/pdf/2501.00513
Copy Paste: [[2501.00513]] Fine-grained Video-Text Retrieval: A New Benchmark and Method(https://arxiv.org/abs/2501.00513)
Keywords: large language model
Abstract: The ability of perceiving fine-grained spatial and temporal information is crucial for video-language retrieval. However, the existing video retrieval benchmarks, such as MSRVTT and MSVD, fail to efficiently evaluate the fine-grained retrieval ability of video-language models (VLMs) due to a lack of detailed annotations. To address this problem, we present FIBER, a FIne-grained BEnchmark for text to video Retrieval, containing 1,000 videos sourced from the FineAction dataset. Uniquely, our FIBER benchmark provides detailed human-annotated spatial annotations and temporal annotations for each video, making it possible to independently evaluate the spatial and temporal bias of VLMs on video retrieval task. Besides, we employ a text embedding method to unlock the capability of fine-grained video-language understanding of Multimodal Large Language Models (MLLMs). Surprisingly, the experiment results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks and has a stronger capability of fine-grained representation with lower spatial-temporal bias. Project page: this https URL.

Title: A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense

Authors: Keke Zhai
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00517
Pdf URL: https://arxiv.org/pdf/2501.00517
Copy Paste: [[2501.00517]] A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense(https://arxiv.org/abs/2501.00517)
Keywords: security, defense, attack, generative
Abstract: Currently, large models are prone to generating harmful content when faced with complex attack instructions, significantly reducing their defensive capabilities. To address this issue, this paper proposes a method based on constructing data aligned with multi-dimensional attack defense to enhance the generative security of large models. The core of our method lies in improving the effectiveness of safe alignment learning for large models by innovatively increasing the diversity of attack instruction dimensions and the accuracy of generating safe responses. To validate the effectiveness of our method, beyond existing security evaluation benchmarks, we additionally designed new security evaluation benchmarks and conducted comparative experiments using Llama3.2 as the baseline model. The final experimental results demonstrate that our method can significantly improve the generative security of large models under complex instructional attacks, while also maintaining and enhancing the models' general capabilities.

Title: Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques

Authors: Bao Q. Bui, Tien T.T. Nguyen, Duy M. Le, Cong Tran, Cuong Pham
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00520
Pdf URL: https://arxiv.org/pdf/2501.00520
Copy Paste: [[2501.00520]] Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques(https://arxiv.org/abs/2501.00520)
Keywords: robust, transformer
Abstract: This paper presents a comprehensive study on the classification and detection of Silicosis-related lung inflammation. Our main contributions include 1) the creation of a newly curated chest X-ray (CXR) image dataset named SVBCX that is tailored to the nuances of lung inflammation caused by distinct agents, providing a valuable resource for silicosis and pneumonia research community; and 2) we propose a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ the Balanced Cross-Entropy (BalCE) as a loss function to ensure more uniform learning across different classes, enhancing the model's ability to discern subtle differences in lung conditions. The proposed model architecture and loss function selection aim to improve the accuracy and reliability of inflammation detection, particularly in the context of Silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset demonstrate promising outcomes, showcasing substantial enhancements compared to baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.

Title: Is Segment Anything Model 2 All You Need for Surgery Video Segmentation? A Systematic Evaluation

Authors: Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Qi Dou, Yutong Ban
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00525
Pdf URL: https://arxiv.org/pdf/2501.00525
Copy Paste: [[2501.00525]] Is Segment Anything Model 2 All You Need for Surgery Video Segmentation? A Systematic Evaluation(https://arxiv.org/abs/2501.00525)
Keywords: robust, segmentation
Abstract: Surgery video segmentation is an important topic in the surgical AI field. It allows the AI model to understand the spatial information of a surgical scene. Meanwhile, due to the lack of annotated surgical data, surgery segmentation models suffer from limited performance. With the emergence of the SAM2 model, a large foundation model for video segmentation trained on natural videos, zero-shot surgical video segmentation has become more realistic but remains to be explored. In this paper, we systematically evaluate the performance of the SAM2 model on the zero-shot surgery video segmentation task. We conducted experiments under different configurations, including different prompting strategies, robustness, etc. Moreover, we conducted an empirical evaluation over the performance, including 9 datasets with 17 different types of surgeries.

Title: Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves

Authors: Madeleine Darbyshire, Elizabeth Sklar, Simon Parsons
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00527
Pdf URL: https://arxiv.org/pdf/2501.00527
Copy Paste: [[2501.00527]] Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves(https://arxiv.org/abs/2501.00527)
Keywords: segmentation
Abstract: Precision agriculture leverages data and machine learning so that farmers can monitor their crops and target interventions precisely. This enables the precision application of herbicide only to weeds, or the precision application of fertilizer only to undernourished crops, rather than to the entire field. The approach promises to maximize yields while minimizing resource use and harm to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an identifier of plant growth) and locates weeds within an image. In particular, our approach aims to improve the segmentation of smaller instances like the leaves and weeds by incorporating focal loss and boundary loss. Not only does this result in competitive performance, achieving a PQ+ of 81.89 on the standard training set, but we also demonstrate we can improve leaf-counting accuracy with our method. The code is available at this https URL.
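
The focal loss mentioned in the abstract above can be illustrated with a minimal sketch. This is the standard per-example focal loss, not the authors' implementation; the function names and the default values of `gamma` and `alpha` are assumptions chosen for illustration, and the boundary loss term is not sketched here.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Plain cross-entropy for one example, given the probability of the true class."""
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example: -alpha * (1 - p)^gamma * log(p).

    The (1 - p)^gamma factor shrinks the loss of well-classified examples,
    so hard examples (often small instances) dominate the gradient.
    gamma and alpha here are illustrative defaults, not values from the paper.
    """
    p = min(max(p_true, 1e-12), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


if __name__ == "__main__":
    for p in (0.9, 0.3):
        print(f"p={p}: CE={cross_entropy(p):.4f}, focal={focal_loss(p):.4f}")
```

With `gamma = 0` the expression reduces to ordinary cross-entropy; larger `gamma` values down-weight easy examples more strongly.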

Title: Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches

Authors: Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, Surangika Ranathunga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00529
Pdf URL: https://arxiv.org/pdf/2501.00529
Copy Paste: [[2501.00529]] Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches(https://arxiv.org/abs/2501.00529)
Keywords: transformer
Abstract: Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is eminently prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: Our baseline is a rule-based method, which is then compared against our second method where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method could capture many ad-hoc patterns within the Romanized scripts compared to the rule-based method. The code base associated with this paper is available on GitHub - this https URL

Title: Superposition in Transformers: A Novel Way of Building Mixture of Experts

Authors: Ayoub Ben Chaliah, Hela Dellagi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00530
Pdf URL: https://arxiv.org/pdf/2501.00530
Copy Paste: [[2501.00530]] Superposition in Transformers: A Novel Way of Building Mixture of Experts(https://arxiv.org/abs/2501.00530)
Keywords: transformer, large language model
Abstract: Catastrophic forgetting remains a major challenge when adapting large language models (LLMs) to new tasks or domains. Conventional fine-tuning often overwrites existing knowledge, causing performance degradation on original tasks. We introduce Superposition in Transformers, a novel architecture that leverages autoencoders to superimpose the hidden representations of a base model and a fine-tuned model within a shared parameter space. By using B-spline-based blending coefficients and autoencoders that adaptively reconstruct hidden states based on the input data distribution, our method effectively mitigates catastrophic forgetting and enables a new paradigm of "in-model" superposition. This approach preserves original model capabilities while allowing compact domain-specific expertise to be added, and it supports dynamic switching between model states during inference.

Title: Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs

Authors: Harit Vishwakarma, Alan Mishler, Thomas Cook, Niccolò Dalmasso, Natraj Raman, Sumitra Ganesh
Subjects: cs.LG, cs.AI, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2501.00555
Pdf URL: https://arxiv.org/pdf/2501.00555
Copy Paste: [[2501.00555]] Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs(https://arxiv.org/abs/2501.00555)
Keywords: robust, large language model
Abstract: Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, they often make overconfident, incorrect predictions, which can be risky in high-stakes settings like healthcare and finance. To mitigate these risks, recent works have used conformal prediction (CP), a model-agnostic framework for distribution-free uncertainty quantification. CP transforms a score function into prediction sets that contain the true answer with high probability. While CP provides this coverage guarantee for arbitrary scores, the score quality significantly impacts prediction set sizes. Prior works have relied on LLM logits or other heuristic scores, lacking quality guarantees. We address this limitation by introducing CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Furthermore, inspired by the Monty Hall problem, we extend CP's utility beyond uncertainty quantification to improve accuracy. We propose conformal revision of questions (CROQ) to revise the problem by narrowing down the available choices to those in the prediction set. The coverage guarantee of CP ensures that the correct choice is in the revised question prompt with high probability, while the smaller number of choices increases the LLM's chances of answering it correctly. Experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with Gemma-2, Llama-3 and Phi-3 models show that CP-OPT significantly reduces set sizes while maintaining coverage, and CROQ improves accuracy over the standard inference, especially when paired with CP-OPT scores. Together, CP-OPT and CROQ offer a robust framework for improving both the safety and accuracy of LLM-driven decision-making.
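
The conformal-prediction step and the question revision described in the abstract above can be sketched as follows. This is a minimal illustration of standard split conformal prediction with the heuristic score 1 - p(option), not the paper's CP-OPT learned scores; the function names, the calibration scores, and the choice to keep the original option labels in the revised prompt are all assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    calibration score, where each score is computed on the true answer."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(option_probs, q_hat):
    """Keep every option whose nonconformity score 1 - p is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= q_hat]


def revise_question(question, options, keep):
    """Rebuild the prompt with only the options in the prediction set
    (hypothetical formatting; original labels are kept)."""
    lines = [question] + [f"{k}. {v}" for k, v in options.items() if k in keep]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical calibration scores: 1 - p(true answer) on held-out questions.
    cal = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.45, 0.60, 0.90]
    q_hat = conformal_threshold(cal, alpha=0.25)
    probs = {"A": 0.45, "B": 0.42, "C": 0.08, "D": 0.05}
    keep = prediction_set(probs, q_hat)
    options = {"A": "Paris", "B": "Lyon", "C": "Nice", "D": "Lille"}
    print(revise_question("Which city is the capital of France?", options, keep))
```

The revised prompt would then be sent back to the model; a better score function yields smaller sets for the same coverage level, which is the quantity CP-OPT is described as optimizing.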

Title: AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects

Authors: Ahmad Mustapha, Hadi Al-Khansa, Hadi Al-Mubasher, Aya Mourad, Ranam Hamoud, Hasan El-Husseini, Marwah Al-Sakkaf, Mariette Awad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00559
Pdf URL: https://arxiv.org/pdf/2501.00559
Copy Paste: [[2501.00559]] AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects(https://arxiv.org/abs/2501.00559)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, not only in generating human-like text, but also in acquiring knowledge. This highlights the need to go beyond the typical Natural Language Processing downstream benchmarks and assess the various aspects of LLMs including knowledge and reasoning. Numerous benchmarks have been developed to evaluate LLMs' knowledge, but they predominantly focus on the English language. Given that many LLMs are multilingual, relying solely on benchmarking English knowledge is insufficient. To address this issue, we introduce AraSTEM, a new Arabic multiple-choice question dataset aimed at evaluating LLMs' knowledge in STEM subjects. The dataset spans a range of topics at different levels, which requires models to demonstrate a deep understanding of scientific Arabic in order to achieve high accuracy. Our findings show that publicly available models of varying sizes struggle with this dataset, and underscore the need for more localized language models. The dataset is freely accessible on Hugging Face.

Title: An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems

Authors: Hashmath Shaik, Alex Doboli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00562
Pdf URL: https://arxiv.org/pdf/2501.00562
Copy Paste: [[2501.00562]] An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems(https://arxiv.org/abs/2501.00562)
Keywords: large language model
Abstract: Large Language Models offer new opportunities to devise automated implementation generation methods that can tackle problem solving activities beyond traditional methods, which require algorithmic specifications and can use only static domain knowledge, like performance metrics and libraries of basic building blocks. Large Language Models could support creating new methods to support problem solving activities for open-ended problems, like problem framing, exploring possible solving approaches, feature elaboration and combination, more advanced implementation assessment, and handling unexpected situations. This report summarized the current work on Large Language Models, including model prompting, Reinforcement Learning, and Retrieval-Augmented Generation. Future research requirements were also discussed.

Title: Probing Visual Language Priors in VLMs

Authors: Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00569
Pdf URL: https://arxiv.org/pdf/2501.00569
Copy Paste: [[2501.00569]] Probing Visual Language Priors in VLMs(https://arxiv.org/abs/2501.00569)
Keywords: generative
Abstract: Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual language priors present in their training data rather than true visual reasoning. To examine the situation, we introduce ViLP, a visual question answering (VQA) benchmark that pairs each question with three potential answers and three corresponding images: one image whose answer can be inferred from text alone, and two images that demand visual reasoning. By leveraging image generative models, we ensure significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts, making our benchmark images distinctly out-of-distribution. While humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA pairs and images, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on actual visual inputs and have demonstrated their effectiveness in enhancing the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.

Title: KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities

Authors: Chengcheng Mai, Yuxiang Wang, Ziyu Gong, Hanxiang Wang, Yihua Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00571
Pdf URL: https://arxiv.org/pdf/2501.00571
Copy Paste: [[2501.00571]] KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities(https://arxiv.org/abs/2501.00571)
Keywords: extraction
Abstract: Document-level relation extraction (Doc-RE) aims to extract relations between entities across multiple sentences. Therefore, compared to sentence-level RE, Doc-RE requires more comprehensive, human-like reasoning abilities, involving complex cross-sentence interactions between entities, contexts, and external general knowledge. However, most existing Doc-RE methods focus on optimizing a single reasoning ability and lack the ability to utilize external knowledge for comprehensive reasoning on long documents. To solve these problems, a knowledge retrieval augmented method, named KnowRA, was proposed with comprehensive reasoning to autonomously determine whether to accept external knowledge to assist Doc-RE. Firstly, we constructed a document graph for semantic encoding and integrated the co-reference resolution model into KnowRA to augment the co-reference reasoning ability. Then, we further expanded the document graph into a document knowledge graph by retrieving the external knowledge base and introduced the axis attention mechanism into KnowRA to improve its common-sense and logical reasoning abilities, respectively. Finally, a knowledge filtering method was presented in the common-sense and co-reference reasoning module to filter out irrelevant knowledge. Extensive experiments conducted on two datasets verified the effectiveness of our method compared to the state-of-the-art baselines. Our code is available at this https URL.

Title: VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Authors: Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00574
Pdf URL: https://arxiv.org/pdf/2501.00574
Copy Paste: [[2501.00574]] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling(https://arxiv.org/abs/2501.00574)
Keywords: large language model
Abstract: Long-context modeling is a critical capability for multimodal large language models (MLLMs), enabling them to process long-form contents with implicit memorization. Despite its advances, handling extremely long videos remains challenging due to the difficulty in maintaining crucial features over extended sequences. This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation and a practical context modeling system VideoChat-Flash tailored for multimodal long-sequence processing. HiCo capitalizes on the redundancy of visual information in long videos to compress long video context from the clip-level to the video-level, reducing the compute significantly while preserving essential details. VideoChat-Flash features a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos named LongVid, and an upgraded "Needle-In-A-video-Haystack" (NIAH) for evaluating context capacities. In extensive experiments, VideoChat-Flash shows the leading performance on both mainstream long and short video benchmarks at the 7B model scale. It is the first among open-source models to reach 99.1% accuracy over 10,000 frames in NIAH.

Title: Causal Graph Guided Steering of LLM Values via Prompts and Sparse Autoencoders

Authors: Yipeng Kang, Junqi Wang, Yexin Li, Fangwei Zhong, Xue Feng, Mengmeng Wang, Wenming Tu, Quansen Wang, Hengli Li, Zilong Zheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00581
Pdf URL: https://arxiv.org/pdf/2501.00581
Copy Paste: [[2501.00581]] Causal Graph Guided Steering of LLM Values via Prompts and Sparse Autoencoders(https://arxiv.org/abs/2501.00581)
Keywords: large language model
Abstract: As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), often focus on a limited set of values and can be resource-intensive. Furthermore, the correlation between values has been largely overlooked and remains underutilized. Our framework addresses this limitation by mining a causal graph that elucidates the implicit relationships among various values within the LLMs. Leveraging the causal graph, we implement two lightweight mechanisms for value steering: prompt template steering and Sparse Autoencoder feature steering, and analyze the effects of altering one value dimension on others. Extensive experiments conducted on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our steering methods.

Title: Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

Authors: Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00584
Pdf URL: https://arxiv.org/pdf/2501.00584
Copy Paste: [[2501.00584]] Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method(https://arxiv.org/abs/2501.00584)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark specifically designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts (past, present, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite the lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.

Title: Sidewalk Hazard Detection Using Variational Autoencoder and One-Class SVM

Authors: Edgar Guzman, Robert D. Howe
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2501.00585
Pdf URL: https://arxiv.org/pdf/2501.00585
Copy Paste: [[2501.00585]] Sidewalk Hazard Detection Using Variational Autoencoder and One-Class SVM(https://arxiv.org/abs/2501.00585)
Keywords: robust
Abstract: The unpredictable nature of outdoor settings introduces numerous safety concerns, making hazard detection crucial for safe navigation. This paper introduces a novel system for sidewalk safety navigation utilizing a hybrid approach that combines a Variational Autoencoder (VAE) with a One-Class Support Vector Machine (OCSVM). The system is designed to detect anomalies on sidewalks that could potentially pose walking hazards. A dataset comprising over 15,000 training frames and 5,000 testing frames was collected using video recordings, capturing various sidewalk scenarios, including normal and hazardous conditions. During deployment, the VAE utilizes its reconstruction mechanism to detect anomalies within a frame. Poor reconstruction by the VAE implies the presence of an anomaly, after which the OCSVM is used to confirm whether the anomaly is hazardous or non-hazardous. The proposed VAE model demonstrated strong performance, with a high Area Under the Curve (AUC) of 0.94, effectively distinguishing anomalies that could be potential hazards. The OCSVM is employed to reduce the detection of false hazard anomalies, such as manhole or water valve covers. This approach achieves an accuracy of 91.4%, providing a highly reliable system for distinguishing between hazardous and non-hazardous scenarios. These results suggest that the proposed system offers a robust solution for hazard detection in uncertain environments.

Title: Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation

Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Banu Diri, Savaş Yıldırım, Öner Aytaş
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00593
Pdf URL: https://arxiv.org/pdf/2501.00593
Copy Paste: [[2501.00593]] Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation(https://arxiv.org/abs/2501.00593)
Keywords: robust, large language model
Abstract: Language models have made remarkable advancements in understanding and generating human language, achieving notable success across a wide array of applications. However, evaluating these models remains a significant challenge, particularly for resource-limited languages such as Turkish. To address this gap, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is constructed from a carefully curated dataset comprising 6200 multiple-choice questions across 62 sections, selected from a pool of 280000 questions spanning 67 disciplines and over 800 topics within the Turkish education system. This benchmark provides a transparent, reproducible, and culturally relevant tool for evaluating model performance. It serves as a standard framework for Turkish NLP research, enabling detailed analyses of LLMs' capabilities in processing Turkish text and fostering the development of more robust and accurate language models. In this study, we evaluate state-of-the-art LLMs on TR-MMLU, providing insights into their strengths and limitations for Turkish-specific tasks. Our findings reveal critical challenges, such as the impact of tokenization and fine-tuning strategies, and highlight areas for improvement in model design. By setting a new standard for evaluating Turkish language models, TR-MMLU aims to inspire future innovations and support the advancement of Turkish NLP research.

Title: Unbiased GNN Learning via Fairness-Aware Subgraph Diffusion

Authors: Abdullah Alchihabi, Yuhong Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00595
Pdf URL: https://arxiv.org/pdf/2501.00595
Copy Paste: [[2501.00595]] Unbiased GNN Learning via Fairness-Aware Subgraph Diffusion(https://arxiv.org/abs/2501.00595)
Keywords: fair, diffusion, generative
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable efficacy in tackling a wide array of graph-related tasks across diverse domains. However, a significant challenge lies in their propensity to generate biased predictions, particularly with respect to sensitive node attributes such as age and gender. These biases, inherent in many machine learning models, are amplified in GNNs due to the message-passing mechanism, which allows nodes to influence each other, rendering the task of making fair predictions notably challenging. This issue is particularly pertinent in critical domains where model fairness holds paramount importance. In this paper, we propose a novel generative Fairness-Aware Subgraph Diffusion (FASD) method for unbiased GNN learning. The method initiates by strategically sampling small subgraphs from the original large input graph, and then proceeds to conduct subgraph debiasing via generative fairness-aware graph diffusion processes based on stochastic differential equations (SDEs). To effectively diffuse unfairness in the input data, we introduce additional adversary bias perturbations to the subgraphs during the forward diffusion process, and train score-based models to predict these applied perturbations, enabling them to learn the underlying dynamics of the biases present in the data. Subsequently, the trained score-based models are utilized to further debias the original subgraph samples through the reverse diffusion process. Finally, FASD induces fair node predictions on the input graph by performing standard GNN learning on the debiased subgraphs. Experimental results demonstrate the superior performance of the proposed method over state-of-the-art Fair GNN baselines across multiple benchmark datasets.

Title: VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Authors: Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00599
Pdf URL: https://arxiv.org/pdf/2501.00599
Copy Paste: [[2501.00599]] VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM(https://arxiv.org/abs/2501.00599)
Keywords: large language model
Abstract: Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specifically, we thoroughly develop VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which equips a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create a VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.

Title: DreamDrive: Generative 4D Scene Modeling from Street View Images

Authors: Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2501.00601
Pdf URL: https://arxiv.org/pdf/2501.00601
Copy Paste: [[2501.00601]] DreamDrive: Generative 4D Scene Modeling from Street View Images(https://arxiv.org/abs/2501.00601)
Keywords: diffusion, generative
Abstract: Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

Title: STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes

Authors: Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, Boris Ivanovic, Yue Wang, Marco Pavone
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00602
Pdf URL: https://arxiv.org/pdf/2501.00602
Copy Paste: [[2501.00602]] STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes(https://arxiv.org/abs/2501.00602)
Keywords: transformer
Abstract: We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations--parameterized by 3D Gaussians and their velocities--in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.

Title: DiC: Rethinking Conv3x3 Designs in Diffusion Models

Authors: Yuchuan Tian, Jing Han, Chengcheng Wang, Yuchen Liang, Chao Xu, Hanting Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00603
Pdf URL: https://arxiv.org/pdf/2501.00603
Copy Paste: [[2501.00603]] DiC: Rethinking Conv3x3 Designs in Diffusion Models(https://arxiv.org/abs/2501.00603)
Keywords: diffusion, transformer
Abstract: Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on the complicated self-attention operation results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, 3x3 Convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but still underperforms our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: this https URL

Title: Time-Varying Graph Learning for Data with Heavy-Tailed Distribution

Authors: Amirhossein Javaheri, Jiaxi Ying, Daniel P. Palomar, Farokh Marvasti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00606
Pdf URL: https://arxiv.org/pdf/2501.00606
Copy Paste: [[2501.00606]] Time-Varying Graph Learning for Data with Heavy-Tailed Distribution(https://arxiv.org/abs/2501.00606)
Keywords: robust
Abstract: Graph models provide efficient tools to capture the underlying structure of data defined over networks. Many real-world network topologies are subject to change over time. Learning to model the dynamic interactions between entities in such networks is known as time-varying graph learning. Current methodology for learning such models often lacks robustness to outliers in the data and fails to handle heavy-tailed distributions, a common feature in many real-world datasets (e.g., financial data). This paper addresses the problem of learning time-varying graph models capable of efficiently representing heavy-tailed data. Unlike traditional approaches, we incorporate graph structures with specific spectral properties to enhance data clustering in our model. Our proposed method, which can also deal with noise and missing values in the data, is based on a stochastic approach, where a non-negative vector auto-regressive (VAR) model captures the variations in the graph and a Student-t distribution models the signal originating from this underlying time-varying graph. We propose an iterative method to learn time-varying graph topologies within a semi-online framework where only a mini-batch of data is used to update the graph. Simulations with both synthetic and real datasets demonstrate the efficacy of our model in analyzing heavy-tailed data, particularly those found in financial markets.

Title: A Study on Context Length and Efficient Transformers for Biomedical Image Analysis

Authors: Sarah M. Hooper, Hui Xue
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00619
Pdf URL: https://arxiv.org/pdf/2501.00619
Copy Paste: [[2501.00619]] A Study on Context Length and Efficient Transformers for Biomedical Image Analysis(https://arxiv.org/abs/2501.00619)
Keywords: transformer, segmentation
Abstract: Biomedical imaging modalities often produce high-resolution, multi-dimensional images that pose computational challenges for deep neural networks. These computational challenges are compounded when training transformers due to the self-attention operator, which scales quadratically with context length. Recent developments in long-context models have potential to alleviate these difficulties and enable more efficient application of transformers to large biomedical images, although a systematic evaluation on this topic is lacking. In this study, we investigate the impact of context length on biomedical image analysis and we evaluate the performance of recently proposed long-context models. We first curate a suite of biomedical imaging datasets, including 2D and 3D data for segmentation, denoising, and classification tasks. We then analyze the impact of context length on network performance using the Vision Transformer and Swin Transformer by varying patch size and attention window size. Our findings reveal a strong relationship between context length and performance, particularly for pixel-level prediction tasks. Finally, we show that recent long-context models demonstrate significant improvements in efficiency while maintaining comparable performance, though we highlight where gaps remain. This work underscores the potential and challenges of using long-context models in biomedical imaging.
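
The link between patch size, context length, and quadratic attention cost can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the 512x512 image size are assumptions, not taken from the paper, and it counts query-key pairs rather than measuring real runtime.

```python
def context_length(image_shape, patch_size):
    """Number of tokens when an image is cut into non-overlapping patches.

    image_shape may be 2D (H, W) or 3D (D, H, W); patch_size is the edge
    length of a square or cubic patch.
    """
    tokens = 1
    for dim in image_shape:
        if dim % patch_size != 0:
            raise ValueError("each image dimension must be divisible by patch_size")
        tokens *= dim // patch_size
    return tokens


def attention_pairs(num_tokens):
    """Query-key interactions in full self-attention: quadratic in token count."""
    return num_tokens ** 2


# Halving the patch size on a 2D image quadruples the token count
# and multiplies the number of attention pairs by sixteen.
for patch in (32, 16, 8):
    n = context_length((512, 512), patch)
    print(patch, n, attention_pairs(n))
```

Smaller patches give finer spatial detail per token, which is one plausible reading of why pixel-level tasks benefit from longer context, but the cost grows quickly.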

Title: Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting

Authors: Kyle Gao, Liangzhi Li, Hongjie He, Dening Lu, Linlin Xu, Jonathan Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2501.00625
Pdf URL: https://arxiv.org/pdf/2501.00625
Copy Paste: [[2501.00625]] Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting(https://arxiv.org/abs/2501.00625)
Keywords: segmentation
Abstract: Recently released open-source pre-trained foundational image segmentation and object detection models (SAM2+GroundingDINO) allow for geometrically consistent segmentation of objects of interest in multi-view 2D images. Users can use text-based or click-based prompts to segment objects of interest without requiring labeled training datasets. Gaussian Splatting allows for the learning of the 3D representation of a scene's geometry and radiance based on 2D images. Combining Google Earth Studio, SAM2+GroundingDINO, 2D Gaussian Splatting, and our improvements in mask refinement based on morphological operations and contour simplification, we created a pipeline to extract the 3D mesh of any building based on its name, address, or geographic coordinates.

Title: Applying Graph Explanation to Operator Fusion

Authors: Keith G. Mills, Muhammad Fetrat Qharabagh, Weichen Qiu, Fred X. Han, Mohammad Salameh, Wei Lu, Shangling Jui, Di Niu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00636
Pdf URL: https://arxiv.org/pdf/2501.00636
Copy Paste: [[2501.00636]] Applying Graph Explanation to Operator Fusion(https://arxiv.org/abs/2501.00636)
Keywords: robust
Abstract: Layer fusion techniques are critical to improving the inference efficiency of deep neural networks (DNNs) for deployment. Fusion aims to lower inference costs by reducing data transactions between an accelerator's on-chip buffer and DRAM. This is accomplished by executing multiple operations, such as convolutions and activations, together as single execution units called fusion groups. However, on-chip buffer capacity limits fusion group size, and optimizing fusion on whole DNNs requires partitioning them into multiple fusion groups. Finding the optimal groups is a complex problem where the presence of invalid solutions hampers traditional search algorithms and demands robust approaches. In this paper, we incorporate Explainable AI, specifically Graph Explanation Techniques (GET), into layer fusion. Given an invalid fusion group, we identify the operations most responsible for group invalidity, then use this knowledge to recursively split the original fusion group via a greedy tree-based algorithm to minimize DRAM access. We pair our scheme with common algorithms and optimize DNNs on two types of layer fusion: Line-Buffer Depth First (LBDF) and Branch Requirement Reduction (BRR). Experiments demonstrate the efficacy of our scheme on several popular and classical convolutional neural networks like ResNets and MobileNets. Our scheme achieves over 20% DRAM Access reduction on EfficientNet-B3.

Title: Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation

Authors: Tianfu Wang, Mingyang Xie, Haoming Cai, Sachin Shah, Christopher A. Metzler
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00637
Pdf URL: https://arxiv.org/pdf/2501.00637
Copy Paste: [[2501.00637]] Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation(https://arxiv.org/abs/2501.00637)
Keywords: robust, diffusion
Abstract: Transparent surfaces, such as glass, create complex reflections that obscure images and challenge downstream computer vision applications. We introduce Flash-Split, a robust framework for separating transmitted and reflected light using a single (potentially misaligned) pair of flash/no-flash images. Our core idea is to perform latent-space reflection separation while leveraging the flash cues. Specifically, Flash-Split consists of two stages. Stage 1 separates the reflection latent and transmission latent via a dual-branch diffusion model conditioned on an encoded flash/no-flash latent pair, effectively mitigating the flash/no-flash misalignment issue. Stage 2 restores high-resolution, faithful details to the separated latents via a cross-latent decoding process conditioned on the original images before separation. By validating Flash-Split on challenging real-world scenes, we demonstrate state-of-the-art reflection separation performance and significantly outperform the baseline methods.

Title: Efficient Standardization of Clinical Notes using Large Language Models

Authors: Daniel B. Hier, Michael D. Carrithers, Thanh Son Do, Tayo Obafemi-Ajayi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00644
Pdf URL: https://arxiv.org/pdf/2501.00644
Copy Paste: [[2501.00644]] Efficient Standardization of Clinical Notes using Large Language Models(https://arxiv.org/abs/2501.00644)
Keywords: extraction, large language model
Abstract: Clinician notes are a rich source of patient information but often contain inconsistencies due to varied writing styles, colloquialisms, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder the extraction of meaningful data from electronic health records (EHRs), posing challenges for quality improvement, population health, precision medicine, decision support, and research. We present a large language model approach to standardizing a corpus of 1,618 clinical notes. Standardization corrected an average of $4.9 \pm 1.8$ grammatical errors, $3.3 \pm 5.2$ spelling errors, converted $3.1 \pm 3.0$ non-standard terms to standard terminology, and expanded $15.8 \pm 9.1$ abbreviations and acronyms per note. Additionally, notes were re-organized into canonical sections with standardized headings. This process prepared notes for key concept extraction, mapping to medical ontologies, and conversion to interoperable data formats such as FHIR. Expert review of randomly sampled notes found no significant data loss after standardization. This proof-of-concept study demonstrates that standardization of clinical notes can improve their readability, consistency, and usability, while also facilitating their conversion into interoperable data formats.

Title: SoundBrush: Sound as a Brush for Visual Scene Editing

Authors: Kim Sung-Bin, Kim Jun-Seong, Junseok Ko, Yewon Kim, Tae-Hyun Oh
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.00645
Pdf URL: https://arxiv.org/pdf/2501.00645
Copy Paste: [[2501.00645]] SoundBrush: Sound as a Brush for Visual Scene Editing(https://arxiv.org/abs/2501.00645)
Keywords: diffusion, generative
Abstract: We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at this https URL.

Title: Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models

Authors: Suttisak Wizadwongsa, Jinfan Zhou, Edward Li, Jeong Joon Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00651
Pdf URL: https://arxiv.org/pdf/2501.00651
Copy Paste: [[2501.00651]] Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models(https://arxiv.org/abs/2501.00651)
Keywords: transformer, generative
Abstract: Recent AI-based 3D content creation has largely evolved along two paths: feed-forward image-to-3D reconstruction approaches and 3D generative models trained with 2D or 3D supervision. In this work, we show that existing feed-forward reconstruction methods can serve as effective latent encoders for training 3D generative models, thereby bridging these two paradigms. By reusing powerful pre-trained reconstruction models, we avoid computationally expensive encoder network training and obtain rich 3D latent features for generative modeling for free. However, the latent spaces of reconstruction models are not well-suited for generative modeling due to their unstructured nature. To enable flow-based model training on these latent features, we develop post-processing pipelines, including protocols to standardize the features and spatial weighting to concentrate on important regions. We further incorporate a 2D image space perceptual rendering loss to handle the high-dimensional latent spaces. Finally, we propose a multi-stream transformer-based rectified flow architecture to achieve linear scaling and high-quality text-conditioned 3D generation. Our framework leverages the advancements of feed-forward reconstruction models to enhance the scalability of 3D generative modeling, achieving both high computational efficiency and state-of-the-art performance in text-to-3D generation.

Title: Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

Authors: Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00658
Pdf URL: https://arxiv.org/pdf/2501.00658
Copy Paste: [[2501.00658]] Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing(https://arxiv.org/abs/2501.00658)
Keywords: robust, transformer
Abstract: Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then show that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, e.g., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and unlocks SSMs to benefit further from deeper architectures. All source code is released at this https URL.
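
Recency bias can be illustrated with a toy scalar recurrence. The sketch below assumes a single time-invariant channel `h_t = a * h_{t-1} + x_t`, which is a deliberate simplification and not the authors' model or code; the function names are hypothetical. Unrolling the recurrence shows that an input `k` steps in the past enters the final state with weight `a**k`, so any `0 < a < 1` discounts distant tokens exponentially, while `a = 1` keeps every input and `a = 0` keeps only the current one.

```python
import numpy as np


def influence_on_last_state(a, seq_len):
    """Weight of each input x_s in the final state h_T of h_t = a*h_{t-1} + x_t."""
    # Unrolled: h_T = sum_s a**(T - s) * x_s, with h_0 = 0.
    distances = np.arange(seq_len - 1, -1, -1)  # T - s for s = 1..T
    return float(a) ** distances


def run_recurrence(a, inputs):
    """Explicit loop, used to check the unrolled weights."""
    h = 0.0
    for x in inputs:
        h = a * h + x
    return h


# Transition values 0 and 1 are the two extremes of this toy channel;
# 0.9 shows the exponential decay in between.
for a in (0.0, 0.9, 1.0):
    print(a, influence_on_last_state(a, 5))
```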

Title: Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph

Authors: Kazuki Irie
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00659
Pdf URL: https://arxiv.org/pdf/2501.00659
Copy Paste: [[2501.00659]] Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph(https://arxiv.org/abs/2501.00659)
Keywords: transformer
Abstract: Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is "no" as long as they have more than one layer -- they can distinguish sequences with permuted tokens without requiring explicit PEs. This property has been known since early efforts (those contemporary with GPT-2) adopting the Transformer for language modeling. However, this result does not appear to have been well disseminated and was even rediscovered recently. This may be partially due to a sudden growth of the language modeling community after the advent of GPT-2, but perhaps also due to the lack of a clear explanation in prior publications, despite being commonly understood by practitioners in the past. Here we review this long-forgotten explanation of why explicit PEs are nonessential for multi-layer autoregressive Transformers (in contrast, one-layer models require PEs to discern order information of their input tokens). We also review the origin of this result, and hope to re-establish it as common knowledge.
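
The one-layer versus multi-layer contrast can be checked numerically. The following is a minimal sketch, assuming single-head causal attention with a residual connection, random weights, and no positional encoding; all names and sizes are illustrative and not taken from the paper. With one layer, the output at the last position depends only on the last token and the unordered set of earlier tokens, so shuffling the prefix leaves it unchanged. With two layers, the first layer's outputs at earlier positions already depend on what preceded them, so the final output changes.

```python
import numpy as np


def causal_self_attention(x, wq, wk, wv):
    """Single-head causal self-attention with a residual, no positional encoding."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return x + weights @ v


def last_token_output(x, layers):
    """Apply the given attention layers in order and return the last position."""
    for wq, wk, wv in layers:
        x = causal_self_attention(x, wq, wk, wv)
    return x[-1]


rng = np.random.default_rng(0)
d = 8
layers = [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
          for _ in range(2)]
tokens = rng.normal(size=(5, d))
permuted = tokens[[2, 0, 3, 1, 4]]  # shuffle the prefix, keep the last token

one_layer_same = np.allclose(last_token_output(tokens, layers[:1]),
                             last_token_output(permuted, layers[:1]))
two_layer_same = np.allclose(last_token_output(tokens, layers),
                             last_token_output(permuted, layers))
print(one_layer_same, two_layer_same)
```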

Title: Titans: Learning to Memorize at Test Time

Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00663
Pdf URL: https://arxiv.org/pdf/2501.00663
Copy Paste: [[2501.00663]] Titans: Learning to Memorize at Test Time(https://arxiv.org/abs/2501.00663)
Keywords: transformer
Abstract: For more than a decade, there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining fast inference. From a memory perspective, we argue that attention, due to its limited context but accurate dependency modeling, performs as a short-term memory, while neural memory, due to its ability to memorize the data, acts as a long-term, more persistent memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They can further scale effectively to context window sizes larger than 2M, with higher accuracy in needle-in-haystack tasks compared to baselines.

Title: Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery

Authors: HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin
Subjects: cs.LG, cs.CV, cs.IT, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2501.00677
Pdf URL: https://arxiv.org/pdf/2501.00677
Copy Paste: [[2501.00677]] Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery(https://arxiv.org/abs/2501.00677)
Keywords: robust
Abstract: Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art methods on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

Title: IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently

Authors: Florian Dietz, Dietrich Klakow
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00684
Pdf URL: https://arxiv.org/pdf/2501.00684
Copy Paste: [[2501.00684]] IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently(https://arxiv.org/abs/2501.00684)
Keywords: large language model
Abstract: Solving arithmetic tasks is a simple and fundamental skill, yet modern Large Language Models (LLMs) have great difficulty with them. We introduce the Integrated Gated Calculator (IGC), a module that enables LLMs to perform arithmetic by emulating a calculator on the GPU. We finetune a Llama model with our module and test it on the BigBench Arithmetic benchmark, where it beats the State of the Art, outperforming all models on the benchmark, including models almost two orders of magnitude larger. Our approach takes only a single iteration to run and requires no external tools. It performs arithmetic operations entirely inside the LLM without the need to produce intermediate tokens. It is computationally efficient, interpretable, and avoids side-effects on tasks that do not require arithmetic operations. It reliably achieves 98% to 99% accuracy across multiple training runs and for all subtasks, including the substantially harder subtask of multiplication, which was previously unsolved.

Title: Labels Generated by Large Language Model Helps Measuring People's Empathy in Vitro

Authors: Md Rakibul Hasan, Yue Yao, Md Zakir Hossain, Aneesh Krishna, Imre Rudas, Shafin Rahman, Tom Gedeon
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00691
Pdf URL: https://arxiv.org/pdf/2501.00691
Copy Paste: [[2501.00691]] Labels Generated by Large Language Model Helps Measuring People's Empathy in Vitro(https://arxiv.org/abs/2501.00691)
Keywords: large language model
Abstract: Large language models (LLMs) have revolutionised numerous fields, with LLM-as-a-service (LLMSaaS) having a strong generalisation ability that offers accessible solutions directly without the need for costly training. In contrast to the widely studied prompt engineering for task solving directly (in vivo), this paper explores its potential in in-vitro applications. These involve using LLMs to generate labels to help the supervised training of mainstream models by (1) noisy label correction and (2) training data augmentation with LLM-generated labels. In this paper, we evaluate this approach in the emerging field of empathy computing -- automating the prediction of psychological questionnaire outcomes from inputs like text sequences. Specifically, crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. By leveraging LLM-generated labels to train pre-trained language models (PLMs) like RoBERTa, we achieve statistically significant accuracy improvements over baselines, achieving a state-of-the-art Pearson correlation coefficient of 0.648 on NewsEmp benchmarks. In addition, we provide discussions of current challenges in empathy computing, data biases in training data, and evaluation metric selection. Code and LLM-generated data are available at this https URL (available once the paper is accepted).

Title: Adjoint sharding for very long context training of state space models

Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00692
Pdf URL: https://arxiv.org/pdf/2501.00692
Copy Paste: [[2501.00692]] Adjoint sharding for very long context training of state space models(https://arxiv.org/abs/2501.00692)
Keywords: extraction, large language model
Abstract: Despite very fast progress, efficiently training large language models (LLMs) in very long contexts remains challenging. Existing methods fall back to training LLMs with short contexts (a maximum of a few thousand tokens in training) and use inference time techniques when evaluating on long contexts (above 1M tokens context window at inference). As opposed to long-context-inference, training on very long context input prompts is quickly limited by GPU memory availability and by the prohibitively long training times it requires on state-of-the-art hardware. Meanwhile, many real-life applications require not only inference but also training/fine-tuning with long context on specific tasks. Such applications include, for example, augmenting the context with various sources of raw reference information for fact extraction, fact summarization, or fact reconciliation tasks. We propose adjoint sharding, a novel technique that comprises sharding gradient calculation during training to reduce memory requirements by orders of magnitude, making training on very long context computationally tractable. Adjoint sharding is based on the adjoint method and computes equivalent gradients to backpropagation. We also propose truncated adjoint sharding to speed up the algorithm while maintaining performance. We provide a distributed version and a parallel version of adjoint sharding to further speed up training. Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B parameter large language model on 1M context length training. This allows increasing the maximum context length during training or fine-tuning of a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.

Title: Knowledge-Guided Prompt Learning for Deepfake Facial Image Detection

Authors: Hao Wang, Cheng Deng, Zhidong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00700
Pdf URL: https://arxiv.org/pdf/2501.00700
Copy Paste: [[2501.00700]] Knowledge-Guided Prompt Learning for Deepfake Facial Image Detection(https://arxiv.org/abs/2501.00700)
Keywords: generative, large language model
Abstract: Recent generative models demonstrate impressive performance on synthesizing photographic images, making it hard for humans to distinguish them from pristine ones, especially for realistic-looking synthetic facial images. Previous works mostly focus on mining discriminative artifacts from vast amounts of visual data. However, they usually lack the exploration of prior knowledge and rarely pay attention to the domain shift between training categories (e.g., natural and indoor objects) and testing ones (e.g., fine-grained human facial images), resulting in unsatisfactory detection performance. To address these issues, we propose a novel knowledge-guided prompt learning method for deepfake facial image detection. Specifically, we retrieve forgery-related prompts from large language models as expert knowledge to guide the optimization of learnable prompts. In addition, we develop test-time prompt tuning to alleviate the domain shift, achieving significant performance improvement and boosting the application in real-world scenarios. Extensive experiments on the DeepFakeFaceForensics dataset show that our proposed approach notably outperforms state-of-the-art methods.

Title: NN-ResDMD: Learning Koopman Representations for Complex Dynamics with Spectral Residuals

Authors: Yuanchao Xu, Kaidi Shao, Nikos Logothetis, Zhongwei Shen
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2501.00701
Pdf URL: https://arxiv.org/pdf/2501.00701
Copy Paste: [[2501.00701]] NN-ResDMD: Learning Koopman Representations for Complex Dynamics with Spectral Residuals(https://arxiv.org/abs/2501.00701)
Keywords: robust
Abstract: Analyzing long-term behaviors in high-dimensional nonlinear dynamical systems remains a significant challenge. The Koopman operator framework has emerged as a powerful tool to address this issue by providing a globally linear perspective on nonlinear dynamics. However, existing methods for approximating the Koopman operator and its spectral components, particularly in large-scale systems, often lack robust theoretical guarantees. Residual Dynamic Mode Decomposition (ResDMD) introduces a spectral residual measure to assess the convergence of the estimated Koopman spectrum, which helps filter out spurious spectral components. Nevertheless, it depends on pre-computed spectra, thereby inheriting their inaccuracies. To overcome its limitations, we introduce the Neural Network-ResDMD (NN-ResDMD), a method that directly estimates Koopman spectral components by minimizing the spectral residual. By leveraging neural networks, NN-ResDMD automatically identifies the optimal basis functions of the Koopman invariant subspace, eliminating the need for manual selection and improving the reliability of the analysis. Experiments on physical and biological systems demonstrate that NN-ResDMD significantly improves both accuracy and scalability, making it an effective tool for analyzing complex dynamical systems.

Title: Kolmogorov GAM Networks are all you need!

Authors: Sarah Polson, Vadim Sokolov
Subjects: cs.LG, stat.CO
Abstract URL: https://arxiv.org/abs/2501.00704
Pdf URL: https://arxiv.org/pdf/2501.00704
Copy Paste: [[2501.00704]] Kolmogorov GAM Networks are all you need!(https://arxiv.org/abs/2501.00704)
Keywords: transformer
Abstract: Kolmogorov GAM (K-GAM) networks are shown to be an efficient architecture for training and inference. They are an additive model with an embedding that is independent of the function of interest. They provide an alternative to the transformer architecture. They are the machine learning version of Kolmogorov's Superposition Theorem (KST), which provides efficient representations of a multivariate function. Such representations have use in machine learning for encoding dictionaries (a.k.a. "look-up" tables). KST theory also provides a representation based on translates of the Köppen function. The goal of our paper is to interpret this representation in a machine learning context for applications in Artificial Intelligence (AI). Our architecture is equivalent to a topological embedding which is independent of the function together with an additive layer that uses a Generalized Additive Model (GAM). This provides a class of learning procedures with far fewer parameters than current deep learning algorithms. The implementation can be parallelized, which makes our algorithms computationally attractive. To illustrate our methodology, we use the Iris data from statistical learning. We also show that our additive model with non-linear embedding provides an alternative to transformer architectures, which from a statistical viewpoint are kernel smoothers. Additive KAN models therefore provide a natural alternative to transformers. Finally, we conclude with directions for future research.
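
For reference, the classical statement of Kolmogorov's Superposition Theorem is given below in its standard textbook form. The notation is not taken from the paper, which may use a refined variant such as the Köppen construction the abstract mentions. Any continuous function $f$ on $[0,1]^n$ can be written with continuous univariate outer functions $\Phi_q$ and inner functions $\psi_{q,p}$, where the inner functions do not depend on $f$:

```latex
f(x_1, \ldots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right)
```

The inner sums correspond to an embedding that is independent of the function, and the outer sum over $q$ to an additive layer, matching the two parts of the architecture the abstract describes.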

Title: Everywhere Attack: Attacking Locally and Globally to Boost Targeted Transferability

Authors: Hui Zeng, Sanshuai Cui, Biwei Chen, Anjie Peng
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2501.00707
Pdf URL: https://arxiv.org/pdf/2501.00707
Copy Paste: [[2501.00707]] Everywhere Attack: Attacking Locally and Globally to Boost Targeted Transferability(https://arxiv.org/abs/2501.00707)
Keywords: attack
Abstract: Adversarial examples' (AE) transferability refers to the phenomenon that AEs crafted with one surrogate model can also fool other models. Notwithstanding remarkable progress in untargeted transferability, its targeted counterpart remains challenging. This paper proposes an everywhere scheme to boost targeted transferability. Our idea is to attack a victim image both globally and locally. We aim to optimize 'an army of targets' in every local image region, in contrast to previous works that optimize a high-confidence target in the image. Specifically, we split a victim image into non-overlapping blocks and jointly mount a targeted attack on each block. Such a strategy mitigates transfer failures caused by attention inconsistency between surrogate and victim models and thus results in stronger transferability. Our approach is method-agnostic, which means it can be easily combined with existing transferable attacks for even higher transferability. Extensive experiments on ImageNet demonstrate that the proposed approach universally improves the state-of-the-art targeted attacks by a clear margin, e.g., the transferability of the widely adopted Logit attack can be improved by 28.8%-300%. We also evaluate the crafted AEs on a real-world platform: Google Cloud Vision. Results further support the superiority of the proposed method.
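
The partitioning step alone can be sketched as follows. This is an illustration under assumptions: the function name, the square grid, and the divisibility requirement are hypothetical choices, the abstract does not state the block count, and the attack objective itself is not shown.

```python
import numpy as np


def split_into_blocks(image, blocks_per_side):
    """Split an (H, W, C) image into a grid of non-overlapping blocks.

    Returns an array of shape (blocks_per_side**2, H/b, W/b, C), ordered
    row by row across the grid.
    """
    h, w, c = image.shape
    if h % blocks_per_side or w % blocks_per_side:
        raise ValueError("image height and width must be divisible by blocks_per_side")
    bh, bw = h // blocks_per_side, w // blocks_per_side
    # Expose the grid axes, then bring them to the front.
    grid = image.reshape(blocks_per_side, bh, blocks_per_side, bw, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, bh, bw, c)


image = np.arange(4 * 4 * 1).reshape(4, 4, 1)
blocks = split_into_blocks(image, 2)
print(blocks.shape)
```

Each block could then be passed to a per-block loss, with the per-block losses combined into a joint objective; how they are combined is left open here.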

Title: KAN KAN Buff Signed Graph Neural Networks?

Authors: Muhieddine Shebaro, Jelena Tešić
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00709
Pdf URL: https://arxiv.org/pdf/2501.00709
Copy Paste: [[2501.00709]] KAN KAN Buff Signed Graph Neural Networks?(https://arxiv.org/abs/2501.00709)
Keywords: interpretability
Abstract: Graph Representation Learning aims to create embeddings for nodes and edges, capturing their features and interconnections. Graph Neural Networks (GNNs) have excelled in this task, leveraging neural networks to model complex graph relationships. Recently, the Kolmogorov-Arnold Neural Network (KAN) emerged as an alternative to Multi-Layer Perceptron (MLP), showing improved accuracy and interpretability with fewer parameters. While KANs have been integrated into unsigned GNNs, their application in signed GNNs remains unexplored. This paper integrates KAN into Signed Graph Convolutional Networks (SGCNs) to evaluate its performance on signed graphs where edges have positive or negative signs. We empirically assess KAN-enhanced SGCNs (KASGCN) on downstream tasks such as signed community detection and link sign prediction to enhance the embedding quality in signed networks. Considering the variability in the results indicated by the relatively large standard deviation, KASGCN demonstrates competitive performance with, or similar to, the vanilla SGCN in the evaluated downstream tasks, and its effectiveness is context-dependent (it varies with the signed graph, the parameters, etc.).

Title: Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding

Authors: Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00712
Pdf URL: https://arxiv.org/pdf/2501.00712
Copy Paste: [[2501.00712]] Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding(https://arxiv.org/abs/2501.00712)
Keywords: robust, transformer
Abstract: Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con$\textbf{T}$extualized equivari$\textbf{A}$nt $\textbf{P}$osition $\textbf{E}$mbedding ($\textbf{TAPE}$), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.

Title: CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages

Authors: Michael Bennie, Bushi Xiao, Chryseis Xinyi Liu, Demi Zhang, Jian Meng, Alayo Tripp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00713
Pdf URL: https://arxiv.org/pdf/2501.00713
Copy Paste: [[2501.00713]] CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages(https://arxiv.org/abs/2501.00713)
Keywords: robust
Abstract: This paper introduces a context-aware model for robust counterspeech generation, which achieved significant success in the MCG-COLING-2025 shared task. Our approach particularly excelled in low-resource language settings. By leveraging a simulated annealing algorithm fine-tuned on multilingual datasets, the model generates factually accurate responses to hate speech. We demonstrate state-of-the-art performance across four languages (Basque, English, Italian, and Spanish), with our system ranking first for Basque, second for Italian, and third for both English and Spanish. Notably, our model swept all three top positions for Basque, highlighting its effectiveness in low-resource scenarios. Evaluation of the shared task employs both traditional metrics (BLEU, ROUGE, BERTScore, Novelty) and JudgeLM based on LLM. We present a detailed analysis of our results, including an empirical evaluation of the model performance and comprehensive score distributions across evaluation metrics. This work contributes to the growing body of research on multilingual counterspeech generation, offering insights into developing robust models that can adapt to diverse linguistic and cultural contexts in the fight against online hate speech.

Title: DDD: Discriminative Difficulty Distance for plant disease diagnosis

Authors: Yuji Arima, Satoshi Kagiwada, Hitoshi Iyatomi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00734
Pdf URL: https://arxiv.org/pdf/2501.00734
Copy Paste: [[2501.00734]] DDD: Discriminative Difficulty Distance for plant disease diagnosis(https://arxiv.org/abs/2501.00734)
Keywords: robust
Abstract: Recent studies on plant disease diagnosis using machine learning (ML) have highlighted concerns about the overestimated diagnostic performance due to inappropriate data partitioning, where training and test datasets are derived from the same source (domain). Plant disease diagnosis presents a challenging classification task, characterized by its fine-grained nature, vague symptoms, and the extensive variability of image features within each domain. In this study, we propose the concept of Discriminative Difficulty Distance (DDD), a novel metric designed to quantify the domain gap between training and test datasets while assessing the classification difficulty of test data. DDD provides a valuable tool for identifying insufficient diversity in training data, thus supporting the development of more diverse and robust datasets. We investigated multiple image encoders trained on different datasets and examined whether the distances between datasets, measured using low-dimensional representations generated by the encoders, are suitable as a DDD metric. The study utilized 244,063 plant disease images spanning four crops and 34 disease classes collected from 27 domains. As a result, we demonstrated that even if the test images are from different crops or diseases than those used to train the encoder, incorporating them allows the construction of a distance measure for a dataset that strongly correlates with the difficulty of diagnosis indicated by the disease classifier developed independently. Compared to the base encoder, pre-trained only on ImageNet21K, the correlation is higher by 0.106 to 0.485, reaching a maximum of 0.909.

Title: RORem: Training a Robust Object Remover with Human-in-the-Loop

Authors: Ruibin Li, Tao Yang, Song Guo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00740
Pdf URL: https://arxiv.org/pdf/2501.00740
Copy Paste: [[2501.00740]] RORem: Training a Robust Object Remover with Human-in-the-Loop(https://arxiv.org/abs/2501.00740)
Keywords: robust, diffusion
Abstract: Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods by more than 18%. The dataset, source code and trained model are available at this https URL.

Title: Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

Authors: Xiyang Hu
Subjects: cs.CL, cs.AI, cs.GT, cs.IR, econ.TH
Abstract URL: https://arxiv.org/abs/2501.00745
Pdf URL: https://arxiv.org/pdf/2501.00745
Copy Paste: [[2501.00745]] Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines(https://arxiv.org/abs/2501.00745)
Keywords: security, defense, attack, fair, large language model
Abstract: The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
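
To make the "forward-looking players sustain cooperation" claim concrete, the sketch below works through the textbook infinitely repeated Prisoner's Dilemma with a grim trigger strategy. It is a generic illustration, not the paper's model (which also involves attack costs and attack success rates); the payoff names `temptation`, `reward`, and `punishment` and the example values are assumptions.

```python
def grim_trigger_threshold(temptation: float, reward: float, punishment: float) -> float:
    """Smallest discount factor at which grim trigger sustains cooperation.

    Cooperating forever yields reward / (1 - delta). Deviating once yields
    temptation now, then punishment forever: temptation + delta * punishment / (1 - delta).
    Setting the first >= the second gives delta >= (T - R) / (T - P).
    """
    if not temptation > reward > punishment:
        raise ValueError("expected temptation > reward > punishment")
    return (temptation - reward) / (temptation - punishment)


def cooperation_sustainable(
    delta: float, temptation: float, reward: float, punishment: float
) -> bool:
    """Compare discounted payoffs of cooperating forever vs. deviating once."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    cooperate = reward / (1.0 - delta)
    deviate = temptation + delta * punishment / (1.0 - delta)
    return cooperate >= deviate


# Hypothetical payoffs: attacking alone pays 5, mutual restraint pays 3,
# mutual attack pays 1.
threshold = grim_trigger_threshold(5.0, 3.0, 1.0)
```

With these example payoffs the threshold is 0.5, so a patient player (say, `delta = 0.6`) prefers to keep cooperating while an impatient one (`delta = 0.4`) prefers to attack.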

Title: DIVE: Diversified Iterative Self-Improvement

Authors: Yiwei Qin, Yixiu Liu, Pengfei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00747
Pdf URL: https://arxiv.org/pdf/2501.00747
Copy Paste: [[2501.00747]] DIVE: Diversified Iterative Self-Improvement(https://arxiv.org/abs/2501.00747)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have demonstrated the effectiveness of Iterative Self-Improvement (ISI) techniques. However, continuous training on self-generated data leads to reduced output diversity, a limitation particularly critical in reasoning tasks where diverse solution paths are essential. We present DIVE (Diversified Iterative Self-Improvement), a novel framework that addresses this challenge through two key components: Sample Pool Expansion for broader solution exploration, and Data Selection for balancing diversity and quality in preference pairs. Experiments on MATH and GSM8k datasets show that DIVE achieves a 10% to 45% relative increase in output diversity metrics while maintaining performance quality compared to vanilla ISI. Our ablation studies confirm both components' significance in achieving these improvements. Code is available at this https URL.

Title: Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation

Authors: Suho Park, SuBeen Lee, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00752
Pdf URL: https://arxiv.org/pdf/2501.00752
Copy Paste: [[2501.00752]] Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation(https://arxiv.org/abs/2501.00752)
Keywords: segmentation
Abstract: We propose Foreground-Covering Prototype Generation and Matching to resolve Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performance on various datasets validates the effectiveness of the proposed method for FSS. Our official code is available at this https URL

Title: Beyond Static Datasets: A Behavior-Driven Entity-Specific Simulation to Overcome Data Scarcity and Train Effective Crypto Anti-Money Laundering Models

Authors: Dinesh Srivasthav P, Manoj Apte
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00757
Pdf URL: https://arxiv.org/pdf/2501.00757
Copy Paste: [[2501.00757]] Beyond Static Datasets: A Behavior-Driven Entity-Specific Simulation to Overcome Data Scarcity and Train Effective Crypto Anti-Money Laundering Models(https://arxiv.org/abs/2501.00757)
Keywords: privacy
Abstract: For various reasons, ranging from inherent characteristics and features providing decentralization, enhanced privacy, ease of transactions, etc., to implied external hardships in enforcing regulations, contradictions in data sharing policies, etc., cryptocurrencies have been severely abused for carrying out numerous malicious and illicit activities including money laundering, darknet transactions, scams, terrorism financing, and arms trading. However, money laundering is a key crime to mitigate, since doing so also suspends the movement of funds from other illicit activities. Billions of dollars are laundered annually. It is getting extremely difficult to identify money laundering in crypto transactions owing to the many layering strategies available today and the rapidly evolving tactics and patterns that launderers use to obfuscate illicit funds. Many detection methods have been proposed, ranging from naive approaches involving complete manual investigation to machine learning models. However, there are very limited datasets available for effectively training machine learning models. Also, the existing datasets are static and class-imbalanced, posing challenges for scalability and suitability to specific scenarios, due to lack of customization to varying requirements. This has been a persistent challenge in the literature. In this paper, we propose a behavior-embedded, entity-specific money laundering-like transaction simulation that helps in generating various transaction types and models the transactions by embedding the behavior of several entities observed in this space. The paper discusses the design and architecture of the simulator, a custom dataset we generated using the simulator, and the performance of models trained on this synthetic data in detecting real addresses involved in money laundering.

Title: Less is More: Token Context-aware Learning for Object Tracking

Authors: Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, Shuxiang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00758
Pdf URL: https://arxiv.org/pdf/2501.00758
Copy Paste: [[2501.00758]] Less is More: Token Context-aware Learning for Object Tracking(https://arxiv.org/abs/2501.00758)
Keywords: robust
Abstract: Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.

Title: Enhancing Transformers for Generalizable First-Order Logical Entailment

Authors: Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00759
Pdf URL: https://arxiv.org/pdf/2501.00759
Copy Paste: [[2501.00759]] Enhancing Transformers for Generalizable First-Order Logical Entailment(https://arxiv.org/abs/2501.00759)
Keywords: transformer
Abstract: Transformers, as a fundamental deep learning architecture, have demonstrated remarkable capabilities in reasoning. This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and explores ways to improve it. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) the unseen knowledge and query settings discussed in the task of knowledge graph query answering, enabling a characterization of fine-grained generalizability. Results on our comprehensive dataset show that transformers outperform previous methods specifically designed for this task and provide detailed empirical evidence on the impact of input query syntax, token embedding, and transformer architectures on the reasoning capability of transformers. Interestingly, our findings reveal a mismatch between positional encoding and other design choices in transformer architectures employed in prior practices. This discovery motivates us to propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.

Title: Beyond Words: AuralLLM and SignMST-C for Precise Sign Language Production and Bidirectional Accessibility

Authors: Yulong Li, Yuxuan Zhang, Feilong Tang, Mian Zhou, Zhixiang Lu, Haochen Xue, Yifang Wang, Kang Dang, Jionglong Su
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00765
Pdf URL: https://arxiv.org/pdf/2501.00765
Copy Paste: [[2501.00765]] Beyond Words: AuralLLM and SignMST-C for Precise Sign Language Production and Bidirectional Accessibility(https://arxiv.org/abs/2501.00765)
Keywords: robust
Abstract: Although sign language recognition aids non-hearing-impaired understanding, many hearing-impaired individuals still rely on sign language alone due to limited literacy, underscoring the need for advanced sign language production and translation (SLP and SLT) systems. In the field of sign language production, the lack of adequate models and datasets restricts practical applications. Existing models face challenges in production accuracy and pose control, making it difficult to provide fluent sign language expressions across diverse scenarios. Additionally, data resources are scarce, particularly high-quality datasets with complete sign vocabulary and pose annotations. To address these issues, we introduce CNText2Sign and CNSign, comprehensive datasets to benchmark SLP and SLT, respectively, with CNText2Sign covering gloss and landmark mappings for SLP, and CNSign providing extensive video-to-text data for SLT. To improve the accuracy and applicability of sign language systems, we propose the AuraLLM and SignMST-C models. AuraLLM, incorporating LoRA and RAG techniques, achieves a BLEU-4 score of 50.41 on the CNText2Sign dataset, enabling precise control over gesture semantics and motion. SignMST-C employs self-supervised rapid motion video pretraining, achieving a BLEU-4 score of 31.03/32.08 on the PHOENIX2014-T benchmark, setting a new state-of-the-art. These models establish robust baselines for the datasets released for their respective tasks.

Title: Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive Experiments, Analysis, and Improvements

Authors: Haoyang Li, Yuming Xu, Chen Jason Zhang, Alexander Zhou, Lei Chen, Qing Li
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2501.00773
Pdf URL: https://arxiv.org/pdf/2501.00773
Copy Paste: [[2501.00773]] Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive Experiments, Analysis, and Improvements(https://arxiv.org/abs/2501.00773)
Keywords: robust
Abstract: Graphs are essential data structures for modeling complex interactions in domains such as social networks, molecular structures, and biological systems. Graph-level tasks, which predict properties or classes for the entire graph, are critical for applications, such as molecular property prediction and subgraph counting. Graph Neural Networks (GNNs) have shown promise in these tasks, but their evaluations are often limited to narrow datasets, tasks, and inconsistent experimental setups, restricting their generalizability. To address these limitations, we propose a unified evaluation framework for graph-level GNNs. This framework provides a standardized setting to evaluate GNNs across diverse datasets, various graph tasks (e.g., graph classification and regression), and challenging scenarios, including noisy, imbalanced, and few-shot graphs. Additionally, we propose a novel GNN model with enhanced expressivity and generalization capabilities. Specifically, we enhance the expressivity of GNNs through a $k$-path rooted subgraph approach, enabling the model to effectively count subgraphs (e.g., paths and cycles). Moreover, we introduce a unified graph contrastive learning algorithm for graphs across diverse domains, which adaptively removes unimportant edges to augment graphs, thereby significantly improving generalization performance. Extensive experiments demonstrate that our model achieves superior performance against fourteen effective baselines across twenty-seven graph datasets, establishing it as a robust and generalizable model for graph-level tasks.

Title: FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation

Authors: Qianli Wang, Nils Feldhus, Simon Ostermann, Luis Felipe Villa-Arenas, Sebastian Möller, Vera Schmitt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00777
Pdf URL: https://arxiv.org/pdf/2501.00777
Copy Paste: [[2501.00777]] FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation(https://arxiv.org/abs/2501.00777)
Keywords: large language model
Abstract: Counterfactual examples are widely used in natural language processing (NLP) as valuable data to improve models, and in explainable artificial intelligence (XAI) to understand model behavior. The automated generation of counterfactual examples remains a challenging task even for large language models (LLMs), despite their impressive performance on many tasks. In this paper, we first introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, we present a new framework, FitCF, which further verifies aforementioned counterfactuals by label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming two state-of-the-art baselines. Through ablation studies, we identify the importance of each of FitCF's core components in improving the quality of counterfactuals, as assessed through flip rate, perplexity, and similarity measures. Furthermore, we show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance. Finally, we reveal a strong correlation between the faithfulness of feature attribution scores and the quality of generated counterfactuals.
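
The label flip verification and flip rate mentioned in this abstract can be sketched as follows. This is a minimal sketch of the general idea (a counterfactual counts as valid when the classifier's prediction changes), not FitCF's code; `verify_label_flips`, `flip_rate`, the keyword-based `toy_classifier`, and the example pairs are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (original text, candidate counterfactual)


def verify_label_flips(
    pairs: Sequence[Pair], classify: Callable[[str], str]
) -> List[Pair]:
    """Keep only pairs where the predicted label changes between the two texts."""
    return [(orig, cf) for orig, cf in pairs if classify(orig) != classify(cf)]


def flip_rate(pairs: Sequence[Pair], classify: Callable[[str], str]) -> float:
    """Fraction of candidate counterfactuals that flip the predicted label."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return len(verify_label_flips(pairs, classify)) / len(pairs)


def toy_classifier(text: str) -> str:
    """Stand-in for a real model: 'positive' if the word 'good' appears."""
    return "positive" if "good" in text.lower().split() else "negative"


PAIRS: List[Pair] = [
    ("a good movie", "a bad movie"),            # flips: positive -> negative
    ("good acting", "good directing"),          # no flip: both positive
    ("the plot is good", "the plot is dull"),   # flips: positive -> negative
    ("bad pacing", "good pacing"),              # flips: negative -> positive
]

verified = verify_label_flips(PAIRS, toy_classifier)
rate = flip_rate(PAIRS, toy_classifier)
```

In a few-shot setup like the one the abstract describes, the `verified` pairs would be the ones eligible to serve as demonstrations, while `rate` summarizes how often generation succeeded.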

Title: Navigating Nuance: In Quest for Political Truth

Authors: Soumyadeep Sar, Dwaipayan Roy
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.00782
Pdf URL: https://arxiv.org/pdf/2501.00782
Copy Paste: [[2501.00782]] Navigating Nuance: In Quest for Political Truth(https://arxiv.org/abs/2501.00782)
Keywords: robust
Abstract: This study investigates several nuanced rationales for countering the rise of political bias. We evaluate the performance of the Llama-3 (70B) language model on the Media Bias Identification Benchmark (MBIB), based on a novel prompting technique that incorporates subtle reasons for identifying political leaning. Our findings underscore the challenges of detecting political bias and highlight the potential of transfer learning methods to enhance future models. Through our framework, we achieve performance comparable to the supervised and fully fine-tuned ConvBERT model, which is the state-of-the-art model, performing best among other baseline models for the political bias task on MBIB. By demonstrating the effectiveness of our approach, we contribute to the development of more robust tools for mitigating the spread of misinformation and polarization. Our code and dataset are publicly available on GitHub.

Title: Shifting-Merging: Secure, High-Capacity and Efficient Steganography via Large Language Models

Authors: Minhao Bai, Jinshuai Yang, Kaiyi Pang, Yongfeng Huang, Yue Gao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.00786
Pdf URL: https://arxiv.org/pdf/2501.00786
Copy Paste: [[2501.00786]] Shifting-Merging: Secure, High-Capacity and Efficient Steganography via Large Language Models(https://arxiv.org/abs/2501.00786)
Keywords: secure, privacy, large language model
Abstract: In the face of escalating surveillance and censorship within cyberspace, the sanctity of personal privacy has come under siege, necessitating the development of steganography, which offers a way to securely hide messages within innocent-looking texts. Previous methods alter the texts to hide private messages, which is not secure. Large Language Models (LLMs) provide a high-quality, explicit distribution, which is a useful mathematical tool for secure steganography methods. However, existing attempts fail to achieve high capacity, time efficiency and correctness simultaneously, and their strongly coupled designs leave little room for refining them to achieve better performance. To provide a secure, high-capacity and efficient steganography method, we introduce ShiMer. Specifically, ShiMer pseudorandomly shifts the probability interval of the LLM's distribution to obtain a private distribution, and samples a token according to the private bits. ShiMer-produced steganographic texts are indistinguishable in quality from the normal texts directly generated by the language model. To further enhance the capacity of ShiMer, we design a reordering algorithm to minimize the occurrence of interval splitting during the decoding phase. Experimental results indicate that our method achieves the highest capacity and efficiency among existing secure steganography techniques.

Title: LENS-XAI: Redefining Lightweight and Explainable Network Security through Knowledge Distillation and Variational Autoencoders for Scalable Intrusion Detection in Cybersecurity

Authors: Muhammet Anil Yagiz, Polat Goktas
Subjects: cs.CR, cs.AI, cs.CY, cs.ET
Abstract URL: https://arxiv.org/abs/2501.00790
Pdf URL: https://arxiv.org/pdf/2501.00790
Copy Paste: [[2501.00790]] LENS-XAI: Redefining Lightweight and Explainable Network Security through Knowledge Distillation and Variational Autoencoders for Scalable Intrusion Detection in Cybersecurity(https://arxiv.org/abs/2501.00790)
Keywords: security, attack, robust, interpretability, explainability
Abstract: The rapid proliferation of Industrial Internet of Things (IIoT) systems necessitates advanced, interpretable, and scalable intrusion detection systems (IDS) to combat emerging cyber threats. Traditional IDS face challenges such as high computational demands, limited explainability, and inflexibility against evolving attack patterns. To address these limitations, this study introduces the Lightweight Explainable Network Security framework (LENS-XAI), which combines robust intrusion detection with enhanced interpretability and scalability. LENS-XAI integrates knowledge distillation, variational autoencoder models, and attribution-based explainability techniques to achieve high detection accuracy and transparency in decision-making. By leveraging a training set comprising 10% of the available data, the framework optimizes computational efficiency without sacrificing performance. Experimental evaluation on four benchmark datasets: Edge-IIoTset, UKM-IDS20, CTU-13, and NSL-KDD, demonstrates the framework's superior performance, achieving detection accuracies of 95.34%, 99.92%, 98.42%, and 99.34%, respectively. Additionally, the framework excels in reducing false positives and adapting to complex attack scenarios, outperforming existing state-of-the-art methods. Key strengths of LENS-XAI include its lightweight design, suitable for resource-constrained environments, and its scalability across diverse IIoT and cybersecurity contexts. Moreover, the explainability module enhances trust and transparency, critical for practical deployment in dynamic and sensitive applications. This research contributes significantly to advancing IDS by addressing computational efficiency, feature interpretability, and real-world applicability. Future work could focus on extending the framework to ensemble AI systems for distributed environments, further enhancing its robustness and adaptability.

Title: Multimodal Large Models Are Effective Action Anticipators

Authors: Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00795
Pdf URL: https://arxiv.org/pdf/2501.00795
Copy Paste: [[2501.00795]] Multimodal Large Models Are Effective Action Anticipators(https://arxiv.org/abs/2501.00795)
Keywords: robust, transformer, large language model
Abstract: The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation. Code is available at this https URL.

Title: Make Shuffling Great Again: A Side-Channel Resistant Fisher-Yates Algorithm for Protecting Neural Networks

Authors: Leonard Puškáč, Marek Benovič, Jakub Breier, Xiaolu Hou
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00798
Pdf URL: https://arxiv.org/pdf/2501.00798
Copy Paste: [[2501.00798]] Make Shuffling Great Again: A Side-Channel Resistant Fisher-Yates Algorithm for Protecting Neural Networks(https://arxiv.org/abs/2501.00798)
Keywords: secure, protect, attack
Abstract: Neural network models implemented in embedded devices have been shown to be susceptible to side-channel attacks (SCAs), allowing recovery of proprietary model parameters, such as weights and biases. Countermeasures already used to protect cryptographic implementations can be tailored to protect embedded neural network models. Shuffling, a hiding-based countermeasure that randomly shuffles the order of computations, was shown to be vulnerable to SCA when the Fisher-Yates algorithm is used. In this paper, we propose a design of an SCA-secure version of the Fisher-Yates algorithm. By integrating the masking technique for modular reduction and Blakely's method for modular multiplication, we effectively remove the vulnerability in the division operation that led to side-channel leakage in the original version of the algorithm. We experimentally show that the countermeasure is effective against SCA by implementing a correlation power analysis attack on an embedded neural network model implemented on ARM Cortex-M4. Compared to the original proposal, the memory overhead is 2x the size of the largest layer of the network, while the time overhead varies from 4% to 0.49% for a layer with 100 and 1000 neurons, respectively.
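
For orientation, here is a minimal sketch of the two textbook building blocks the abstract names: the plain (unprotected) Fisher-Yates shuffle, whose modular reduction is the division-type step said to leak, and Blakely's shift-and-add modular multiplication. This is not the paper's protected design. The function names and the `rand_word` random source are assumptions made for illustration, and Python arithmetic is not constant-time, so the code only illustrates the arithmetic.

```python
import secrets


def fisher_yates_shuffle(items, rand_word=lambda: secrets.randbits(32)):
    """Textbook Fisher-Yates shuffle (unprotected baseline).

    `rand_word` is a hypothetical stand-in for a device's random number source.
    """
    order = list(items)
    for i in range(len(order) - 1, 0, -1):
        # Reducing a random word modulo (i + 1) is the division-type step
        # that the abstract identifies as the source of leakage.
        # (A plain modulo also has a small bias; ignored in this sketch.)
        j = rand_word() % (i + 1)
        order[i], order[j] = order[j], order[i]
    return order


def blakely_modmul(a, b, n):
    """Compute (a * b) mod n with Blakely's interleaved shift-and-add method.

    Requires a >= 0 and 0 <= b < n. Only doublings, additions, comparisons
    and subtractions are used, so no division is needed.
    """
    if not (n > 0 and 0 <= b < n and a >= 0):
        raise ValueError("expected a >= 0 and 0 <= b < n")
    result = 0
    for bit_index in range(a.bit_length() - 1, -1, -1):
        result = 2 * result + ((a >> bit_index) & 1) * b
        # result < 3n at this point, so two conditional subtractions suffice.
        if result >= n:
            result -= n
        if result >= n:
            result -= n
    return result
```

For example, `fisher_yates_shuffle(range(10))` returns a random permutation of the ten indices, and `blakely_modmul(a, b, n)` agrees with `(a * b) % n` whenever `0 <= b < n`.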

Title: Reasoning-Oriented and Analogy-Based Methods for Locating and Editing in Zero-Shot Event-Relational Reasoning

Authors: Jingyao Tang, Lishuang Li, Liteng Mi, Haiming Wu, Hongbin Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00803
Pdf URL: https://arxiv.org/pdf/2501.00803
Copy Paste: [[2501.00803]] Reasoning-Oriented and Analogy-Based Methods for Locating and Editing in Zero-Shot Event-Relational Reasoning(https://arxiv.org/abs/2501.00803)
Keywords: interpretability
Abstract: Zero-shot event-relational reasoning is an important task in natural language processing, and existing methods jointly learn a variety of event-relational prefixes and inference-form prefixes to achieve such tasks. However, training prefixes consumes large computational resources and lacks interpretability. Additionally, learning various relational and inferential knowledge inefficiently exploits the connections between tasks. Therefore, we first propose a method for Reasoning-Oriented Locating and Editing (ROLE), which locates and edits the key modules of the language model for reasoning about event relations, enhancing interpretability and also resource-efficiently optimizing the reasoning ability. Subsequently, we propose a method for Analogy-Based Locating and Editing (ABLE), which efficiently exploits the similarities and differences between tasks to optimize the zero-shot reasoning capability. Experimental results show that ROLE improves interpretability and reasoning performance with reduced computational cost. ABLE achieves SOTA results in zero-shot reasoning.

Title: MixSA: Training-free Reference-based Sketch Extraction via Mixture-of-Self-Attention

Authors: Rui Yang, Xiaojun Wu, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00816
Pdf URL: https://arxiv.org/pdf/2501.00816
Copy Paste: [[2501.00816]] MixSA: Training-free Reference-based Sketch Extraction via Mixture-of-Self-Attention(https://arxiv.org/abs/2501.00816)
Keywords: extraction, diffusion
Abstract: Current sketch extraction methods either require extensive training or fail to capture a wide range of artistic styles, limiting their practical applicability and versatility. We introduce Mixture-of-Self-Attention (MixSA), a training-free sketch extraction method that leverages strong diffusion priors for enhanced sketch perception. At its core, MixSA employs a mixture-of-self-attention technique, which manipulates self-attention layers by substituting the keys and values with those from reference sketches. This allows for the seamless integration of brushstroke elements into initial outline images, offering precise control over texture density and enabling interpolation between styles to create novel, unseen styles. By aligning brushstroke styles with the texture and contours of colored images, particularly in late decoder layers handling local textures, MixSA addresses the common issue of color averaging by adjusting initial outlines. Evaluated with various perceptual metrics, MixSA demonstrates superior performance in sketch quality, flexibility, and applicability. This approach not only overcomes the limitations of existing methods but also empowers users to generate diverse, high-fidelity sketches that more accurately reflect a wide range of artistic expressions.
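
The key/value substitution described above can be sketched with plain scaled dot-product attention: queries come from the content image's features, while keys and values are taken from the reference sketch. This is a simplified illustration, assuming single-head attention with no learned projections. The function names are hypothetical, and the actual MixSA mixing strategy inside a diffusion model may differ.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def mixed_self_attention(content_q, reference_k, reference_v):
    """Queries from the content features; keys and values from the reference."""
    return attention(content_q, reference_k, reference_v)
```

Each output row is a convex combination of reference value vectors, which is why the result inherits the reference's stroke features while the content queries decide where each feature lands.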

Title: Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention

Authors: Zhenyu Guo, Wenguang Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00823
Pdf URL: https://arxiv.org/pdf/2501.00823
Copy Paste: [[2501.00823]] Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention(https://arxiv.org/abs/2501.00823)
Keywords: interpretability, transformer
Abstract: Transformers have achieved remarkable success across diverse domains, but their monolithic architecture presents challenges in interpretability, adaptability, and scalability. This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning through a generalized cross-attention mechanism to a shared knowledge base, specifically designed for effective knowledge retrieval. Critically, we provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case (a closure) of this generalized cross-attention, revealing its role in implicit knowledge retrieval and validating our design. This theoretical framework provides a new lens for understanding FFNs and lays the foundation for future research exploring enhanced interpretability, adaptability, and scalability, enabling richer interplay with external knowledge bases and other systems.
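
One common way to see the FFN-as-attention claim is the key-value reading of a two-layer FFN: rows of the first weight matrix act as keys, columns of the second act as values, and the activation takes the place of the softmax. The sketch below checks this equivalence numerically. It is an illustration under these assumptions (ReLU activation, hypothetical function names), not the paper's derivation, which may be more general.

```python
def relu(x):
    return max(0.0, x)


def ffn(x, w1, b1, w2, b2):
    """Standard two-layer FFN: w2 @ relu(w1 @ x + b1) + b2."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def ffn_as_attention(x, w1, b1, w2, b2):
    """The same computation written as attention over fixed key/value slots."""
    keys = w1                                  # one key vector per hidden unit
    values = [list(col) for col in zip(*w2)]   # one value vector per hidden unit
    # Attention weights: activation(query . key + bias) instead of a softmax.
    weights = [relu(sum(k * xi for k, xi in zip(key, x)) + b) for key, b in zip(keys, b1)]
    out = list(b2)
    for weight, value in zip(weights, values):
        for d, v in enumerate(value):
            out[d] += weight * v
    return out
```

Here `w1` has shape hidden x input and `w2` has shape output x hidden, so the number of key/value slots equals the hidden width.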

Title: Information Sifting Funnel: Privacy-preserving Collaborative Inference Against Model Inversion Attacks

Authors: Rongke Liu
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2501.00824
Pdf URL: https://arxiv.org/pdf/2501.00824
Copy Paste: [[2501.00824]] Information Sifting Funnel: Privacy-preserving Collaborative Inference Against Model Inversion Attacks(https://arxiv.org/abs/2501.00824)
Keywords: privacy, protect, defense, attack, extraction
Abstract: The complexity of neural networks and inference tasks, coupled with demands for computational efficiency and real-time feedback, poses significant challenges for resource-constrained edge devices. Collaborative inference mitigates this by assigning shallow feature extraction to edge devices and offloading features to the cloud for further inference, reducing computational load. However, transmitted features remain susceptible to model inversion attacks (MIAs), which can reconstruct original input data. Current defenses, such as perturbation and information bottleneck techniques, offer explainable protection but face limitations, including the lack of standardized criteria for assessing MIA difficulty, challenges in mutual information estimation, and trade-offs among usability, privacy, and deployability. To address these challenges, we introduce the first criterion to evaluate MIA difficulty in collaborative inference, supported by theoretical analysis of existing attacks and defenses, validated using experiments with the Mutual Information Neural Estimator (MINE). Based on these findings, we propose SiftFunnel, a privacy-preserving framework for collaborative inference. The edge model is trained with linear and non-linear correlation constraints to reduce redundant information in transmitted features, enhancing privacy protection. Label smoothing and a cloud-based upsampling module are added to balance usability and privacy. To improve deployability, the edge model incorporates a funnel-shaped structure and attention mechanisms, preserving both privacy and usability. Extensive experiments demonstrate that SiftFunnel outperforms state-of-the-art defenses against MIAs, achieving superior privacy protection with less than 3% accuracy loss and striking an optimal balance among usability, privacy, and practicality.

Title: Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models

Authors: Benjamin Icard, Evangelia Zve, Lila Sainero, Alice Breton, Jean-Gabriel Ganascia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00828
Pdf URL: https://arxiv.org/pdf/2501.00828
Copy Paste: [[2501.00828]] Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models(https://arxiv.org/abs/2501.00828)
Keywords: interpretability, transformer
Abstract: This paper analyzes how writing style affects the dispersion of embedding vectors across multiple, state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.

Title: LLM+AL: Bridging Large Language Models and Action Languages for Complex Reasoning about Actions

Authors: Adam Ishay, Joohyung Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00830
Pdf URL: https://arxiv.org/pdf/2501.00830
Copy Paste: [[2501.00830]] LLM+AL: Bridging Large Language Models and Action Languages for Complex Reasoning about Actions(https://arxiv.org/abs/2501.00830)
Keywords: large language model
Abstract: Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we propose a method that bridges the natural language understanding capabilities of LLMs with the symbolic reasoning strengths of action languages. Our approach, termed "LLM+AL," leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's proficiency in automated reasoning based on encoded knowledge. We compare LLM+AL against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that, although all methods exhibit errors, LLM+AL, with relatively minimal human corrections, consistently leads to correct answers, whereas standalone LLMs fail to improve even with human feedback. LLM+AL also contributes to automated generation of action languages.

Title: Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation

Authors: Qianang Zhou, Junhui Hou, Meiyi Yang, Yongjian Deng, Youfu Li, Junlin Xiong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00838
Pdf URL: https://arxiv.org/pdf/2501.00838
Copy Paste: [[2501.00838]] Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation(https://arxiv.org/abs/2501.00838)
Keywords: robust, transformer
Abstract: Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4% accuracy gain and a 45% reduction in inference time.

Title: A Survey of Secure Semantic Communications

Authors: Rui Meng, Song Gao, Dayu Fan, Haixiao Gao, Yining Wang, Xiaodong Xu, Bizhu Wang, Suyu Lv, Zhidi Zhang, Mengying Sun, Shujun Han, Chen Dong, Xiaofeng Tao, Ping Zhang
Subjects: cs.CR, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2501.00842
Pdf URL: https://arxiv.org/pdf/2501.00842
Copy Paste: [[2501.00842]] A Survey of Secure Semantic Communications(https://arxiv.org/abs/2501.00842)
Keywords: secure, security, privacy, attack, robust
Abstract: Semantic communication (SemCom) is regarded as a promising and revolutionary technology in 6G, aiming to transcend the constraints of "Shannon's trap" by filtering out redundant information and extracting the core of effective data. Compared to traditional communication paradigms, SemCom offers several notable advantages, such as reducing the burden on data transmission, enhancing network management efficiency, and optimizing resource allocation. Numerous researchers have extensively explored SemCom from various perspectives, including network architecture, theoretical analysis, potential technologies, and future applications. However, as SemCom continues to evolve, a multitude of security and privacy concerns have arisen, posing threats to the confidentiality, integrity, and availability of SemCom systems. This paper presents a comprehensive survey of the technologies that can be utilized to secure SemCom. Firstly, we elaborate on the entire life cycle of SemCom, which includes the model training, model transfer, and semantic information transmission phases. Then, we identify the security and privacy issues that emerge during these three stages. Furthermore, we summarize the techniques available to mitigate these security and privacy threats, including data cleaning, robust learning, defensive strategies against backdoor attacks, adversarial training, differential privacy, cryptography, blockchain technology, model compression, and physical-layer security. Lastly, this paper outlines future research directions to guide researchers in related fields.

Title: Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation

Authors: Kun Li, George Vosselman, Michael Ying Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00851
Pdf URL: https://arxiv.org/pdf/2501.00851
Copy Paste: [[2501.00851]] Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation(https://arxiv.org/abs/2501.00851)
Keywords: transformer, segmentation
Abstract: The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. Recent advancements, particularly Transformer-based fusion designs, have demonstrated remarkable progress in this domain. However, existing methods primarily focus on refining visual features using language-aware guidance during the cross-modal fusion stage, neglecting the complementary vision-to-language flow. This limitation often leads to irrelevant or suboptimal representations. In addition, the diverse spatial scales of ground objects in aerial images pose significant challenges to the visual perception capabilities of existing models when conditioned on textual inputs. In this paper, we propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges for RRSIS. Specifically, we design a Bidirectional Alignment Module (BAM) with learnable query tokens to selectively and effectively represent visual and linguistic features, emphasizing regions associated with key tokens. BAM is further enhanced with a dynamic feature selection block, designed to provide both macro- and micro-level visual features, preserving global context and local details to facilitate more effective cross-modal interaction. Furthermore, SBANet incorporates a text-conditioned channel and spatial aggregator to bridge the gap between the encoder and decoder, enhancing cross-scale information exchange in complex aerial scenarios. Extensive experiments demonstrate that our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets, both quantitatively and qualitatively. The code will be released after publication.

Title: DiffETM: Diffusion Process Enhanced Embedded Topic Model

Authors: Wei Shao, Mingyang Liu, Linqi Song
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00862
Pdf URL: https://arxiv.org/pdf/2501.00862
Copy Paste: [[2501.00862]] DiffETM: Diffusion Process Enhanced Embedded Topic Model(https://arxiv.org/abs/2501.00862)
Keywords: diffusion
Abstract: The embedded topic model (ETM) is a widely used approach that assumes the sampled document-topic distribution conforms to the logistic normal distribution for easier optimization. However, this assumption oversimplifies the real document-topic distribution, limiting the model's performance. In response, we propose a novel method that introduces the diffusion process into the sampling process of document-topic distribution to overcome this limitation and maintain an easy optimization process. We validate our method through extensive experiments on two mainstream datasets, proving its effectiveness in improving topic modeling performance.

Title: Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation

Authors: Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00868
Pdf URL: https://arxiv.org/pdf/2501.00868
Copy Paste: [[2501.00868]] Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation(https://arxiv.org/abs/2501.00868)
Keywords: large language model
Abstract: Simultaneous generation models write generation results while reading streaming inputs, necessitating a policy-maker to determine the appropriate output timing. Existing simultaneous generation methods generally adopt the traditional encoder-decoder architecture and learn the generation and policy-making capabilities through complex dynamic programming techniques. Although LLMs excel at text generation, they face challenges in taking on the role of policy-makers through traditional training methods, limiting their exploration in simultaneous generation. To overcome these limitations, we propose a novel LLM-driven Simultaneous Generation (LSG) framework, which allows the off-the-shelf LLM to decide the generation timing and produce output concurrently. Specifically, LSG selects the generation policy that minimizes latency as the baseline policy. Referring to the baseline policy, LSG enables the LLM to devise an improved generation policy that better balances latency and generation quality, and writes generation results accordingly. Experiments on simultaneous translation and streaming automatic speech recognition tasks show that our method can achieve state-of-the-art performance utilizing the open-source LLMs and demonstrate practicality in real-world scenarios.

Title: Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation

Authors: Mingjia Li, Shuang Li, Tongrui Su, Longhui Yuan, Jian Liang, Wei Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00873
Pdf URL: https://arxiv.org/pdf/2501.00873
Copy Paste: [[2501.00873]] Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation(https://arxiv.org/abs/2501.00873)
Keywords: diffusion, generative
Abstract: Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at this https URL.

Title: LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

Authors: Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2501.00874
Pdf URL: https://arxiv.org/pdf/2501.00874
Copy Paste: [[2501.00874]] LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models(https://arxiv.org/abs/2501.00874)
Keywords: large language model
Abstract: Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances the multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.

Title: FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation

Authors: Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00877
Pdf URL: https://arxiv.org/pdf/2501.00877
Copy Paste: [[2501.00877]] FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2501.00877)
Keywords: segmentation
Abstract: Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel alignment loss to refine the coarse-grained alignment from CLIP, achieving finer-grained pixel-text semantic alignment. Additionally, to enrich category boundary information, we introduce the alignment matrices as optimizable pseudo-masks during forward propagation and propose a Category Information Supplementation module. These pseudo-masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel-level alignment and category boundary information, addressing key challenges in open-vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open-vocabulary semantic segmentation benchmarks.

Title: TrustRAG: Enhancing Robustness and Trustworthiness in RAG

Authors: Huichi Zhou, Kin-Hei Lee, Zhonghao Zhan, Yue Chen, Zhenhao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00879
Pdf URL: https://arxiv.org/pdf/2501.00879
Copy Paste: [[2501.00879]] TrustRAG: Enhancing Robustness and Trustworthiness in RAG(https://arxiv.org/abs/2501.00879)
Keywords: defense, attack, robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. However, these systems remain vulnerable to corpus poisoning attacks that can significantly degrade LLM performance through the injection of malicious content. To address these challenges, we propose TrustRAG, a robust framework that systematically filters compromised and irrelevant content before it reaches the language model. Our approach implements a two-stage defense mechanism: first, it employs K-means clustering to identify potential attack patterns in retrieved documents based on their semantic embeddings, effectively isolating suspicious content. Second, it leverages cosine similarity and ROUGE metrics to detect malicious documents while resolving discrepancies between the model's internal knowledge and external information through a self-assessment process. TrustRAG functions as a plug-and-play, training-free module that integrates seamlessly with any language model, whether open or closed-source, maintaining high contextual relevance while strengthening defenses against attacks. Through extensive experimental validation, we demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches across multiple model architectures and datasets. We have made TrustRAG available as open-source software at this https URL.
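
As a rough illustration of the clustering-based filtering idea, the sketch below runs plain K-means with K=2 on document embeddings and flags any cluster whose members are near-duplicates by mean pairwise cosine similarity. The flagging rule, the `threshold` value, and all function names are assumptions made for this example; the abstract does not specify TrustRAG's actual decision rule, and its ROUGE and self-assessment steps are not modeled here. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_means(points, iterations=20):
    """Plain K-means with K=2 and a deterministic farthest-point start."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min((0, 1), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def flag_suspicious(embeddings, threshold=0.95):
    """Return indices of documents in clusters that look like near-duplicates.

    Assumed rule (not from the paper): a cluster is suspicious when its mean
    pairwise cosine similarity exceeds `threshold`.
    """
    labels = two_means(embeddings)
    flagged = []
    for c in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        pairs = [(i, j) for i in idx for j in idx if i < j]
        if not pairs:
            continue
        mean_sim = sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
        if mean_sim > threshold:
            flagged.extend(idx)
    return sorted(flagged)
```

With three almost identical embeddings and two dissimilar ones, the function flags only the three near-duplicates, which mimics injected passages that were all optimized toward the same query.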

Title: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

Authors: Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00880
Pdf URL: https://arxiv.org/pdf/2501.00880
Copy Paste: [[2501.00880]] Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction(https://arxiv.org/abs/2501.00880)
Keywords: robust
Abstract: Employing LLMs for visual generation has recently become a research focus. However, the existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation.
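
One plausible reading of a cluster-oriented cross-entropy is to sum the predicted token probabilities inside the target token's cluster and penalize the negative log of that sum. The sketch below implements this reading next to the ordinary token-level loss. The exact loss used by IAR is not given in the abstract, so the formulation, the `cluster_of` mapping, and the function names are all assumptions.

```python
import math


def log_softmax(logits):
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def token_cross_entropy(logits, target):
    """Ordinary cross-entropy on the exact token index."""
    return -log_softmax(logits)[target]


def cluster_cross_entropy(logits, target, cluster_of):
    """Negative log of the total probability assigned to the target's cluster.

    `cluster_of[i]` is the (hypothetical) cluster id of codebook token i.
    """
    log_probs = log_softmax(logits)
    target_cluster = cluster_of[target]
    cluster_prob = sum(
        math.exp(lp) for lp, c in zip(log_probs, cluster_of) if c == target_cluster
    )
    return -math.log(cluster_prob)
```

Under this formulation a prediction that picks the wrong token inside the right cluster is penalized far less than one that lands in a different cluster, which matches the motivation given in the abstract.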

Title: FullTransNet: Full Transformer with Local-Global Attention for Video Summarization

Authors: Libin Lan, Lu Jiang, Tianshu Yu, Xiaojuan Liu, Zhongshi He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00882
Pdf URL: https://arxiv.org/pdf/2501.00882
Copy Paste: [[2501.00882]] FullTransNet: Full Transformer with Local-Global Attention for Video Summarization(https://arxiv.org/abs/2501.00882)
Keywords: transformer
Abstract: Video summarization mainly aims to produce a compact, short, informative, and representative synopsis of raw videos, which is of great importance for browsing, analyzing, and understanding video content. Dominant video summarization approaches are generally based on recurrent or convolutional neural networks or, more recently, encoder-only transformers. We propose using a full transformer as an alternative architecture to perform video summarization. The full transformer with an encoder-decoder structure, specifically designed for handling sequence transduction problems, is naturally suitable for video summarization tasks. This work considers supervised video summarization and casts it as a sequence-to-sequence learning problem. Our key idea is to directly apply the full transformer to the video summarization task, which is intuitively sound and effective. Also, considering the efficiency problem, we replace full attention with the combination of local and global sparse attention, which enables modeling long-range dependencies while reducing computational costs. Based on this, we propose a transformer-like architecture, named FullTransNet, which has a full encoder-decoder structure with local-global sparse attention for video summarization. Specifically, both the encoder and decoder in FullTransNet are stacked the same way as those in the vanilla transformer, and the local-global sparse attention is used only at the encoder side. Extensive experiments on two public multimedia benchmark datasets, SumMe and TVSum, demonstrate that our proposed model can outperform other video summarization approaches, achieving F-Measures of 54.4% on SumMe and 63.9% on TVSum with relatively lower compute and memory requirements, verifying its effectiveness and efficiency. The code and models are publicly available on GitHub.
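Illustrative sketch (not from the paper): a common way to combine local and global sparse attention is a boolean mask in which each position attends to neighbours within a fixed window, and a few designated global positions attend to, and are attended by, every position. The window size, the choice of global positions, and the function name are assumptions; the paper's actual pattern may differ.

```python
def local_global_mask(seq_len, window, global_positions):
    # mask[i][j] is True when query position i may attend to key position j.
    g = set(global_positions)
    return [
        [abs(i - j) <= window or i in g or j in g for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Hypothetical setting: 8 frames, a window of 1 on each side, frame 0 global.
mask = local_global_mask(seq_len=8, window=1, global_positions=[0])
allowed = sum(sum(row) for row in mask)
full = 8 * 8
```

Comparing `allowed` with `full` shows how many query-key pairs the sparse pattern keeps relative to full attention; the saving grows with sequence length because the local band scales linearly rather than quadratically.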

Title: Diversity Optimization for Travelling Salesman Problem via Deep Reinforcement Learning

Authors: Qi Li, Zhiguang Cao, Yining Ma, Yaoxin Wu, Yue-Jiao Gong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00884
Pdf URL: https://arxiv.org/pdf/2501.00884
Copy Paste: [[2501.00884]] Diversity Optimization for Travelling Salesman Problem via Deep Reinforcement Learning(https://arxiv.org/abs/2501.00884)
Keywords: robust
Abstract: Existing neural methods for the Travelling Salesman Problem (TSP) mostly aim at finding a single optimal solution. To discover diverse yet high-quality solutions for Multi-Solution TSP (MSTSP), we propose a novel deep reinforcement learning based neural solver, which is primarily featured by an encoder-decoder structured policy. Concretely, on the one hand, a Relativization Filter (RF) is designed to enhance the robustness of the encoder to affine transformations of the instances, so as to potentially improve the quality of the found solutions. On the other hand, a Multi-Attentive Adaptive Active Search (MA3S) is tailored to allow the decoders to strike a balance between the optimality and diversity. Experimental evaluations on benchmark instances demonstrate the superiority of our method over recent neural baselines across different metrics, and its competitive performance against state-of-the-art traditional heuristics with significantly reduced computational time, ranging from $1.3\times$ to $15\times$ faster. Furthermore, we demonstrate that our method can also be applied to the Capacitated Vehicle Routing Problem (CVRP).

Title: Representation in large language models

Authors: Cameron C. Yetman
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00885
Pdf URL: https://arxiv.org/pdf/2501.00885
Copy Paste: [[2501.00885]] Representation in large language models(https://arxiv.org/abs/2501.00885)
Keywords: large language model
Abstract: The extraordinary success of recent Large Language Models (LLMs) on a diverse array of tasks has led to an explosion of scientific and philosophical theorizing aimed at explaining how they do what they do. Unfortunately, disagreement over fundamental theoretical issues has led to stalemate, with entrenched camps of LLM optimists and pessimists often committed to very different views of how these systems work. Overcoming stalemate requires agreement on fundamental questions, and the goal of this paper is to address one such question, namely: is LLM behavior driven partly by representation-based information processing of the sort implicated in biological cognition, or is it driven entirely by processes of memorization and stochastic table look-up? This is a question about what kind of algorithm LLMs implement, and the answer carries serious implications for higher level questions about whether these systems have beliefs, intentions, concepts, knowledge, and understanding. I argue that LLM behavior is partially driven by representation-based information processing, and then I describe and defend a series of practical techniques for investigating these representations and developing explanations on their basis. The resulting account provides a groundwork for future theorizing about language models and their successors.

Title: Unfolding the Headline: Iterative Self-Questioning for News Retrieval and Timeline Summarization

Authors: Weiqi Wu, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.00888
Pdf URL: https://arxiv.org/pdf/2501.00888
Copy Paste: [[2501.00888]] Unfolding the Headline: Iterative Self-Questioning for News Retrieval and Timeline Summarization(https://arxiv.org/abs/2501.00888)
Keywords: large language model
Abstract: In the fast-changing realm of information, the capacity to construct coherent timelines from extensive event-related content has become increasingly significant and challenging. The complexity arises in aggregating related documents to build a meaningful event graph around a central topic. This paper proposes CHRONOS - Causal Headline Retrieval for Open-domain News Timeline SummarizatiOn via Iterative Self-Questioning, which offers a fresh perspective on the integration of Large Language Models (LLMs) to tackle the task of Timeline Summarization (TLS). By iteratively reflecting on how events are linked and posing new questions regarding a specific news topic to gather information online or from an offline knowledge base, LLMs produce and refresh chronological summaries based on documents retrieved in each round. Furthermore, we curate Open-TLS, a novel dataset of timelines on recent news topics authored by professional journalists to evaluate open-domain TLS where information overload makes it impossible to find comprehensive relevant documents from the web. Our experiments indicate that CHRONOS is not only adept at open-domain timeline summarization, but it also rivals the performance of existing state-of-the-art systems designed for closed-domain applications, where a related news corpus is provided for summarization.

Title: Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model

Authors: Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, Zhenwei Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00895
Pdf URL: https://arxiv.org/pdf/2501.00895
Copy Paste: [[2501.00895]] Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model(https://arxiv.org/abs/2501.00895)
Keywords: robust, diffusion, generative
Abstract: Generative foundation models have advanced large-scale text-driven natural image generation, becoming a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image-text datasets are small in scale and confined to specific geographic areas and scene types. Besides, existing text2image methods have struggled to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10 million image-text pairs, 5 times larger than the previous largest one. The dataset covers a wide range of geographic scenes and contains resolution information, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion parameter generative foundation model based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify image resolutions. A dynamic condition adaptation strategy is proposed for training and inference to improve image quality. Text2Earth excels in zero-shot text2image generation and demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This robust capability surpasses previous models restricted to the basic fixed size and limited scene types. On the previous benchmark dataset, Text2Earth outperforms previous models with an improvement of +26.23 FID and +20.95% Zero-shot Cls-OA. The project page is \url{this https URL}

Title: Population Aware Diffusion for Time Series Generation

Authors: Yang Li, Han Meng, Zhenyu Bi, Ingolv T. Urnes, Haipeng Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00910
Pdf URL: https://arxiv.org/pdf/2501.00910
Copy Paste: [[2501.00910]] Population Aware Diffusion for Time Series Generation(https://arxiv.org/abs/2501.00910)
Keywords: diffusion
Abstract: Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include 1) a new training method explicitly incorporating TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.
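Illustrative sketch (not from the paper): the population-level property described above can be made concrete by computing one cross-correlation (CC) value per series for a pair of dimensions, then comparing the resulting collection of values between real and synthetic data. Here Pearson correlation stands in for CC and the gap between mean CC values stands in for a distribution shift score; both choices, along with all names and the toy data, are assumptions and not the paper's metric.

```python
import math

def pearson(x, y):
    # Pearson correlation between two equal-length, non-constant sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cc_values(dataset, dim_a, dim_b):
    # dataset[k][d] is the sequence for dimension d of series k.
    # Returns one CC value per series: a sample from the population-level
    # CC distribution.
    return [pearson(series[dim_a], series[dim_b]) for series in dataset]

def mean_cc_gap(real, synthetic, dim_a, dim_b):
    # Crude stand-in for a distribution shift score: gap between mean CCs.
    r = cc_values(real, dim_a, dim_b)
    s = cc_values(synthetic, dim_a, dim_b)
    return abs(sum(r) / len(r) - sum(s) / len(s))

# Hypothetical toy data: one series each, two dimensions, three time steps.
real = [[[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]]        # dimensions move together
synthetic = [[[1.0, 2.0, 3.0], [6.0, 4.0, 2.0]]]   # dimensions move oppositely
gap = mean_cc_gap(real, synthetic, 0, 1)
```

Each synthetic series here could look realistic on its own, yet the relationship between the two dimensions is inverted across the population, which is the kind of shift an individual-level authenticity check would miss.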

Title: Aligning LLMs with Domain Invariant Reward Models

Authors: David Wu, Sanjiban Choudhury
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00911
Pdf URL: https://arxiv.org/pdf/2501.00911
Copy Paste: [[2501.00911]] Aligning LLMs with Domain Invariant Reward Models(https://arxiv.org/abs/2501.00911)
Keywords: large language model
Abstract: Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey \emph{domain-agnostic} concepts that can be effectively captured by a reward model. We propose \method, a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between source and target distribution, and a source loss that optimizes preferences on the source domain. We show \method is a general approach that we evaluate and analyze across 4 distinct settings: (1) Cross-lingual transfer (accuracy: $0.621 \rightarrow 0.661$), (2) Clean-to-noisy (accuracy: $0.671 \rightarrow 0.703$), (3) Few-shot-to-full transfer (accuracy: $0.845 \rightarrow 0.920$), and (4) Simple-to-complex tasks transfer (correlation: $0.508 \rightarrow 0.556$). Our code, models and data are available at \url{this https URL}.

Title: Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

Authors: Emily Johnson, Noah Wilson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00917
Pdf URL: https://arxiv.org/pdf/2501.00917
Copy Paste: [[2501.00917]] Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models(https://arxiv.org/abs/2501.00917)
Keywords: diffusion, generative
Abstract: Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses these challenges through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD utilizes a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, ensuring precise alignment with visual features. Furthermore, it incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments conducted on MARIO-Eval and INNOVATOR-Eval benchmarks demonstrate that VLAD significantly outperforms state-of-the-art methods in terms of image quality, semantic alignment, and text rendering accuracy. Human evaluations further validate the superior performance of VLAD, making it a promising approach for text-to-image generation in complex scenarios.

Title: On the Low-Complexity of Fair Learning for Combinatorial Multi-Armed Bandit

Authors: Xiaoyi Wu, Bo Ji, Bin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00924
Pdf URL: https://arxiv.org/pdf/2501.00924
Copy Paste: [[2501.00924]] On the Low-Complexity of Fair Learning for Combinatorial Multi-Armed Bandit(https://arxiv.org/abs/2501.00924)
Keywords: fair
Abstract: Combinatorial Multi-Armed Bandit with fairness constraints is a framework where multiple arms form a super arm and can be pulled in each round under uncertainty to maximize cumulative rewards while ensuring the minimum average reward required by each arm. The existing pessimistic-optimistic algorithm linearly combines virtual queue-lengths (tracking the fairness violations) and Upper Confidence Bound estimates as a weight for each arm and selects a super arm with the maximum total weight. The number of super arms could be exponential to the number of arms in many scenarios. In wireless networks, interference constraints can cause the number of super arms to grow exponentially with the number of arms. Evaluating all the feasible super arms to find the one with the maximum total weight can incur extremely high computational complexity in the pessimistic-optimistic algorithm. To avoid this, we develop a low-complexity fair learning algorithm based on the so-called pick-and-compare approach that involves randomly picking $M$ feasible super arms to evaluate. By setting $M$ to a constant, the number of comparison steps in the pessimistic-optimistic algorithm can be reduced to a constant, thereby significantly reducing the computational complexity. Our theoretical proof shows this low-complexity design incurs only a slight sacrifice in fairness and regret performance. Finally, we validate the theoretical result by extensive simulations.
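Illustrative sketch (not from the paper): the pick-and-compare step can be pictured as drawing M candidate super arms at random, adding the previously selected super arm to the pool, and keeping whichever has the largest total weight. The weighting coefficient `eta`, the explicit list of feasible super arms, and all numeric values are hypothetical. A practical implementation would sample feasible super arms directly rather than enumerate them, since enumeration is exactly the cost the approach avoids.

```python
import random

def arm_weight(queue_len, ucb, eta=1.0):
    # Assumed linear combination of virtual queue length and UCB estimate.
    return queue_len + eta * ucb

def total_weight(super_arm, weights):
    return sum(weights[a] for a in super_arm)

def pick_and_compare(feasible_super_arms, weights, previous, m, rng):
    # Draw m random feasible candidates, then compare them with the
    # previously chosen super arm and keep the heaviest.
    m = min(m, len(feasible_super_arms))
    candidates = rng.sample(feasible_super_arms, m) + [previous]
    return max(candidates, key=lambda s: total_weight(s, weights))

# Hypothetical toy instance with 4 arms and 5 feasible super arms.
queues = {0: 0.0, 1: 1.0, 2: 0.5, 3: 0.1}
ucbs = {0: 0.2, 1: 0.5, 2: 0.2, 3: 1.0}
weights = {a: arm_weight(queues[a], ucbs[a]) for a in queues}
feasible = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)]
previous = (0, 2)

chosen = pick_and_compare(feasible, weights, previous, m=2, rng=random.Random(0))
```

Because the previous choice is always in the comparison pool, the selected weight never decreases from one step to the next for fixed weights, while each round costs only M + 1 evaluations.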

Title: Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition

Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2501.00935
Pdf URL: https://arxiv.org/pdf/2501.00935
Copy Paste: [[2501.00935]] Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition(https://arxiv.org/abs/2501.00935)
Keywords: transformer
Abstract: Dynamic gesture recognition is one of the challenging research areas due to variations in pose, size, and shape of the signer's hand. In this letter, Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer multiscaled head attention model. The proposed model employs different attention dimensions for each head of the transformer which enables it to provide attention at the multiscale level. Further, in addition to single modality, recognition performance using multiple modalities is examined. Extensive experiments demonstrate the superior performance of the proposed MsMHA-VTN with an overall accuracy of 88.22\% and 99.10\% on NVGesture and Briareo datasets, respectively.

Title: SPADE: Enhancing Adaptive Cyber Deception Strategies with Generative AI and Structured Prompt Engineering

Authors: Shihab Ahmed, A B M Mohaimenur Rahman, Md Morshed Alam, Md Sajidul Islam Sajid
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.00940
Pdf URL: https://arxiv.org/pdf/2501.00940
Copy Paste: [[2501.00940]] SPADE: Enhancing Adaptive Cyber Deception Strategies with Generative AI and Structured Prompt Engineering(https://arxiv.org/abs/2501.00940)
Keywords: security, defense, generative, large language model
Abstract: The rapid evolution of modern malware presents significant challenges to the development of effective defense mechanisms. Traditional cyber deception techniques often rely on static or manually configured parameters, limiting their adaptability to dynamic and sophisticated threats. This study leverages Generative AI (GenAI) models to automate the creation of adaptive cyber deception ploys, focusing on structured prompt engineering (PE) to enhance relevance, actionability, and deployability. We introduce a systematic framework (SPADE) to address inherent challenges large language models (LLMs) pose to adaptive deceptions, including generalized outputs, ambiguity, under-utilization of contextual information, and scalability constraints. Evaluations across diverse malware scenarios using metrics such as Recall, Exact Match (EM), BLEU Score, and expert quality assessments identified ChatGPT-4o as the top performer. Additionally, it achieved high engagement (93%) and accuracy (96%) with minimal refinements. Gemini and ChatGPT-4o Mini demonstrated competitive performance, with Llama3.2 showing promise despite requiring further optimization. These findings highlight the transformative potential of GenAI in automating scalable, adaptive deception strategies and underscore the critical role of structured PE in advancing real-world cybersecurity applications.

Title: A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset

Authors: Junhuan Yang, Yuzhou Zhang, Yi Sheng, Youzuo Lin, Lei Yang
Subjects: cs.LG, cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2501.00941
Pdf URL: https://arxiv.org/pdf/2501.00941
Copy Paste: [[2501.00941]] A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset(https://arxiv.org/abs/2501.00941)
Keywords: diffusion, generative
Abstract: Recently, the advent of generative AI technologies has made transformational impacts on our daily lives, yet its application in scientific applications remains in its early stages. Data scarcity is a major, well-known barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion of multiple data modalities to describe physical phenomena, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-modal pairwise data generation is highly required instead of single-modal data generation, which is usually used in natural images (e.g., faces, scenery). Moreover, in real-world applications, the unbalance of available data in terms of modalities commonly exists; for example, the spatial data (i.e., velocity maps) in seismic imaging can be easily simulated, but real-world seismic waveform is largely lacking. While the most recent efforts enable the powerful diffusion model to generate multi-modal data, how to leverage the unbalanced available data is still unclear. In this work, we use seismic imaging in subsurface geophysics as a vehicle to present ``UB-Diff'', a novel diffusion model for multi-modal paired scientific data generation. One major innovation is a one-in-two-out encoder-decoder network structure, which can ensure pairwise data is obtained from a co-latent representation. Then, the co-latent representation will be used by the diffusion process for pairwise data generation. Experimental results on the OpenFWI dataset show that UB-Diff significantly outperforms existing techniques in terms of Fréchet Inception Distance (FID) score and pairwise evaluation, indicating the generation of reliable and useful multi-modal pairwise data.

Title: Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers

Authors: Lukas Kuhn, Sari Sadiya, Jorg Schlotterer, Christin Seifert, Gemma Roig
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00942
Pdf URL: https://arxiv.org/pdf/2501.00942
Copy Paste: [[2501.00942]] Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers(https://arxiv.org/abs/2501.00942)
Keywords: transformer
Abstract: Shortcut learning, i.e., a model's reliance on undesired features not directly relevant to the task, is a major challenge that severely limits the applications of machine learning algorithms, particularly when deploying them to assist in making sensitive decisions, such as in medical diagnostics. In this work, we leverage recent advancements in machine learning to create an unsupervised framework that is capable of both detecting and mitigating shortcut learning in transformers. We validate our method on multiple datasets. Results demonstrate that our framework significantly improves both worst-group accuracy (samples misclassified due to shortcuts) and average accuracy, while minimizing human annotation effort. Moreover, we demonstrate that the detected shortcuts are meaningful and informative to human experts, and that our framework is computationally efficient, allowing it to be run on consumer hardware.

Title: Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion

Authors: Hao Wang, Xiwen Chen, Ashish Bastola, Jiayou Qin, Abolfazl Razi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.00944
Pdf URL: https://arxiv.org/pdf/2501.00944
Copy Paste: [[2501.00944]] Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion(https://arxiv.org/abs/2501.00944)
Keywords: diffusion, generative
Abstract: The emergence of generative AI and controllable diffusion has made image-to-image synthesis increasingly practical and efficient. However, when input images are sparse and exhibit low entropy, the inherent characteristics of diffusion models often result in limited diversity. This constraint significantly interferes with data augmentation. To address this, we propose Diffusion Prism, a training-free framework that efficiently transforms binary masks into realistic and diverse samples while preserving morphological features. We found that a small amount of artificial noise significantly assists the image-denoising process. To demonstrate this novel mask-to-image concept, we use nano-dendritic patterns as an example to show the merit of our method compared to existing controllable diffusion models. Furthermore, we extend the proposed framework to other biological patterns, highlighting its potential applications across various fields.

Title: Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model

Authors: Omid Saghatchian, Atiyeh Gh. Moghadam, Ahmad Nickabadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00946
Pdf URL: https://arxiv.org/pdf/2501.00946
Copy Paste: [[2501.00946]] Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model(https://arxiv.org/abs/2501.00946)
Keywords: diffusion
Abstract: Diffusion models have emerged as a promising approach for generating high-quality, high-dimensional images. Nevertheless, these models are hindered by their high computational cost and slow inference, partly due to the quadratic computational complexity of the self-attention mechanisms with respect to input size. Various approaches have been proposed to address this drawback. One such approach, known as token merging (ToMe), focuses on reducing the number of tokens fed into the self-attention. In our method, which is called cached adaptive token merging (CA-ToMe), we calculate the similarity between tokens and then merge a proportion r of the most similar tokens. However, due to the repetitive patterns observed in adjacent steps and the variation in the frequency of similarities, we aim to enhance this approach by implementing an adaptive threshold for merging tokens and adding a caching mechanism that stores similar pairs across several adjacent steps. Empirical results demonstrate that our method operates as a training-free acceleration method, achieving a speedup factor of 1.24 in the denoising process while maintaining the same FID scores compared to existing approaches.
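Illustrative sketch (not from the paper): the basic merge step can be pictured as scoring token pairs by cosine similarity and averaging the most similar disjoint pairs until roughly a fraction r of the tokens has been removed. This greedy all-pairs version is a simplification chosen for readability; it is not the matching scheme used by ToMe or CA-ToMe, and it omits the adaptive threshold and the cache. All names and the toy tokens are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def merge_most_similar(tokens, r):
    # Each merge replaces two tokens with their average, so int(len * r)
    # merges remove that many tokens. Original token order is not preserved.
    n_merge = int(len(tokens) * r)
    pairs = sorted(
        ((cosine(tokens[i], tokens[j]), i, j)
         for i in range(len(tokens))
         for j in range(i + 1, len(tokens))),
        reverse=True,
    )
    used, merged = set(), []
    for _, i, j in pairs:
        if len(merged) == n_merge:
            break
        if i in used or j in used:
            continue  # each token takes part in at most one merge
        used.update((i, j))
        merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
    kept = [t for k, t in enumerate(tokens) if k not in used]
    return kept + merged

# Hypothetical toy tokens: two near-duplicates along each axis.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
half = merge_most_similar(tokens, r=0.5)
quarter = merge_most_similar(tokens, r=0.25)
```

The similarity scores computed here are the part a cache could reuse: if the most similar pairs barely change between adjacent denoising steps, recomputing them at every step is redundant work.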

Title: Incremental Dialogue Management: Survey, Discussion, and Implications for HRI

Authors: Casey Kennington, Pierre Lison, David Schlangen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00953
Pdf URL: https://arxiv.org/pdf/2501.00953
Copy Paste: [[2501.00953]] Incremental Dialogue Management: Survey, Discussion, and Implications for HRI(https://arxiv.org/abs/2501.00953)
Keywords: large language model
Abstract: Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems and survey incremental modeling of important aspects of dialogue such as speech recognition and language generation. The primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and discuss the implications of incremental dialogue for embodied, robotic platforms.

Title: The Silent Majority: Demystifying Memorization Effect in the Presence of Spurious Correlations

Authors: Chenyu You, Haocheng Dai, Yifei Min, Jasjeet S. Sekhon, Sarang Joshi, James S. Duncan
Subjects: cs.LG, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.00961
Pdf URL: https://arxiv.org/pdf/2501.00961
Copy Paste: [[2501.00961]] The Silent Majority: Demystifying Memorization Effect in the Presence of Spurious Correlations(https://arxiv.org/abs/2501.00961)
Keywords: robust
Abstract: Machine learning models often rely on simple spurious features -- patterns in training data that correlate with targets but are not causally related to them, like image backgrounds in foreground classification. This reliance typically leads to imbalanced test performance across minority and majority groups. In this work, we take a closer look at the fundamental cause of such imbalanced performance through the lens of memorization, which refers to the ability to predict accurately on \textit{atypical} examples (minority groups) in the training set while failing to achieve the same accuracy on the test set. This paper systematically shows the ubiquitous existence of spurious features in a small set of neurons within the network, providing the first-ever evidence that memorization may contribute to imbalanced group performance. Through three experimental sources of converging empirical evidence, we find that a small subset of neurons or channels memorizes minority group information. Inspired by these findings, we articulate the hypothesis: the imbalanced group performance is a byproduct of ``noisy'' spurious memorization confined to a small set of neurons. To further substantiate this hypothesis, we show that eliminating these unnecessary spurious memorization patterns via a novel framework during training can significantly affect the model performance on minority groups. Our experimental results across various architectures and benchmarks offer new insights on how neural networks encode core and spurious knowledge, laying the groundwork for future research in demystifying robustness to spurious correlation.

Title: CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation

Authors: Daniel Silver, Ron Kimmel
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00975
Pdf URL: https://arxiv.org/pdf/2501.00975
Copy Paste: [[2501.00975]] CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation(https://arxiv.org/abs/2501.00975)
Keywords: segmentation
Abstract: In the field of video compression, the pursuit of better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representation (INR) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure the network outputs. While pixel-wise methods are better for upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques. The method is based on the separation of the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. When integrated, a byproduct is an unsupervised segmentation of the video sequence. Object motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.

Title: Are LLMs effective psychological assessors? Leveraging adaptive RAG for interpretable mental health screening through psychometric practice

Authors: Federico Ravenda, Seyed Ali Bahrainian, Andrea Raballo, Antonietta Mira, Noriko Kando
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00982
Pdf URL: https://arxiv.org/pdf/2501.00982
Copy Paste: [[2501.00982]] Are LLMs effective psychological assessors? Leveraging adaptive RAG for interpretable mental health screening through psychometric practice(https://arxiv.org/abs/2501.00982)
Keywords: large language model
Abstract: In psychological practice, standardized questionnaires serve as essential tools for assessing mental constructs (e.g., attitudes, traits, and emotions) through structured questions (aka items). With the increasing prevalence of social media platforms where users share personal experiences and emotions, researchers are exploring computational methods to leverage this data for rapid mental health screening. In this study, we propose a novel adaptive Retrieval-Augmented Generation (RAG) approach that completes psychological questionnaires by analyzing social media posts. Our method retrieves the most relevant user posts for each question in a psychological survey and uses Large Language Models (LLMs) to predict questionnaire scores in a zero-shot setting. Our findings are twofold. First we demonstrate that this approach can effectively predict users' responses to psychological questionnaires, such as the Beck Depression Inventory II (BDI-II), achieving performance comparable to or surpassing state-of-the-art models on Reddit-based benchmark datasets without relying on training data. Second, we show how this methodology can be generalized as a scalable screening tool, as the final assessment is systematically derived by completing standardized questionnaires and tracking how individual item responses contribute to the diagnosis, aligning with established psychometric practices.

Title: Optimizing Noise Schedules of Generative Models in High Dimensions

Authors: Santiago Aranguri, Giulio Biroli, Marc Mezard, Eric Vanden-Eijnden
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00988
Pdf URL: https://arxiv.org/pdf/2501.00988
Copy Paste: [[2501.00988]] Optimizing Noise Schedules of Generative Models in High Dimensions(https://arxiv.org/abs/2501.00988)
Keywords: diffusion, generative
Abstract: Recent works have shown that diffusion models can undergo phase transitions, the resolution of which is needed for accurately generating samples. This has motivated the use of different noise schedules, the two most common choices being referred to as variance preserving (VP) and variance exploding (VE). Here we revisit these schedules within the framework of stochastic interpolants. Using the Gaussian Mixture (GM) and Curie-Weiss (CW) data distributions as test case models, we first investigate the effect of the variance of the initial noise distribution and show that VP recovers the low-level feature (the distribution of each mode) but misses the high-level feature (the asymmetry between modes), whereas VE performs oppositely. We also show that this dichotomy, which happens when denoising by a constant amount in each step, can be avoided by using noise schedules specific to VP and VE that allow for the recovery of both high- and low-level features. Finally we show that these schedules yield generative models for the GM and CW model whose probability flow ODE can be discretized using $\Theta_d(1)$ steps in dimension $d$ instead of the $\Theta_d(\sqrt{d})$ steps required by constant denoising.
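
To make the VP/VE distinction concrete, here is a minimal sketch using the standard textbook form x_t = alpha_t * x_0 + sigma_t * z, not the paper's stochastic-interpolant parametrization. The cosine form for VP, the linear form for VE, the value of sigma_max, and all function names are assumptions chosen for illustration.

```python
import math


def vp_coeffs(t):
    """Variance preserving: alpha_t**2 + sigma_t**2 == 1 (cosine form is an assumption)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ve_coeffs(t, sigma_max=10.0):
    """Variance exploding: signal left unscaled, noise scale grows (linear form is an assumption)."""
    return 1.0, sigma_max * t


def marginal_variance(alpha, sigma, data_var=1.0):
    """Variance of x_t = alpha * x_0 + sigma * z for independent, unit-variance z."""
    return alpha ** 2 * data_var + sigma ** 2


# Compare how the per-coordinate variance evolves under each schedule.
for t in (0.0, 0.5, 1.0):
    vp = marginal_variance(*vp_coeffs(t))
    ve = marginal_variance(*ve_coeffs(t))
    print(f"t={t:.1f}  VP variance={vp:.2f}  VE variance={ve:.2f}")
```

For unit-variance data the VP variance stays constant over time while the VE variance grows with the noise scale, which is the property the two names refer to.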

Title: Is It Still Fair? Investigating Gender Fairness in Cross-Corpus Speech Emotion Recognition

Authors: Shreya G. Upadhyay, Woan-Shiuan Chien, Chi-Chun Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00995
Pdf URL: https://arxiv.org/pdf/2501.00995
Copy Paste: [[2501.00995]] Is It Still Fair? Investigating Gender Fairness in Cross-Corpus Speech Emotion Recognition(https://arxiv.org/abs/2501.00995)
Keywords: fair
Abstract: Speech emotion recognition (SER) is a vital component in various everyday applications. Cross-corpus SER models are increasingly recognized for their ability to generalize performance. However, concerns arise regarding fairness across demographics in diverse corpora. Existing fairness research often focuses solely on corpus-specific fairness, neglecting its generalizability in cross-corpus scenarios. Our study focuses on this underexplored area, examining the gender fairness generalizability in cross-corpus SER scenarios. We emphasize that the performance of cross-corpus SER models and their fairness are two distinct considerations. Moreover, we propose the approach of a combined fairness adaptation mechanism to enhance gender fairness in the SER transfer learning tasks by addressing both source and target genders. Our findings bring one of the first insights into the generalizability of gender fairness in cross-corpus SER systems.

Title: Exploring Information Processing in Large Language Models: Insights from Information Bottleneck Theory

Authors: Zhou Yang, Zhengyu Qi, Zhaochun Ren, Zhikai Jia, Haizhou Sun, Xiaofei Zhu, Xiangwen Liao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00999
Pdf URL: https://arxiv.org/pdf/2501.00999
Copy Paste: [[2501.00999]] Exploring Information Processing in Large Language Models: Insights from Information Bottleneck Theory(https://arxiv.org/abs/2501.00999)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks by understanding input information and predicting corresponding outputs. However, the internal mechanisms by which LLMs comprehend input and make effective predictions remain poorly understood. In this paper, we explore the working mechanism of LLMs in information processing from the perspective of Information Bottleneck Theory. We propose a non-training construction strategy to define a task space and identify the following key findings: (1) LLMs compress input information into specific task spaces (e.g., sentiment space, topic space) to facilitate task understanding; (2) they then extract and utilize relevant information from the task space at critical moments to generate accurate predictions. Based on these insights, we introduce two novel approaches: an Information Compression-based Context Learning (IC-ICL) and a Task-Space-guided Fine-Tuning (TS-FT). IC-ICL enhances reasoning performance and inference efficiency by compressing retrieved example information into the task space. TS-FT employs a space-guided loss to fine-tune LLMs, encouraging the learning of more effective compression and selection mechanisms. Experiments across multiple datasets validate the effectiveness of task space construction. Additionally, IC-ICL not only improves performance but also accelerates inference speed by over 40%, while TS-FT achieves superior results with a minimal strategy adjustment.

Title: Physics-informed Gaussian Processes for Safe Envelope Expansion

Authors: D. Isaiah Harp, Joshua Ott, Dylan M. Asmar, John Alora, Mykel J. Kochenderfer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01000
Pdf URL: https://arxiv.org/pdf/2501.01000
Copy Paste: [[2501.01000]] Physics-informed Gaussian Processes for Safe Envelope Expansion(https://arxiv.org/abs/2501.01000)
Keywords: robust
Abstract: Flight test analysis often requires predefined test points with arbitrarily tight tolerances, leading to extensive and resource-intensive experimental campaigns. To address this challenge, we propose a novel approach to flight test analysis using Gaussian processes (GPs) with physics-informed mean functions to estimate aerodynamic quantities from arbitrary flight test data, validated using real T-38 aircraft data collected in collaboration with the United States Air Force Test Pilot School. We demonstrate our method by estimating the pitching moment coefficient without requiring predefined or repeated flight test points, significantly reducing the need for extensive experimental campaigns. Our approach incorporates aerodynamic models as priors within the GP framework, enhancing predictive accuracy across diverse flight conditions and providing robust uncertainty quantification. Key contributions include the integration of physics-based priors in a probabilistic model, which allows for precise computation from arbitrary flight test maneuvers, and the demonstration of our method capturing relevant dynamic characteristics such as short-period mode behavior. The proposed framework offers a scalable and generalizable solution for efficient data-driven flight test analysis and is able to accurately predict the short period frequency and damping for the T-38 across several Mach and dynamic pressure profiles.

Title: Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning

Authors: Yusi Wei, Hande Y. Benson, Joseph K. Agor, Muge Capan
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2501.01002
Pdf URL: https://arxiv.org/pdf/2501.01002
Copy Paste: [[2501.01002]] Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning(https://arxiv.org/abs/2501.01002)
Keywords: privacy, protect, attack
Abstract: Data is essential for secondary use, but ensuring its privacy while allowing such use is a critical challenge. Various techniques have been proposed to address privacy concerns in data sharing and publishing. However, these methods often degrade data utility, impacting the performance of machine learning (ML) models. Our research identifies key limitations in existing optimization models for privacy preservation, particularly in handling categorical variables, assessing data utility, and evaluating effectiveness across diverse datasets. We propose a novel multi-objective optimization model that simultaneously minimizes information loss and maximizes protection against attacks. This model is empirically validated using diverse datasets and compared with two existing algorithms. We assess information loss, the number of individuals subject to linkage or homogeneity attacks, and ML performance after anonymization. The results indicate that our model achieves lower information loss and more effectively mitigates the risk of attacks, reducing the number of individuals susceptible to these attacks compared to alternative algorithms in some cases. Additionally, our model maintains comparative ML performance relative to the original data or data anonymized by other methods. Our findings highlight significant improvements in privacy protection and ML model performance, offering a comprehensive framework for balancing privacy and utility in data sharing.
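
The attack counts mentioned above can be illustrated with a small sketch. It uses one common reading of the two attacks, which is an assumption and not necessarily the paper's definition: a record is exposed to linkage when its quasi-identifier group has fewer than k members, and to homogeneity when its group is large enough but every member shares the same sensitive value. The table, column names, and helper names are made up.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def count_at_risk(records, quasi_identifiers, sensitive, k=2):
    """Return (linkage, homogeneity) counts of exposed records.

    linkage: records in a class smaller than k.
    homogeneity: records in a class of size >= k whose sensitive values are all identical.
    """
    linkage = homogeneity = 0
    for group in equivalence_classes(records, quasi_identifiers).values():
        if len(group) < k:
            linkage += len(group)
        elif len({rec[sensitive] for rec in group}) == 1:
            homogeneity += len(group)
    return linkage, homogeneity


# Toy, made-up table: age band and zip prefix are the assumed quasi-identifiers.
rows = [
    {"age": "30-39", "zip": "191**", "dx": "flu"},
    {"age": "30-39", "zip": "191**", "dx": "flu"},
    {"age": "40-49", "zip": "191**", "dx": "asthma"},
    {"age": "40-49", "zip": "191**", "dx": "flu"},
    {"age": "50-59", "zip": "190**", "dx": "diabetes"},
]
at_risk = count_at_risk(rows, ["age", "zip"], "dx", k=2)
```

In this toy table one record sits alone in its group (linkage) and two records share a group in which the diagnosis is identical (homogeneity).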

Title: EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy

Authors: Ao Gao, Luosong Guo, Tao Chen, Zhao Wang, Ying Tai, Jian Yang, Zhenyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01003
Pdf URL: https://arxiv.org/pdf/2501.01003
Copy Paste: [[2501.01003]] EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy(https://arxiv.org/abs/2501.01003)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) techniques have achieved satisfactory 3D scene representation. Despite their impressive performance, they confront challenges due to the limitation of structure-from-motion (SfM) methods on acquiring accurate scene initialization, or the inefficiency of densification strategy. In this paper, we introduce a novel framework EasySplat to achieve high-quality 3DGS modeling. Instead of using SfM for scene initialization, we employ a novel method to release the power of large-scale pointmap approaches. Specifically, we propose an efficient grouping strategy based on view similarity, and use robust pointmap priors to obtain high-quality point clouds and camera poses for 3D scene initialization. After obtaining a reliable scene structure, we propose a novel densification approach that adaptively splits Gaussian primitives based on the average shape of neighboring Gaussian ellipsoids, utilizing KNN scheme. In this way, the proposed method tackles the limitation on initialization and optimization, leading to an efficient and accurate 3DGS modeling. Extensive experiments demonstrate that EasySplat outperforms the current state-of-the-art (SOTA) in handling novel view synthesis.

Title: Prediction of Geoeffective CMEs Using SOHO Images and Deep Learning

Authors: Khalid A. Alobaid, Jason T. L. Wang, Haimin Wang, Ju Jing, Yasser Abduallah, Zhenduo Wang, Hameedullah Farooki, Huseyin Cavus, Vasyl Yurchyshyn
Subjects: cs.LG, astro-ph.SR, physics.space-ph
Abstract URL: https://arxiv.org/abs/2501.01011
Pdf URL: https://arxiv.org/pdf/2501.01011
Copy Paste: [[2501.01011]] Prediction of Geoeffective CMEs Using SOHO Images and Deep Learning(https://arxiv.org/abs/2501.01011)
Keywords: protect
Abstract: The application of machine learning to the study of coronal mass ejections (CMEs) and their impacts on Earth has seen significant growth recently. Understanding and forecasting CME geoeffectiveness is crucial for protecting infrastructure in space and ensuring the resilience of technological systems on Earth. Here we present GeoCME, a deep-learning framework designed to predict, deterministically or probabilistically, whether a CME event that arrives at Earth will cause a geomagnetic storm. A geomagnetic storm is defined as a disturbance of the Earth's magnetosphere during which the minimum Dst index value is less than -50 nT. GeoCME is trained on observations from instruments including LASCO C2, EIT and MDI on board the Solar and Heliospheric Observatory (SOHO), focusing on a dataset that includes 136 halo/partial halo CMEs in Solar Cycle 23. Using ensemble and transfer learning techniques, GeoCME is capable of extracting features hidden in the SOHO observations and making predictions based on the learned features. Our experimental results demonstrate the good performance of GeoCME, achieving a Matthews correlation coefficient of 0.807 and a true skill statistic score of 0.714 when the tool is used as a deterministic prediction model. When the tool is used as a probabilistic forecasting model, it achieves a Brier score of 0.094 and a Brier skill score of 0.493. These results are promising, showing that the proposed GeoCME can help enhance our understanding of CME-triggered solar-terrestrial interactions.
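
For readers unfamiliar with the four scores quoted above, the following sketch computes them from their usual definitions. The function names are illustrative, and the Brier skill score here uses the observed base rate as the reference forecast, which is a common convention but an assumption about the paper's choice.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def tss(tp, fp, tn, fn):
    """True skill statistic: recall minus false-alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes):
    """Skill relative to always forecasting the observed base rate.

    Undefined (division by zero) when every outcome is identical.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference
```

MCC and TSS apply to the deterministic (storm / no storm) setting, while the Brier score and Brier skill score apply to probabilistic forecasts; lower is better for the Brier score and higher is better for the other three.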

Title: MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large language Model

Authors: Chengze Zhang, Changshan Li, Shiyang Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01014
Pdf URL: https://arxiv.org/pdf/2501.01014
Copy Paste: [[2501.01014]] MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large language Model(https://arxiv.org/abs/2501.01014)
Keywords: extraction, large language model
Abstract: The exponential growth of data and advancements in big data technologies have created a demand for more efficient and automated approaches to data analysis and storytelling. However, automated data analysis systems still face challenges in leveraging large language models (LLMs) for data insight discovery, augmented analysis, and data storytelling. This paper introduces the Multidimensional Data Storytelling Framework (MDSF) based on large language models for automated insight generation and context-aware storytelling. The framework incorporates advanced preprocessing techniques, augmented analysis algorithms, and a unique scoring mechanism to identify and prioritize actionable insights. The use of fine-tuned LLMs enhances contextual understanding and generates narratives with minimal manual intervention. The architecture also includes an agent-based mechanism for real-time storytelling continuation control. Key findings reveal that MDSF outperforms existing methods across various datasets in terms of insight ranking accuracy, descriptive quality, and narrative coherence. The experimental evaluation demonstrates MDSF's ability to automate complex analytical tasks, reduce interpretive biases, and improve user satisfaction. User studies further underscore its practical utility in enhancing content structure, conclusion extraction, and richness of detail.

Title: Boosting Adversarial Transferability with Spatial Adversarial Alignment

Authors: Zhaoyu Chen, Haijing Guo, Kaixun Jiang, Jiyuan Fu, Xinyu Zhou, Dingkang Yang, Hao Tang, Bo Li, Wenqiang Zhang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2501.01015
Pdf URL: https://arxiv.org/pdf/2501.01015
Copy Paste: [[2501.01015]] Boosting Adversarial Transferability with Spatial Adversarial Alignment(https://arxiv.org/abs/2501.01015)
Keywords: attack
Abstract: Deep neural networks are vulnerable to adversarial examples that exhibit transferability across various models. Numerous approaches are proposed to enhance the transferability of adversarial examples, including advanced optimization, data augmentation, and model modifications. However, these methods still show limited transferability, particularly in cross-architecture scenarios, such as from CNN to ViT. To achieve high transferability, we propose a technique termed Spatial Adversarial Alignment (SAA), which employs an alignment loss and leverages a witness model to fine-tune the surrogate model. Specifically, SAA consists of two key parts: spatial-aware alignment and adversarial-aware alignment. First, we minimize the divergences of features between the two models in both global and local regions, facilitating spatial alignment. Second, we introduce a self-adversarial strategy that leverages adversarial examples to impose further constraints, aligning features from an adversarial perspective. Through this alignment, the surrogate model is trained to concentrate on the common features extracted by the witness model. This facilitates adversarial attacks on these shared features, thereby yielding perturbations that exhibit enhanced transferability. Extensive experiments on various architectures on ImageNet show that aligned surrogate models based on SAA can provide higher transferable adversarial examples, especially in cross-architecture attacks.

Title: Efficient Connectivity-Preserving Instance Segmentation with Supervoxel-Based Loss Function

Authors: Anna Grim, Jayaram Chandrashekar, Uygar Sumbul
Subjects: cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2501.01022
Pdf URL: https://arxiv.org/pdf/2501.01022
Copy Paste: [[2501.01022]] Efficient Connectivity-Preserving Instance Segmentation with Supervoxel-Based Loss Function(https://arxiv.org/abs/2501.01022)
Keywords: segmentation
Abstract: Reconstructing the intricate local morphology of neurons and their long-range projecting axons can address many connectivity-related questions in neuroscience. The main bottleneck in connectomics pipelines is correcting topological errors, as segmenting multiple entangled neuronal arbors is a challenging instance segmentation problem. More broadly, segmentation of curvilinear, filamentous structures continues to pose significant challenges. To address this problem, we extend the notion of simple points from digital topology to connected sets of voxels (i.e. supervoxels) and propose a topology-aware neural network segmentation method with minimal computational overhead. We demonstrate its effectiveness on a new public dataset of 3-d light microscopy images of mouse brains, along with the benchmark datasets DRIVE, ISBI12, and CrackTree.

Title: Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer

Authors: Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yabo Wu, Yong Zhao, C.L. Philip Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01023
Pdf URL: https://arxiv.org/pdf/2501.01023
Copy Paste: [[2501.01023]] Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer(https://arxiv.org/abs/2501.01023)
Keywords: transformer
Abstract: In light of the advancements in transformer technology, extant research posits the construction of stereo transformers as a potential solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. The lack of focus on key homonymous points renders the representations of such methods vulnerable to challenging conditions, including reflections and weak textures. Furthermore, a slow computing speed is not conducive to the application. To overcome these difficulties, we present the Hadamard Attention Recurrent Stereo Transformer (HART) that incorporates the following components: 1) For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. 2) We designed a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. This allows HART to focus on important details. DAK also converts zero elements to non-zero elements to mitigate the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing in the Hadamard product, we propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions. Experimental results demonstrate the effectiveness of our HART. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at this https URL.

Title: Towards Adversarially Robust Deep Metric Learning

Authors: Xiaopeng Ke
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01025
Pdf URL: https://arxiv.org/pdf/2501.01025
Copy Paste: [[2501.01025]] Towards Adversarially Robust Deep Metric Learning(https://arxiv.org/abs/2501.01025)
Keywords: defense, attack, robust
Abstract: Deep Metric Learning (DML) has shown remarkable successes in many domains by taking advantage of powerful deep neural networks. Deep neural networks are prone to adversarial attacks and could be easily fooled by adversarial examples. The current progress on this robustness issue is mainly about deep classification models but pays little attention to DML models. Existing works fail to thoroughly inspect the robustness of DML and neglect an important DML scenario, the clustering-based inference. In this work, we first point out the robustness issue of DML models in clustering-based inference scenarios. We find that, for the clustering-based inference, existing defenses designed for DML are unable to be reused and the adaptations of defenses designed for deep classification models cannot achieve satisfactory robustness performance. To alleviate the hazard of adversarial examples, we propose a new defense, the Ensemble Adversarial Training (EAT), which exploits ensemble learning and adversarial training. EAT promotes the diversity of the ensemble, encouraging each model in the ensemble to have different robustness features, and employs a self-transferring mechanism to make full use of the robustness statistics of the whole ensemble in the update of every single model. We evaluate the EAT method on three widely-used datasets with two popular model architectures. The results show that the proposed EAT method greatly outperforms the adaptations of defenses designed for deep classification models.

Title: KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

Authors: Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01028
Pdf URL: https://arxiv.org/pdf/2501.01028
Copy Paste: [[2501.01028]] KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model(https://arxiv.org/abs/2501.01028)
Keywords: large language model
Abstract: As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations of the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.

Title: State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects

Authors: Harshika Goyal, Mohammad Saif Wajid, Mohd Anas Wajid, Akib Mohi Ud Din Khanday, Mehdi Neshat, Amir Gandomi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01029
Pdf URL: https://arxiv.org/pdf/2501.01029
Copy Paste: [[2501.01029]] State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects(https://arxiv.org/abs/2501.01029)
Keywords: security, transformer, generative
Abstract: The rapid advancement of deepfake technologies, specifically designed to create incredibly lifelike facial imagery and video content, has ignited a remarkable level of interest and curiosity across many fields, including forensic analysis, cybersecurity and the innovative creation of digital characters. By harnessing the latest breakthroughs in deep learning methods, such as Generative Adversarial Networks, Variational Autoencoders, Few-Shot Learning Strategies, and Transformers, the outcomes achieved in generating deepfakes have been nothing short of astounding and transformative. Also, the ongoing evolution of detection technologies is being developed to counteract the potential for misuse associated with deepfakes, effectively addressing critical concerns that range from political manipulation to the dissemination of fake news and the ever-growing issue of cyberbullying. This comprehensive review paper meticulously investigates the most recent developments in deepfake generation and detection, including around 400 publications, providing an in-depth analysis of the cutting-edge innovations shaping this rapidly evolving landscape. Starting with a thorough examination of systematic literature review methodologies, we embark on a journey that delves into the complex technical intricacies inherent in the various techniques used for deepfake generation, comprehensively addressing the challenges faced, potential solutions available, and the nuanced details surrounding manipulation formulations. Subsequently, the paper is dedicated to accurately benchmarking leading approaches against prominent datasets, offering thorough assessments of the contributions that have significantly impacted these vital domains. Ultimately, we engage in a thoughtful discussion of the existing challenges, paving the way for continuous advancements in this critical and ever-dynamic study area.

Title: ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning

Authors: Wonduk Seo, Zonghao Yuan, Yi Bu
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2501.01031
Pdf URL: https://arxiv.org/pdf/2501.01031
Copy Paste: [[2501.01031]] ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning(https://arxiv.org/abs/2501.01031)
Keywords: fair, large language model
Abstract: Cultural values alignment in Large Language Models (LLMs) is a critical challenge due to their tendency to embed Western-centric biases from training data, leading to misrepresentations and fairness issues in cross-cultural contexts. Recent approaches, such as role-assignment and few-shot learning, often struggle with reliable cultural alignment as they heavily rely on pre-trained knowledge, lack scalability, and fail to capture nuanced cultural values effectively. To address these issues, we propose ValuesRAG, a novel and effective framework that applies Retrieval-Augmented Generation (RAG) with in-context learning to integrate cultural and demographic knowledge dynamically during text generation. Leveraging the World Values Survey (WVS) dataset, ValuesRAG first generates summaries of values for each individual. Subsequently, we curated several representative regional datasets to serve as test datasets and retrieve relevant summaries of values based on demographic features, followed by a reranking step to select the top-k relevant summaries. ValuesRAG consistently outperforms baseline methods, both in the main experiment and in the ablation study where only the values summary was provided, highlighting ValuesRAG's potential to foster culturally aligned AI systems and enhance the inclusivity of AI-driven applications.

Title: DynamicLip: Shape-Independent Continuous Authentication via Lip Articulator Dynamics

Authors: Huashan Chen, Yifan Xu, Yue Feng, Ming Jian, Feng Liu, Pengfei Hu, Kebin Peng, Sen He, Zi Wang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2501.01032
Pdf URL: https://arxiv.org/pdf/2501.01032
Copy Paste: [[2501.01032]] DynamicLip: Shape-Independent Continuous Authentication via Lip Articulator Dynamics(https://arxiv.org/abs/2501.01032)
Keywords: security, privacy, attack, robust, biometric
Abstract: Biometrics authentication has become increasingly popular due to its security and convenience; however, traditional biometrics are becoming less desirable in scenarios such as new mobile devices, Virtual Reality, and Smart Vehicles. For example, while face authentication is widely used, it suffers from significant privacy concerns. The collection of complete facial data makes it less desirable for privacy-sensitive applications. Lip authentication, on the other hand, has emerged as a promising biometrics method. However, existing lip-based authentication methods heavily depend on static lip shape when the mouth is closed, which can be less robust due to lip shape dynamic motion and can barely work when the user is speaking. In this paper, we revisit the nature of lip biometrics and extract shape-independent features from the lips. We study the dynamic characteristics of lip biometrics based on articulator motion. Building on the knowledge, we propose a system for shape-independent continuous authentication via lip articulator dynamics. This system enables robust, shape-independent and continuous authentication, making it particularly suitable for scenarios with high security and privacy requirements. We conducted comprehensive experiments in different environments and attack scenarios and collected a dataset of 50 subjects. The results indicate that our system achieves an overall accuracy of 99.06% and demonstrates robustness under advanced mimic attacks and AI deepfake attacks, making it a viable solution for continuous biometric authentication in various applications.

Title: Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models

Authors: Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, AiTi Aw
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.01034
Pdf URL: https://arxiv.org/pdf/2501.01034
Copy Paste: [[2501.01034]] Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models(https://arxiv.org/abs/2501.01034)
Keywords: large language model
Abstract: Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our models' adaptability to Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.

Title: MSWA: Refining Local Attention with Multi-Scale Window Attention

Authors: Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01039
Pdf URL: https://arxiv.org/pdf/2501.01039
Copy Paste: [[2501.01039]] MSWA: Refining Local Attention with Multi-Scale Window Attention(https://arxiv.org/abs/2501.01039)
Keywords: transformer
Abstract: Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
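
The window-size idea in this abstract can be illustrated with attention masks. What follows is a minimal sketch, assuming causal attention; the function names, the per-head window list, and the layer schedule are hypothetical choices for illustration, since the abstract does not give the actual allocation used by MSWA.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal local mask: query i may attend to keys j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def multi_scale_masks(seq_len: int, head_windows) -> np.ndarray:
    """Stack one mask per head; head_windows is an assumed per-head window list."""
    return np.stack([sliding_window_mask(seq_len, w) for w in head_windows])

def windows_for_layer(base_windows, layer: int):
    """Hypothetical schedule: deeper layers get proportionally larger windows."""
    return [w * (layer + 1) for w in base_windows]

# Three heads in one layer, each with a different window size.
masks = multi_scale_masks(seq_len=6, head_windows=[1, 2, 4])
# Number of keys visible to each query for the widest head.
visible = masks[2].sum(axis=1)  # -> [1, 2, 3, 4, 4, 4]
```

With a uniform window, every head would share the same mask; here each head sees a different context length, and `windows_for_layer` grows that length with depth.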

Title: Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras

Authors: Jingkai Sun, Qiang Zhang, Jiaxu Wang, Jiahang Cao, Renjing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01040
Pdf URL: https://arxiv.org/pdf/2501.01040
Copy Paste: [[2501.01040]] Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras(https://arxiv.org/abs/2501.01040)
Keywords: transformer
Abstract: Dynamic vision sensors (DVS) are bio-inspired devices that capture visual information in the form of asynchronous events, which encode changes in pixel intensity with high temporal resolution and low latency. These events provide rich motion cues that can be exploited for various computer vision tasks, such as action recognition. However, most existing DVS-based action recognition methods lose temporal information during data transformation or suffer from noise and outliers caused by sensor imperfections or environmental factors. To address these challenges, we propose a novel framework that preserves and exploits the spatiotemporal structure of event data for action recognition. Our framework consists of two main components: 1) a point-wise event masked autoencoder (MAE) that learns a compact and discriminative representation of event patches by reconstructing them from masked raw event camera points data; 2) an improved event points patch generation algorithm that leverages an event data inlier model and point-wise data augmentation techniques to enhance the quality and diversity of event points patches. To the best of our knowledge, our approach introduces the pre-train method into event camera raw points data for the first time, and we propose a novel event points patch embedding to utilize transformer-based models on event cameras.

Title: Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Authors: Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01042
Pdf URL: https://arxiv.org/pdf/2501.01042
Copy Paste: [[2501.01042]] Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs(https://arxiv.org/abs/2501.01042)
Keywords: attack, large language model
Abstract: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models--a common and practical real world scenario--remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks, respectively. Our code will be released upon acceptance.

Title: Dynamic Scaling of Unit Tests for Code Reward Modeling

Authors: Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2501.01054
Pdf URL: https://arxiv.org/pdf/2501.01054
Copy Paste: [[2501.01054]] Dynamic Scaling of Unit Tests for Code Reward Modeling(https://arxiv.org/abs/2501.01054)
Keywords: large language model
Abstract: Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of unit tests serve as reward signals to identify correct solutions. Because LLMs confidently make mistakes, these unit tests are not reliable, thereby diminishing the quality of reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pioneering experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
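
The reward scheme described above (candidate solutions scored by how many generated unit tests they pass) can be sketched in a few lines. This is a toy illustration under stated assumptions: the candidates and tests are hand-written stand-ins for LLM outputs, all names are hypothetical, and nothing here reproduces CodeRM-8B or the dynamic scaling mechanism.

```python
from typing import Callable, Sequence

Solution = Callable[[int], int]
UnitTest = Callable[[Solution], bool]

def run_test(test: UnitTest, solution: Solution) -> bool:
    """A test that raises counts as a failure."""
    try:
        return bool(test(solution))
    except Exception:
        return False

def reward(solution: Solution, tests: Sequence[UnitTest]) -> int:
    """Reward signal: number of unit tests the solution passes."""
    return sum(run_test(t, solution) for t in tests)

def select_best(solutions: Sequence[Solution], tests: Sequence[UnitTest]) -> int:
    """Index of the highest-reward candidate (first one wins ties)."""
    rewards = [reward(s, tests) for s in solutions]
    return max(range(len(solutions)), key=rewards.__getitem__)

# Task: square a number. Candidate 0 is correct, candidate 1 doubles instead.
candidates = [lambda x: x * x, lambda x: x + x]
tests = [
    lambda f: f(1) == 2,   # deliberately wrong test, mimicking an unreliable one
    lambda f: f(2) == 4,
    lambda f: f(3) == 9,
    lambda f: f(0) == 0,
    lambda f: f(4) == 16,
]

few = select_best(candidates, tests[:2])   # -> 1 (the wrong test dominates)
many = select_best(candidates, tests)      # -> 0 (more tests outvote it)
```

In this toy case a single unreliable test misleads selection when only two tests are used, while five tests recover the correct candidate.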

Title: Risks of Cultural Erasure in Large Language Models

Authors: Rida Qadri, Aida M. Davani, Kevin Robinson, Vinodkumar Prabhakaran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01056
Pdf URL: https://arxiv.org/pdf/2501.01056
Copy Paste: [[2501.01056]] Risks of Cultural Erasure in Large Language Models(https://arxiv.org/abs/2501.01056)
Keywords: large language model
Abstract: Large language models are increasingly being integrated into applications that shape the production and discovery of societal knowledge such as search, online education, and travel planning. As a result, language models will shape how people learn about, perceive and interact with global cultures, making it important to consider whose knowledge systems and perspectives are represented in models. Recognizing this importance, work in Machine Learning and NLP has increasingly focused on evaluating gaps in global cultural representational distribution within outputs. However, more work is needed on developing benchmarks for cross-cultural impacts of language models that stem from a nuanced sociologically-aware conceptualization of cultural impact or harm. We join this line of work, arguing for the need for metricizable evaluations of language technologies that interrogate and account for historical power inequities and differential impacts of representation on global cultures, particularly for cultures already under-represented in the digital corpora. We look at two concepts of erasure: omission, where cultures are not represented at all, and simplification, where cultural complexity is erased by presenting one-dimensional views of a rich culture. The former focuses on whether something is represented, and the latter on how it is represented. We focus our analysis on two task contexts with the potential to influence global cultural production. First, we probe representations that a language model produces about different places around the world when asked to describe these contexts. Second, we analyze the cultures represented in the travel recommendations produced by a set of language model applications. Our study shows ways in which the NLP community and application developers can begin to operationalize complex socio-cultural considerations into standard evaluations and benchmarks.

Title: Dynamic Attention-Guided Context Decoding for Mitigating Context Faithfulness Hallucinations in Large Language Models

Authors: Yanwen Huang, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang, Jing Xiao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01059
Pdf URL: https://arxiv.org/pdf/2501.01059
Copy Paste: [[2501.01059]] Dynamic Attention-Guided Context Decoding for Mitigating Context Faithfulness Hallucinations in Large Language Models(https://arxiv.org/abs/2501.01059)
Keywords: robust, large language model
Abstract: Large language models (LLMs) often suffer from context faithfulness hallucinations, where outputs deviate from retrieved information due to insufficient context utilization and high output uncertainty. Our uncertainty evaluation experiments reveal a strong correlation between high uncertainty and hallucinations. We hypothesize that attention mechanisms encode signals indicative of contextual utilization, validated through probing analysis. Based on these insights, we propose Dynamic Attention-Guided Context Decoding (DAGCD), a lightweight framework that integrates attention distributions and uncertainty signals in a single-pass decoding process. Experiments across QA datasets demonstrate DAGCD's effectiveness, achieving significant improvements in faithfulness and robustness while maintaining computational efficiency.

Title: FAPL-DM-BC: A Secure and Scalable FL Framework with Adaptive Privacy and Dynamic Masking, Blockchain, and XAI for the IoVs

Authors: Sathwik Narkedimilli, Amballa Venkata Sriram, Sujith Makam, MSVPJ Sathvik, Sai Prashanth Mallellu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.01063
Pdf URL: https://arxiv.org/pdf/2501.01063
Copy Paste: [[2501.01063]] FAPL-DM-BC: A Secure and Scalable FL Framework with Adaptive Privacy and Dynamic Masking, Blockchain, and XAI for the IoVs(https://arxiv.org/abs/2501.01063)
Keywords: secure, security, privacy, federate
Abstract: The FAPL-DM-BC solution is a new FL-based privacy, security, and scalability solution for the Internet of Vehicles (IoV). It leverages Federated Adaptive Privacy-Aware Learning (FAPL) and Dynamic Masking (DM) to learn and adaptively change privacy policies in response to changing data sensitivity and state in real-time, for the optimal privacy-utility tradeoff. It also incorporates Secure Logging and Verification, Blockchain-based provenance and decentralized validation, and Cloud Microservices Secure Aggregation using FedAvg (Federated Averaging) and Secure Multi-Party Computation (SMPC). Two-model feedback, driven by Model-Agnostic Explainable AI (XAI), certifies local predictions and explanations to drive it to the next level of efficiency. Combining local feedback with world knowledge through a weighted mean computation, FAPL-DM-BC assures federated learning that is secure, scalable, and interpretable. Self-driving cars, traffic management and forecasting, real-time vehicular network cybersecurity, and smart cities are a few possible applications of this integrated, privacy-safe, and high-performance IoV platform.
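
FedAvg, the aggregation rule named in this abstract, is commonly written as a data-size-weighted mean of client parameters. A minimal sketch, assuming flat NumPy parameter vectors and hypothetical client sample counts; secure aggregation, SMPC, blockchain logging, and the XAI feedback loop are not modeled here.

```python
import numpy as np

def fed_avg(client_weights, client_sizes) -> np.ndarray:
    """Weighted mean of client parameter vectors.

    client_weights: list of equal-shape arrays, one per client.
    client_sizes:   number of local training samples per client (assumed known).
    """
    stacked = np.stack(client_weights)              # shape (clients, params)
    sizes = np.asarray(client_sizes, dtype=float)   # shape (clients,)
    return np.average(stacked, axis=0, weights=sizes)

# Two hypothetical vehicles; the second holds three times as much data.
global_weights = fed_avg(
    [np.array([1.0, 2.0]), np.array([3.0, 6.0])],
    client_sizes=[1, 3],
)
# -> array([2.5, 5.0])
```

The client with more data pulls the global parameters further toward its own update, which is the usual motivation for weighting by sample count.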

Title: BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion

Authors: Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01069
Pdf URL: https://arxiv.org/pdf/2501.01069
Copy Paste: [[2501.01069]] BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion(https://arxiv.org/abs/2501.01069)
Keywords: transformer
Abstract: Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at this https URL.

Title: Evidential Calibrated Uncertainty-Guided Interactive Segmentation paradigm for Ultrasound Images

Authors: Jiang Shang, Yuanmeng Wu, Xiaoxiang Han, Xi Chen, Qi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01072
Pdf URL: https://arxiv.org/pdf/2501.01072
Copy Paste: [[2501.01072]] Evidential Calibrated Uncertainty-Guided Interactive Segmentation paradigm for Ultrasound Images(https://arxiv.org/abs/2501.01072)
Keywords: robust, segmentation
Abstract: Accurate and robust ultrasound image segmentation is critical for computer-aided diagnostic systems. Nevertheless, the inherent challenges of ultrasound imaging, such as blurry boundaries and speckle noise, often cause traditional segmentation methods to struggle with performance. Despite recent advancements in universal image segmentation, such as the Segment Anything Model, existing interactive segmentation methods still suffer from inefficiency and lack of specialization. These methods rely heavily on extensive accurate manual or random sampling prompts for interaction, necessitating numerous prompts and iterations to reach satisfactory performance. In response to this challenge, we propose the Evidential Uncertainty-Guided Interactive Segmentation (EUGIS), an end-to-end, efficient tiered interactive segmentation paradigm based on evidential uncertainty estimation for ultrasound image segmentation. Specifically, EUGIS harnesses evidence-based uncertainty estimation, grounded in Dempster-Shafer theory and Subjective Logic, to gauge the level of uncertainty in the model's predictions for different regions. By prioritizing sampling in the high-uncertainty region, our method can effectively simulate the interactive behavior of well-trained radiologists, making sampling more targeted while reducing the number of prompts and iterations. In addition, we propose a trainable calibration mechanism for uncertainty estimation, which can further optimize the boundary between certainty and uncertainty, thereby enhancing the confidence of uncertainty estimation.
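
The Subjective Logic mapping from per-class evidence to an uncertainty map can be sketched as follows. This is the standard evidential formulation (Dirichlet parameters alpha = evidence + 1, uncertainty u = K / sum(alpha)), shown under the assumption that the network outputs non-negative evidence per pixel; the prompt-selection rule and function names are hypothetical, and the trainable calibration mechanism of EUGIS is not shown.

```python
import numpy as np

def evidential_uncertainty(evidence: np.ndarray):
    """evidence: non-negative array of shape (K, H, W), one map per class.

    Returns expected class probabilities (K, H, W) and uncertainty (H, W).
    """
    num_classes = evidence.shape[0]
    alpha = evidence + 1.0             # Dirichlet parameters
    strength = alpha.sum(axis=0)       # Dirichlet strength S per pixel
    prob = alpha / strength            # expected probability alpha_k / S
    uncertainty = num_classes / strength   # u = K / S, in (0, 1]
    return prob, uncertainty

def next_prompt_location(uncertainty: np.ndarray):
    """Assumed sampling rule: place the next prompt at the most uncertain pixel."""
    return np.unravel_index(np.argmax(uncertainty), uncertainty.shape)

# Hypothetical 2-class evidence on a 2x2 image.
evidence = np.array([
    [[9.0, 4.0], [3.0, 0.0]],   # class 0
    [[1.0, 4.0], [5.0, 0.0]],   # class 1
])
prob, u = evidential_uncertainty(evidence)
row, col = next_prompt_location(u)   # -> (1, 1), the pixel with no evidence
```

A pixel with no evidence for any class gets uncertainty 1, so it is the first place this sketch would ask for a prompt.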

Title: Graph Generative Pre-trained Transformer

Authors: Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01073
Pdf URL: https://arxiv.org/pdf/2501.01073
Copy Paste: [[2501.01073]] Graph Generative Pre-trained Transformer(https://arxiv.org/abs/2501.01073)
Keywords: transformer, generative
Abstract: Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of a node set and an edge set. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.

Title: iCNN-LSTM: A batch-based incremental ransomware detection system using Sysmon

Authors: Jamil Ispahany, MD Rafiqul Islam, M. Arif Khan, MD Zahidul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.01083
Pdf URL: https://arxiv.org/pdf/2501.01083
Copy Paste: [[2501.01083]] iCNN-LSTM: A batch-based incremental ransomware detection system using Sysmon(https://arxiv.org/abs/2501.01083)
Keywords: security, robust
Abstract: In response to the increasing ransomware threat, this study presents a novel detection system that integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. By leveraging Sysmon logs, the system enables real-time analysis on Windows-based endpoints. Our approach overcomes the limitations of traditional models by employing batch-based incremental learning, allowing the system to continuously adapt to new ransomware variants without requiring complete retraining. The proposed model achieved an impressive average F2-score of 99.61%, with low false positive and false negative rates of 0.17% and 4.69%, respectively, within a highly imbalanced dataset. This demonstrates exceptional accuracy in identifying malicious behaviour. The dynamic detection capabilities of Sysmon enhance the model's effectiveness by providing a reliable stream of security events, mitigating the vulnerabilities associated with static detection methods. Furthermore, the parallel processing of LSTM modules, combined with attention mechanisms, significantly improves training efficiency and reduces latency, making our system well-suited for real-world applications. These findings underscore the potential of our CNN-LSTM framework as a robust solution for real-time ransomware detection, ensuring adaptability and resilience in the face of evolving cyber threats.
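
For readers unfamiliar with the F2-score reported above: it is the F-beta measure with beta = 2, which weights recall more heavily than precision. A small sketch of the standard formula; the confusion-matrix counts below are hypothetical and are not taken from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts.

    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R),
    with precision P = tp / (tp + fp) and recall R = tp / (tp + fn).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical counts: precision 0.90, recall 0.75.
score = f_beta(tp=9, fp=1, fn=3)   # 45 / 58, about 0.776
```

Because missed detections (false negatives) carry four times the weight of false alarms at beta = 2, the score here sits closer to the recall of 0.75 than to the precision of 0.90.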

Title: Noise-Resilient Symbolic Regression with Dynamic Gating Reinforcement Learning

Authors: Chenglu Sun, Shuo Shen, Wenzhi Tao, Deyi Xue, Zixia Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01085
Pdf URL: https://arxiv.org/pdf/2501.01085
Copy Paste: [[2501.01085]] Noise-Resilient Symbolic Regression with Dynamic Gating Reinforcement Learning(https://arxiv.org/abs/2501.01085)
Keywords: robust, interpretability
Abstract: Symbolic regression (SR) has emerged as a pivotal technique for uncovering the intrinsic information within data and enhancing the interpretability of AI models. However, current state-of-the-art (SOTA) SR methods struggle to correctly recover symbolic expressions from high-noise data. To address this issue, we introduce a novel noise-resilient SR (NRSR) method capable of recovering expressions from high-noise data. Our method leverages a novel reinforcement learning (RL) approach in conjunction with a designed noise-resilient gating module (NGM) to learn symbolic selection policies. The gating module can dynamically filter the meaningless information from high-noise data, thereby demonstrating a high noise-resilient capability for the SR process. We also design a mixed path entropy (MPE) bonus term in the RL process to increase the exploration capabilities of the policy. Experimental results demonstrate that our method significantly outperforms several popular baselines on benchmarks with high-noise data. Furthermore, our method can also achieve SOTA performance on benchmarks with clean data, showcasing its robustness and efficacy in SR tasks.

Title: Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction

Authors: Syed Tahir Hussain Rizvi, Neel Kanwal, Muddasar Naeem, Alfredo Cuzzocrea, Antonio Coronato
Subjects: cs.LG, cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2501.01087
Pdf URL: https://arxiv.org/pdf/2501.01087
Copy Paste: [[2501.01087]] Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction(https://arxiv.org/abs/2501.01087)
Keywords: transformer
Abstract: Time Series Forecasting (TSF) is an important application across many fields. There is a debate about whether Transformers, despite being good at understanding long sequences, struggle with preserving temporal relationships in time series data. Recent research suggests that simpler linear models might outperform or at least provide competitive performance compared to complex Transformer-based models for TSF tasks. In this paper, we propose a novel data-efficient architecture, GLinear, for multivariate TSF that exploits periodic patterns to provide better accuracy. It also provides better prediction accuracy by using a smaller amount of historical data compared to other state-of-the-art linear predictors. Four different datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the performance of the proposed predictor. A performance comparison with state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear) and a transformer-based time series predictor (Autoformer) shows that GLinear, despite being parametrically efficient, significantly outperforms the existing architectures in most cases of multivariate TSF. We hope that the proposed GLinear opens new fronts of research and development of simpler and more sophisticated architectures for data and computationally efficient time-series analysis. The source code is publicly available on GitHub.

Title: A Sysmon Incremental Learning System for Ransomware Analysis and Detection

Authors: Jamil Ispahany, MD Rafiqul Islam, M. Arif Khan, MD Zahidul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.01089
Pdf URL: https://arxiv.org/pdf/2501.01089
Copy Paste: [[2501.01089]] A Sysmon Incremental Learning System for Ransomware Analysis and Detection(https://arxiv.org/abs/2501.01089)
Keywords: attack
Abstract: In the face of increasing cyber threats, particularly ransomware attacks, there is a pressing need for advanced detection and analysis systems that adapt to evolving malware behaviours. Throughout the literature, using machine learning (ML) to obviate ransomware attacks has increased in popularity. Unfortunately, most of these proposals leverage non-incremental learning approaches that require the underlying models to be updated from scratch to detect new ransomware, wasting time and resources. This approach is problematic because it leaves sensitive data vulnerable to attack during retraining, as newly emerging ransomware strains may go undetected until the model is updated. Furthermore, most of these approaches are not designed to detect ransomware in real-time data streams, limiting their effectiveness in complex network environments. To address this challenge, we present the Sysmon Incremental Learning System for Ransomware Analysis and Detection (SILRAD), which enables continuous updates to the underlying model and effectively closes the training gap. By leveraging the capabilities of Sysmon for detailed monitoring of system activities, our approach integrates online incremental learning techniques to enhance the adaptability and efficiency of ransomware detection. The most valuable features for detection were selected using the Pearson Correlation Coefficient (PCC), and concept drift detection was implemented through the ADWIN algorithm, ensuring that the model remains responsive to changes in ransomware behaviour. We compared our results to other popular techniques, such as Hoeffding Trees (HT) and Leveraging Bagging Classifier (LB), observing a detection accuracy of 98.89% and a Matthews Correlation Coefficient (MCC) rate of 94.11%, demonstrating the effectiveness of our technique.
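
The feature-selection step mentioned above, ranking features by Pearson Correlation Coefficient against the label, can be sketched as follows. This is a minimal illustration assuming a dense numeric feature matrix and a binary label vector; the function names and the choice of top-k selection are hypothetical, and the ADWIN drift detector and the incremental classifier are not shown.

```python
import numpy as np

def pearson_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pearson correlation of each column of X with y; constant columns score 0."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.nan_to_num(r)

def select_top_k(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k features with the largest absolute correlation."""
    scores = np.abs(pearson_scores(X, y))
    return np.argsort(-scores)[:k]

# Hypothetical data: 6 events, 3 features, label 1 = ransomware.
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
X = np.column_stack([
    y,                                   # perfectly correlated feature
    np.ones(6),                          # constant feature, carries no signal
    np.array([1.0, 0, 1, 0, 1, 0]),      # weakly (negatively) correlated
])
chosen = select_top_k(X, y, k=2)   # -> [0, 2]
```

Using the absolute value keeps strongly negative correlations, which are as informative for detection as positive ones.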

Title: HoneypotNet: Backdoor Attacks Against Model Extraction

Authors: Yixu Wang, Tianle Gu, Yan Teng, Yingchun Wang, Xingjun Ma
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2501.01090
Pdf URL: https://arxiv.org/pdf/2501.01090
Copy Paste: [[2501.01090]] HoneypotNet: Backdoor Attacks Against Model Extraction(https://arxiv.org/abs/2501.01090)
Keywords: security, defense, attack, extraction, watermark
Abstract: Model extraction attacks are one type of inference-time attack that approximates the functionality and performance of a black-box victim model by launching a certain number of queries to the model and then leveraging the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to the model owners. A body of work has proposed to defend machine learning models against model extraction attacks, including both active defense methods that modify the model's outputs or increase the query overhead to avoid extraction and passive defense methods that detect malicious queries or leverage watermarks to perform post-verification. In this work, we introduce a new defense paradigm called attack as defense which modifies the model's output to be poisonous such that any malicious users that attempt to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization to modify its output to be poisonous while retaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.

Title: EliGen: Entity-Level Controlled Image Generation with Regional Attention

Authors: Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01097
Pdf URL: https://arxiv.org/pdf/2501.01097
Copy Paste: [[2501.01097]] EliGen: Entity-Level Controlled Image Generation with Regional Attention(https://arxiv.org/abs/2501.01097)
Keywords: robust, diffusion, transformer
Abstract: Recent progress in diffusion models has significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.

Title: Long-range Brain Graph Transformer

Authors: Shuo Yu, Shan Jin, Ming Li, Tabinda Sarwar, Feng Xia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01100
Pdf URL: https://arxiv.org/pdf/2501.01100
Copy Paste: [[2501.01100]] Long-range Brain Graph Transformer(https://arxiv.org/abs/2501.01100)
Keywords: transformer
Abstract: Understanding communication and information processing among brain regions of interest (ROIs) is highly dependent on long-range connectivity, which plays a crucial role in facilitating diverse functional neural integration across the entire brain. However, previous studies generally focused on the short-range dependencies within brain networks while neglecting the long-range dependencies, limiting an integrated understanding of brain-wide communication. To address this limitation, we propose Adaptive Long-range aware TransformER (ALTER), a brain graph transformer to capture long-range dependencies between brain ROIs utilizing biased random walk. Specifically, we present a novel long-range aware strategy to explicitly capture long-range dependencies between brain ROIs. By guiding the walker towards the next hop with higher correlation value, our strategy simulates the real-world brain-wide communication. Furthermore, by employing the transformer framework, ALTER adaptively integrates both short- and long-range dependencies between brain ROIs, enabling an integrated understanding of multi-level communication across the entire brain. Extensive experiments on ABIDE and ADNI datasets demonstrate that ALTER consistently outperforms generalized state-of-the-art graph learning methods (including SAN, Graphormer, GraphTrans, and LRGNN) and other graph learning based brain network analysis methods (including FBNETGEN, BrainNetGNN, BrainGNN, and BrainNETTF) in neurological disease diagnosis. Cases of long-range dependencies are also presented to further illustrate the effectiveness of ALTER. The implementation is available at this https URL.
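
The biased random walk mentioned above can be sketched as a walker whose next hop is drawn with probability proportional to the correlation with the current ROI. This transition rule is an assumption made for illustration (the abstract only says the walker is guided towards hops with higher correlation), and the function name and matrix are hypothetical.

```python
import numpy as np

def biased_walk(corr: np.ndarray, start: int, length: int,
                rng: np.random.Generator) -> list:
    """Walk over ROIs; next hop is sampled proportionally to corr[current].

    corr: (N, N) matrix of non-negative correlation values between ROIs.
    Stops early if the current ROI has no positively correlated neighbour.
    """
    path = [start]
    node = start
    for _ in range(length):
        weights = corr[node].astype(float).copy()
        weights[node] = 0.0                 # no self-transitions
        total = weights.sum()
        if total == 0.0:
            break
        node = int(rng.choice(len(weights), p=weights / total))
        path.append(node)
    return path

# Hypothetical 4-ROI correlation matrix (symmetric, zero diagonal).
corr = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.7],
    [0.1, 0.2, 0.0, 0.3],
    [0.0, 0.7, 0.3, 0.0],
])
path = biased_walk(corr, start=0, length=5, rng=np.random.default_rng(0))
```

Nodes reached after several hops give each ROI a set of long-range partners, which is the kind of signal a transformer could then attend over.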

Title: AIM: Additional Image Guided Generation of Transferable Adversarial Attacks

Authors: Teng Li, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01106
Pdf URL: https://arxiv.org/pdf/2501.01106
Copy Paste: [[2501.01106]] AIM: Additional Image Guided Generation of Transferable Adversarial Attacks(https://arxiv.org/abs/2501.01106)
Keywords: attack, generative
Abstract: Transferable adversarial examples highlight the vulnerability of deep neural networks (DNNs) to imperceptible perturbations across various real-world applications. While there have been notable advancements in untargeted transferable attacks, targeted transferable attacks remain a significant challenge. In this work, we focus on generative approaches for targeted transferable attacks. Current generative attacks focus on reducing overfitting to surrogate models and the source data domain, but they often overlook the importance of enhancing transferability through additional semantics. To address this issue, we introduce a novel plug-and-play module into the general generator architecture to enhance adversarial transferability. Specifically, we propose a Semantic Injection Module (SIM) that utilizes the semantics contained in an additional guiding image to improve transferability. The guiding image provides a simple yet effective method to incorporate target semantics from the target class to create targeted and highly transferable attacks. Additionally, we propose new loss formulations that can integrate the semantic injection module more effectively for both targeted and untargeted attacks. We conduct comprehensive experiments under both targeted and untargeted attack settings to demonstrate the efficacy of our proposed approach.

Title: MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification

Authors: Jimin Park, AHyun Ji, Minji Park, Mohammad Saidur Rahman, Se Eun Oh
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01110
Pdf URL: https://arxiv.org/pdf/2501.01110
Copy Paste: [[2501.01110]] MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification(https://arxiv.org/abs/2501.01110)
Keywords: generative
Abstract: Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Generative Replay (GR)-based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model's performance on old data degrades over time. In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement innovative selection schemes for replay samples based on the model's hidden representations. Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario -- where new classes are introduced continuously over multiple tasks -- demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems. The implementation is available at this https URL (the code will be made public upon the presentation of the paper).

Title: Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Authors: Xuan Yu, Yuxuan Xie, Yili Liu, Haojian Lu, Rong Xiong, Yiyi Liao, Yue Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.01119
Pdf URL: https://arxiv.org/pdf/2501.01119
Copy Paste: [[2501.01119]] Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction(https://arxiv.org/abs/2501.01119)
Keywords: segmentation
Abstract: Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: this https URL

Title: Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning

Authors: Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01120
Pdf URL: https://arxiv.org/pdf/2501.01120
Copy Paste: [[2501.01120]] Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning(https://arxiv.org/abs/2501.01120)
Keywords: robust, transformer
Abstract: Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at this https URL.

Title: Graph2text or Graph2token: A Perspective of Large Language Models for Graph Learning

Authors: Shuo Yu, Yingbo Wang, Ruolin Li, Guchun Liu, Yanming Shen, Shaoxiong Ji, Bowen Li, Fengling Han, Xiuzhen Zhang, Feng Xia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01124
Pdf URL: https://arxiv.org/pdf/2501.01124
Copy Paste: [[2501.01124]] Graph2text or Graph2token: A Perspective of Large Language Models for Graph Learning(https://arxiv.org/abs/2501.01124)
Keywords: large language model
Abstract: Graphs are data structures used to represent irregular networks and are prevalent in numerous real-world applications. Previous methods directly model graph structures and achieve significant success. However, these methods encounter bottlenecks due to the inherent irregularity of graphs. An innovative solution is converting graphs into textual representations, thereby harnessing the powerful capabilities of Large Language Models (LLMs) to process and comprehend graphs. In this paper, we present a comprehensive review of methodologies for applying LLMs to graphs, termed LLM4graph. The core of LLM4graph lies in transforming graphs into texts for LLMs to understand and analyze. Thus, we propose a novel taxonomy of LLM4graph methods in the view of the transformation. Specifically, existing methods can be divided into two paradigms: Graph2text and Graph2token, which transform graphs into texts or tokens as the input of LLMs, respectively. We point out four challenges during the transformation to systematically present existing methods in a problem-oriented perspective. For practical concerns, we provide a guideline for researchers on selecting appropriate models and LLMs for different graphs and hardware constraints. We also identify five future research directions for LLM4graph.

Title: DuMo: Dual Encoder Modulation Network for Precise Concept Erasure

Authors: Feng Han, Kai Chen, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01125
Pdf URL: https://arxiv.org/pdf/2501.01125
Copy Paste: [[2501.01125]] DuMo: Dual Encoder Modulation Network for Precise Concept Erasure(https://arxiv.org/abs/2501.01125)
Keywords: generative
Abstract: The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these models alter the parameters of the backbone network and exert considerable influences on the structural (low-frequency) components of the image, which undermines the model's ability to retain non-target concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, the phenomenon is observed that distinct erasing preferences for the image structure and details are demonstrated by the EPR at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module's outputs across different layers and timesteps, automatically balancing the erasure effects and model's generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal and Artistic Style Erasure, clearly outperforming alternative methods. Code is available at this https URL

Title: Source-free Semantic Regularization Learning for Semi-supervised Domain Adaptation

Authors: Xinyang Huang, Chuang Zhu, Ruiying Ren, Shengjie Liu, Tiejun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01126
Pdf URL: https://arxiv.org/pdf/2501.01126
Copy Paste: [[2501.01126]] Source-free Semantic Regularization Learning for Semi-supervised Domain Adaptation(https://arxiv.org/abs/2501.01126)
Keywords: robust
Abstract: Semi-supervised domain adaptation (SSDA) has been extensively researched due to its ability to improve classification performance and generalization ability of models by using a small amount of labeled data on the target domain. However, existing methods cannot effectively adapt to the target domain due to difficulty in fully learning rich and complex target semantic information and relationships. In this paper, we propose a novel SSDA learning framework called semantic regularization learning (SERL), which captures the target semantic information from multiple perspectives of regularization learning to achieve adaptive fine-tuning of the source pre-trained model on the target domain. SERL includes three robust semantic regularization techniques. Firstly, semantic probability contrastive regularization (SPCR) helps the model learn more discriminative feature representations from a probabilistic perspective, using semantic information on the target domain to understand the similarities and differences between samples. Additionally, adaptive weights in SPCR can help the model learn the semantic distribution correctly through the probabilities of different samples. To further comprehensively understand the target semantic distribution, we introduce hard-sample mixup regularization (HMR), which uses easy samples as guidance to mine the latent target knowledge contained in hard samples, thereby learning more complete and complex target semantic knowledge. Finally, target prediction regularization (TPR) regularizes the target predictions of the model by maximizing the correlation between the current prediction and the past learned objective, thereby mitigating the misleading of semantic information caused by erroneous pseudo-labels. Extensive experiments on three benchmark datasets demonstrate that our SERL method achieves state-of-the-art performance.

Title: InDeed: Interpretable image deep decomposition with guaranteed generalizability

Authors: Sihan Wang, Shangqi Gao, Fuping Wu, Xiahai Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01127
Pdf URL: https://arxiv.org/pdf/2501.01127
Copy Paste: [[2501.01127]] InDeed: Interpretable image deep decomposition with guaranteed generalizability(https://arxiv.org/abs/2501.01127)
Keywords: interpretability
Abstract: Image decomposition aims to analyze an image into elementary components, which is essential for numerous downstream tasks and also by nature provides certain interpretability to the analysis. Deep learning can be powerful for such tasks, but surprisingly their combination with a focus on interpretability and generalizability is rarely explored. In this work, we introduce a novel framework for interpretable deep image decomposition, combining hierarchical Bayesian modeling and deep learning to create an architecture-modularized and model-generalizable deep neural network (DNN). The proposed framework includes three steps: (1) hierarchical Bayesian modeling of image decomposition, (2) transforming the inference problem into optimization tasks, and (3) deep inference via a modularized Bayesian DNN. We further establish a theoretical connection between the loss function and the generalization error bound, which inspires a new test-time adaptation approach for out-of-distribution scenarios. We instantiated the application using two downstream tasks, i.e., image denoising and unsupervised anomaly detection, and the results demonstrated improved generalizability as well as interpretability of our methods. The source code will be released upon the acceptance of this paper.

Title: An Inclusive Theoretical Framework of Robust Supervised Contrastive Loss against Label Noise

Authors: Jingyi Cui, Yi-Ge Zhang, Hengyu Liu, Yisen Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01130
Pdf URL: https://arxiv.org/pdf/2501.01130
Copy Paste: [[2501.01130]] An Inclusive Theoretical Framework of Robust Supervised Contrastive Loss against Label Noise(https://arxiv.org/abs/2501.01130)
Keywords: robust
Abstract: Learning from noisy labels is a critical challenge in machine learning, with vast implications for numerous real-world scenarios. While supervised contrastive learning has recently emerged as a powerful tool for navigating label noise, many existing solutions remain heuristic, often devoid of a systematic theoretical foundation for crafting robust supervised contrastive losses. To address the gap, in this paper, we propose a unified theoretical framework for robust losses under the pairwise contrastive paradigm. In particular, we for the first time derive a general robust condition for arbitrary contrastive losses, which serves as a criterion to verify the theoretical robustness of a supervised contrastive loss against label noise. The theory indicates that the popular InfoNCE loss is in fact non-robust, and accordingly inspires us to develop a robust version of InfoNCE, termed Symmetric InfoNCE (SymNCE). Moreover, we highlight that our theory is an inclusive framework that provides explanations to prior robust techniques such as nearest-neighbor (NN) sample selection and robust contrastive loss. Validation experiments on benchmark datasets demonstrate the superiority of SymNCE against label noise.
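For readers unfamiliar with the InfoNCE loss that this abstract analyzes, the following minimal sketch computes the standard form for one anchor. It does not implement SymNCE, whose definition is not given in the abstract. The function name `info_nce`, the use of cosine similarity, and the temperature value are assumptions.

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE for one anchor: negative log of the softmax
    probability assigned to the positive among all candidates."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - pos_logit

# A well-aligned positive gives a much smaller loss than a misaligned one.
print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
print(info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]]))
```

In the supervised setting, the positive is a sample sharing the anchor's label, so a wrong label turns a true negative into a positive. That is the failure mode the paper's robustness condition addresses.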

Title: Privacy Bills of Materials: A Transparent Privacy Information Inventory for Collaborative Privacy Notice Generation in Mobile App Development

Authors: Zhen Tao, Shidong Pan, Zhenchang Xing, Xiaoyu Sun, Omar Haggag, John Grundy, Ze Shi Li, Jingjie Li, Liming Zhu
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2501.01131
Pdf URL: https://arxiv.org/pdf/2501.01131
Copy Paste: [[2501.01131]] Privacy Bills of Materials: A Transparent Privacy Information Inventory for Collaborative Privacy Notice Generation in Mobile App Development(https://arxiv.org/abs/2501.01131)
Keywords: privacy
Abstract: Privacy regulations mandate that developers must provide authentic and comprehensive privacy notices, e.g., privacy policies or labels, to inform users of their apps' privacy practices. However, due to a lack of knowledge of privacy requirements, developers often struggle to create accurate privacy notices, especially for sophisticated mobile apps with complex features and in crowded development teams. To address these challenges, we introduce Privacy Bills of Materials (PriBOM), a systematic software engineering approach that leverages different development team roles to better capture and coordinate mobile app privacy information. PriBOM facilitates transparency-centric privacy documentation and specific privacy notice creation, enabling traceability and trackability of privacy practices. We present a pre-fill of PriBOM based on static analysis and privacy notice analysis techniques. We demonstrate the perceived usefulness of PriBOM through a human evaluation with 150 diverse participants. Our findings suggest that PriBOM could serve as a significant solution for providing privacy support in DevOps for mobile apps.

Title: Missing Data as Augmentation in the Earth Observation Domain: A Multi-View Learning Approach

Authors: Francisco Mena, Diego Arenas, Andreas Dengel
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2501.01132
Pdf URL: https://arxiv.org/pdf/2501.01132
Copy Paste: [[2501.01132]] Missing Data as Augmentation in the Earth Observation Domain: A Multi-View Learning Approach(https://arxiv.org/abs/2501.01132)
Keywords: robust, transformer
Abstract: Multi-view learning (MVL) leverages multiple sources or views of data to enhance machine learning model performance and robustness. This approach has been successfully used in the Earth Observation (EO) domain, where views have a heterogeneous nature and can be affected by missing data. Despite the negative effect that missing data has on model predictions, the ML literature has used it as an augmentation technique to improve model generalization, like masking the input data. Inspired by this, we introduce novel methods for EO applications tailored to MVL with missing views. Our methods integrate the combination of a set to simulate all combinations of missing views as different training samples. Instead of replacing missing data with a numerical value, we use dynamic merge functions, like average, and more complex ones like Transformer. This allows the MVL model to entirely ignore the missing views, enhancing its predictive robustness. We experiment on four EO datasets with temporal and static views, including state-of-the-art methods from the EO domain. The results indicate that our methods improve model robustness under conditions of moderate missingness, and improve the predictive performance when all views are present. The proposed methods offer a single adaptive solution to operate effectively with any combination of available views.
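The two ideas in this abstract, enumerating combinations of missing views as training samples and merging only the available views, can be sketched as follows. This is an illustrative assumption, not the authors' code: the view names, the helper names `view_subsets` and `average_merge`, and the choice of a plain average as the merge function are hypothetical.

```python
from itertools import combinations

def view_subsets(view_names):
    """Yield every non-empty subset of views, i.e. every pattern of
    available views that could be simulated during training."""
    names = list(view_names)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            yield subset

def average_merge(features, available):
    """Average the feature vectors of the available views only, so a
    missing view is ignored instead of being replaced by a fill value."""
    vectors = [features[name] for name in available]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical per-view feature vectors for one sample.
features = {
    "optical": [1.0, 2.0],
    "radar": [3.0, 4.0],
    "weather": [100.0, 100.0],
}
for subset in view_subsets(features):
    print(subset, average_merge(features, subset))
```

Because the merge takes a variable number of inputs, the same model can run with any subset of views at inference time.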

Title: Adaptive Hardness-driven Augmentation and Alignment Strategies for Multi-Source Domain Adaptations

Authors: Yang Yuxiang, Zeng Xinyi, Zeng Pinxian, Zu Chen, Yan Binyu, Zhou Jiliu, Wang Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01142
Pdf URL: https://arxiv.org/pdf/2501.01142
Copy Paste: [[2501.01142]] Adaptive Hardness-driven Augmentation and Alignment Strategies for Multi-Source Domain Adaptations(https://arxiv.org/abs/2501.01142)
Keywords: robust
Abstract: Multi-source Domain Adaptation (MDA) aims to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Nevertheless, traditional methods primarily focus on achieving inter-domain alignment through sample-level constraints, such as Maximum Mean Discrepancy (MMD), neglecting three pivotal aspects: 1) the potential of data augmentation, 2) the significance of intra-domain alignment, and 3) the design of cluster-level constraints. In this paper, we introduce a novel hardness-driven strategy for MDA tasks, named "A3MDA", which collectively considers these three aspects through Adaptive hardness quantification and utilization in both data Augmentation and domain Alignment. To achieve this, "A3MDA" progressively proposes three Adaptive Hardness Measurements (AHM), i.e., Basic, Smooth, and Comparative AHMs, each incorporating distinct mechanisms for diverse scenarios. Specifically, Basic AHM aims to gauge the instantaneous hardness for each source/target sample. Then, hardness values measured by Smooth AHM will adaptively adjust the intensity level of strong data augmentation to maintain compatibility with the model's generalization. In contrast, Comparative AHM is designed to facilitate cluster-level constraints. By leveraging hardness values as sample-specific weights, the traditional MMD is enhanced into a weighted-clustered variant, strengthening the robustness and precision of inter-domain alignment. As for the often-neglected intra-domain alignment, we adaptively construct a pseudo-contrastive matrix by selecting harder samples based on the hardness rankings, enhancing the quality of pseudo-labels, and shaping a well-clustered target feature space. Experiments on multiple MDA benchmarks show that "A3MDA" outperforms other methods.
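A minimal sketch of a sample-weighted MMD can show how per-sample weights such as hardness values enter the discrepancy. This covers only the weighting, not the clustered variant the abstract mentions, and it is not the paper's formulation. The RBF kernel, the biased estimator, and the names `rbf` and `weighted_mmd2` are assumptions.

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_mmd2(xs, ys, wx, wy, gamma=1.0):
    """Biased estimate of squared MMD in which each sample contributes
    in proportion to its normalized weight (e.g. a hardness value)."""
    a = [w / sum(wx) for w in wx]
    b = [w / sum(wy) for w in wy]
    k_xx = sum(ai * aj * rbf(xi, xj, gamma)
               for ai, xi in zip(a, xs) for aj, xj in zip(a, xs))
    k_yy = sum(bi * bj * rbf(yi, yj, gamma)
               for bi, yi in zip(b, ys) for bj, yj in zip(b, ys))
    k_xy = sum(ai * bj * rbf(xi, yj, gamma)
               for ai, xi in zip(a, xs) for bj, yj in zip(b, ys))
    return k_xx + k_yy - 2.0 * k_xy

source = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
target = [[0.0, 0.1], [0.2, 0.1]]
# Down-weighting the outlying source sample shrinks the discrepancy.
print(weighted_mmd2(source, target, [1.0, 1.0, 1.0], [1.0, 1.0]))
print(weighted_mmd2(source, target, [1.0, 1.0, 0.01], [1.0, 1.0]))
```

With uniform weights this reduces to the usual biased MMD estimate.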

Title: BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference

Authors: Wonsuk Jang, Thierry Tambe
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01144
Pdf URL: https://arxiv.org/pdf/2501.01144
Copy Paste: [[2501.01144]] BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference(https://arxiv.org/abs/2501.01144)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. Furthermore, we propose a two-stage approach for online DialectFP4 activation quantization. BlockDialect achieves 11.40% (6.90%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with a comparable bit usage per data, while being only 5.89% (3.31%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
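The per-block format selection described in this abstract can be sketched as a brute-force search that quantizes each block with every candidate format and keeps the one with the lowest squared error. This is an illustration under assumptions, not BlockDialect's selection rule or the real DialectFP4 formatbook: the format names and two of the level sets are hypothetical, and `fp4_like` uses the E2M1 FP4 magnitudes multiplied by 2 so that all levels are integers.

```python
import math

# Hypothetical formatbook: each format lists its representable magnitudes.
FORMATBOOK = {
    "uniform": [0, 1, 2, 3, 4, 5, 6, 7],
    "fp4_like": [0, 1, 2, 3, 4, 6, 8, 12],   # E2M1 magnitudes times 2
    "dense_low": [0, 1, 2, 3, 4, 5, 6, 12],
}

def quantize_block(block, levels):
    """Scale the block so its peak maps to the largest level, then round
    every value to the nearest representable magnitude."""
    peak = max(abs(x) for x in block)
    if peak == 0:
        return list(block), 0.0
    scale = peak / levels[-1]
    quantized = []
    for x in block:
        magnitude = min(levels, key=lambda q: abs(abs(x) / scale - q))
        quantized.append(math.copysign(magnitude * scale, x))
    error = sum((x - y) ** 2 for x, y in zip(block, quantized))
    return quantized, error

def pick_format(block, formatbook=FORMATBOOK):
    """Return the name of the format with the lowest quantization error."""
    return min(formatbook,
               key=lambda name: quantize_block(block, formatbook[name])[1])

print(pick_format([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
print(pick_format([0.0, -1.0, 2.0, 3.0, 4.0, 6.0, -8.0, 12.0]))
```

An exhaustive search like this is likely too costly for online activation quantization, which is presumably why the abstract mentions a separate two-stage approach.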

Title: PoVF: Empowering Decentralized Blockchain Systems with Verifiable Function Consensus

Authors: Chenxi Xiong, Ting Yang, Yu Wang, Bing Dong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.01146
Pdf URL: https://arxiv.org/pdf/2501.01146
Copy Paste: [[2501.01146]] PoVF: Empowering Decentralized Blockchain Systems with Verifiable Function Consensus(https://arxiv.org/abs/2501.01146)
Keywords: secure, security, fair
Abstract: The consensus mechanism is the core technology for blockchain to ensure that transactions are executed in sequence. It also determines the decentralization, security, and efficiency of blockchain. Existing mechanisms all have certain centralization issues and fail to ensure the decentralization of blockchain networks. A decentralized and efficient mechanism is required to improve blockchain systems. This paper proposes a fair consensus mechanism called Proof of Verifiable Functions (PoVF), based on the verifiability and unpredictability of verifiable functions. PoVF provides a sufficiently fair mechanism, ensuring that all nodes in the blockchain network have an equal opportunity to participate in consensus. In addition, a structure called "Delay buffer" is proposed to ensure transactions are executed sequentially. It delays the selection of blocks to avoid blockchain forks caused by broadcasting and transaction execution confusion. According to our security analysis, PoVF is provably secure and has the ability to resist potential adversaries. According to the experiments, a PoVF-based blockchain can process up to 4000 transactions per second with nodes configured with only 4-core CPUs. This paper uses the Gini coefficient to measure the decentralization of blockchains, and the PoVF-based blockchain achieves the lowest Gini coefficient of 0.39 among all sampled blockchains. PoVF has been shown to provide sufficient efficiency while ensuring decentralization and security through experiments.
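The Gini coefficient used here as a decentralization measure can be computed as in the sketch below. The abstract does not say which quantity is measured per node; the example assumes a count of blocks produced by each node, and the function name `gini` is illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means a perfectly equal
    distribution, values near 1 mean one participant holds almost all."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

# Hypothetical blocks-per-node counts.
print(gini([25, 25, 25, 25]))   # evenly shared block production
print(gini([0, 0, 0, 100]))     # one node produces every block
```

A lower value indicates that block production is spread more evenly across nodes.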

Title: TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions

Authors: Vriksha Srihari, R. Bhavya, Shruti Jayaraman, V. Mary Anita Rajam
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01156
Pdf URL: https://arxiv.org/pdf/2501.01156
Copy Paste: [[2501.01156]] TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions(https://arxiv.org/abs/2501.01156)
Keywords: diffusion, generative, large language model
Abstract: While generative models such as text-to-image, large language models and text-to-video have seen significant progress, the extension to text-to-virtual-reality remains largely unexplored, due to a deficit in training data and the complexity of achieving realistic depth and motion in virtual environments. This paper proposes an approach to coalesce existing generative systems to form a stereoscopic virtual reality video from text. Carried out in three main stages, we start with a base text-to-image model that captures context from an input text. We then employ Stable Diffusion on the rudimentary image produced, to generate frames with enhanced realism and overall quality. These frames are processed with depth estimation algorithms to create left-eye and right-eye views, which are stitched side-by-side to create an immersive viewing experience. Such systems would be highly beneficial in virtual reality production, since filming and scene building often require extensive hours of work and post-production effort. We utilize image evaluation techniques, specifically Fréchet Inception Distance and CLIP Score, to assess the visual quality of frames produced for the video. These quantitative measures establish the proficiency of the proposed method. Our work highlights the exciting possibilities of using natural language-driven graphics in fields like virtual reality simulations.

Title: Leveraging Full Dependency Parsing Graph Information For Biomedical Event Extraction

Authors: Farshad Noravesh, Reza Haffari, Ong Huey Fang, Layki Soon, Sailaja Rajalana, Arghya Pal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01158
Pdf URL: https://arxiv.org/pdf/2501.01158
Copy Paste: [[2501.01158]] Leveraging Full Dependency Parsing Graph Information For Biomedical Event Extraction(https://arxiv.org/abs/2501.01158)
Keywords: extraction
Abstract: Many models are proposed in the literature on biomedical event extraction (BEE). Some of them use the shortest dependency path (SDP) information to represent the argument classification task. There is an issue with this representation since even missing one word from the dependency parsing graph may totally change the final prediction. To this end, the full adjacency matrix of the dependency graph is used to embed individual tokens using a graph convolutional network (GCN). An ablation study is also done to show the effect of the dependency graph on the overall performance. The results show a significant improvement when dependency graph information is used. The proposed model slightly outperforms state-of-the-art models on BEE over different datasets.
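To illustrate how a full adjacency matrix of a dependency graph can drive token embeddings, the sketch below applies one graph-convolution layer in the common symmetric-normalized form. It is not the paper's model: treating dependency arcs as undirected, the three-token example, and the name `gcn_layer` are assumptions.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python."""
    n = len(adj)
    # Add self-loops so each token keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][d] for k in range(n))
            for d in range(len(feats[0]))] for i in range(n)]
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(len(weight))))
             for o in range(len(weight[0]))] for i in range(n)]

# Three tokens in a chain: token 1 is linked to tokens 0 and 2.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for row in gcn_layer(adj, identity, identity):
    print(row)
```

Unlike a shortest-dependency-path representation, every arc in the parse contributes to the aggregation, so dropping one word does not remove the whole path.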

Title: 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

Authors: Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01163
Pdf URL: https://arxiv.org/pdf/2501.01163
Copy Paste: [[2501.01163]] 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer(https://arxiv.org/abs/2501.01163)
Keywords: extraction, transformer
Abstract: Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines (such as offline multi-view feature extraction or additional task-specific heads), 3D-LLaVA adopts a minimalist design with integrated architecture and only takes point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text description. This versatile OST is empowered by the hybrid pretraining to obtain perception priors and leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks. The code and model will be released to promote future exploration.

Title: Towards Interactive Deepfake Analysis

Authors: Lixiong Qin, Ning Jiang, Yang Zhang, Yuhan Qiu, Dingheng Zeng, Jiani Hu, Weihong Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01164
Pdf URL: https://arxiv.org/pdf/2501.01164
Copy Paste: [[2501.01164]] Towards Interactive Deepfake Analysis(https://arxiv.org/abs/2501.01164)
Keywords: large language model
Abstract: Existing deepfake analysis methods are primarily based on discriminative models, which significantly limits their application scenarios. This paper explores interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This approach faces challenges such as the lack of datasets and benchmarks and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) an interactive deepfake analysis system called DFA-GPT, built with the Low-Rank Adaptation (LoRA) module, as a strong baseline for the community. The dataset and code will be made available at this https URL to facilitate further research.
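
Illustrative sketch (not from the paper): the abstract names Low-Rank Adaptation (LoRA) but gives no configuration. In general, LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the adapted layer computes y = W x + (alpha / r) * B (A x). The shapes, the rank r, the scaling alpha, and every helper name below are assumptions chosen for illustration; this is not DFA-GPT's implementation.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x); W stays frozen."""
    r = len(a)  # rank = number of rows of A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / r
    return [y0 + scale * u for y0, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Number of parameters in the factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


random.seed(0)
d_out, d_in, r = 6, 8, 2  # hypothetical layer size and rank
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the update starts at zero
x = [1.0] * d_in

y = lora_forward(W, A, B, x, alpha=4.0)
```

With these toy sizes the adapter trains r * (d_out + d_in) = 28 values instead of the 48 entries of W; the saving grows with layer size, which is one way LoRA can ease the training-efficiency concern the abstract raises.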

Title: Deep Learning in Palmprint Recognition-A Comprehensive Survey

Authors: Chengrui Gao, Ziyuan Yang, Wei Jia, Lu Leng, Bob Zhang, Andrew Beng Jin Teoh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01166
Pdf URL: https://arxiv.org/pdf/2501.01166
Copy Paste: [[2501.01166]] Deep Learning in Palmprint Recognition-A Comprehensive Survey(https://arxiv.org/abs/2501.01166)
Keywords: security, privacy, biometric, extraction, segmentation
Abstract: Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers' prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition--often grounded in traditional methodologies--there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.

Title: Machine Learning-Based Prediction of ICU Readmissions in Intracerebral Hemorrhage Patients: Insights from the MIMIC Databases

Authors: Shuheng Chen, Junyi Fan, Armin Abdollahi, Negin Ashrafi, Kamiar Alaei, Greg Placencia, Maryam Pishgar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01183
Pdf URL: https://arxiv.org/pdf/2501.01183
Copy Paste: [[2501.01183]] Machine Learning-Based Prediction of ICU Readmissions in Intracerebral Hemorrhage Patients: Insights from the MIMIC Databases(https://arxiv.org/abs/2501.01183)
Keywords: robust
Abstract: Intracerebral hemorrhage (ICH) is a life-threatening condition characterized by bleeding within the brain parenchyma. ICU readmission in ICH patients is a critical outcome, reflecting both clinical severity and resource utilization. Accurate prediction of ICU readmission risk is crucial for guiding clinical decision-making and optimizing healthcare resources. This study utilized the Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV) databases, which contain comprehensive clinical and demographic data on ICU patients. Patients with ICH were identified from both databases. Various clinical, laboratory, and demographic features were extracted for analysis based on both a literature review and experts' opinions. Preprocessing methods such as imputation and sampling were applied to improve the performance of our models. Machine learning techniques, such as Artificial Neural Network (ANN), XGBoost, and Random Forest, were employed to develop predictive models for ICU readmission risk. Model performance was evaluated using metrics such as AUROC, accuracy, sensitivity, and specificity. The developed models demonstrated robust predictive accuracy for ICU readmission in ICH patients, with key predictors including demographic information, clinical parameters, and laboratory measurements. Our study provides a predictive framework for ICU readmission risk in ICH patients, which can aid in clinical decision-making and improve resource allocation in intensive care settings.
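
Illustrative sketch (not from the paper): the abstract lists AUROC, sensitivity, and specificity as evaluation metrics without reporting values. The snippet below shows how these quantities are commonly computed for a binary readmission label. The labels, scores, and the 0.5 threshold are made-up toy values, not MIMIC data or the authors' results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = readmitted, 0 = not readmitted (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = auroc(y_true, scores)
```

The pairwise AUROC above is quadratic in the number of samples, which is fine for a sketch; a rank-based formulation is the usual choice for large cohorts.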

Title: Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection

Authors: Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01184
Pdf URL: https://arxiv.org/pdf/2501.01184
Copy Paste: [[2501.01184]] Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection(https://arxiv.org/abs/2501.01184)
Keywords: interpretability
Abstract: Detecting deepfake videos is highly challenging due to the complex intertwined spatial and temporal artifacts in forged sequences. Most recent approaches rely on binary classifiers trained on both real and fake data. However, such methods may struggle to focus on important artifacts, which can hinder their generalization capability. Additionally, these models often lack interpretability, making it difficult to understand how predictions are made. To address these issues, we propose FakeSTormer, offering two key contributions. First, we introduce a multi-task learning framework with additional spatial and temporal branches that enable the model to focus on subtle spatio-temporal artifacts. These branches also provide interpretability by highlighting video regions that may contain artifacts. Second, we propose a video-level data synthesis algorithm that generates pseudo-fake videos with subtle artifacts, providing the model with high-quality samples and ground truth data for our spatial and temporal branches. Extensive experiments on several challenging benchmarks demonstrate the competitiveness of our approach compared to recent state-of-the-art methods. The code is available at this https URL.

Title: NET-SA: An Efficient Secure Aggregation Architecture Based on In-Network Computing

Authors: Qingqing Ren, Wen Wang, Shuyong Zhu, Zhiyuan Wu, Yujun Zhang
Subjects: cs.CR, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2501.01187
Pdf URL: https://arxiv.org/pdf/2501.01187
Copy Paste: [[2501.01187]] NET-SA: An Efficient Secure Aggregation Architecture Based on In-Network Computing(https://arxiv.org/abs/2501.01187)
Keywords: secure, privacy, protect, attack
Abstract: Privacy-preserving machine learning (PPML) enables clients to collaboratively train deep learning models without sharing private datasets, but faces privacy leakage risks due to gradient leakage attacks. Prevailing methods leverage secure aggregation strategies to enhance PPML, where clients leverage masks and secret sharing to further protect gradient data while tolerating participant dropouts. These methods, however, require frequent inter-client communication to negotiate keys and perform secret sharing, leading to substantial communication overhead. To tackle this issue, we propose NET-SA, an efficient secure aggregation architecture for PPML based on in-network computing. NET-SA employs seed homomorphic pseudorandom generators for local gradient masking and utilizes programmable switches for seed aggregation. Accurate and secure gradient aggregation is then performed on the central server based on masked gradients and aggregated seeds. This design effectively reduces communication overhead due to eliminating the communication-intensive phases of seed agreement and secret sharing, with enhanced dropout tolerance due to overcoming the threshold limit of secret sharing. Extensive experiments on server clusters and Intel Tofino programmable switch demonstrate that NET-SA achieves up to 77x and 12x enhancements in runtime and 2x decrease in total client communication cost compared with state-of-the-art methods.
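
Illustrative sketch (not from the paper): the abstract describes clients masking gradients with a seed-homomorphic pseudorandom generator, seeds being summed in the network, and the server unmasking the aggregate. The toy below mimics that data flow with a public linear map G(s) = M s mod Q, which satisfies G(s1 + s2) = G(s1) + G(s2) but is NOT cryptographically secure. The modulus, dimensions, integer-quantized gradients, and all names are assumptions; how NET-SA protects seeds and handles dropouts is not specified in the abstract and is not modeled here.

```python
import random

Q = 2**31 - 1  # modulus for all arithmetic (assumed)


def make_public_matrix(out_dim, seed_dim, rng):
    """Public matrix M shared by all parties."""
    return [[rng.randrange(Q) for _ in range(seed_dim)] for _ in range(out_dim)]


def toy_prg(matrix, seed):
    """Linear map G(s) = M s mod Q: homomorphic in the seed, but not a secure PRG."""
    return [sum(m * s for m, s in zip(row, seed)) % Q for row in matrix]


def add_mod(u, v):
    return [(a + b) % Q for a, b in zip(u, v)]


def sub_mod(u, v):
    return [(a - b) % Q for a, b in zip(u, v)]


rng = random.Random(0)
grad_dim, seed_dim, n_clients = 5, 3, 4
M = make_public_matrix(grad_dim, seed_dim, rng)

# Each client holds an integer-quantized gradient and a private seed.
grads = [[rng.randrange(1000) for _ in range(grad_dim)] for _ in range(n_clients)]
seeds = [[rng.randrange(Q) for _ in range(seed_dim)] for _ in range(n_clients)]

# Client side: mask the gradient with G(seed).
masked = [add_mod(g, toy_prg(M, s)) for g, s in zip(grads, seeds)]

# In-network role: only the seeds are summed.
agg_seed = [0] * seed_dim
for s in seeds:
    agg_seed = add_mod(agg_seed, s)

# Server side: sum the masked gradients, then remove G(aggregated seed).
agg_masked = [0] * grad_dim
for m in masked:
    agg_masked = add_mod(agg_masked, m)
recovered = sub_mod(agg_masked, toy_prg(M, agg_seed))

expected = [sum(col) % Q for col in zip(*grads)]
```

Because the mask of the summed seed equals the sum of the masks, the server recovers the gradient sum from one aggregated seed, with no pairwise key agreement between clients.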

Title: A Game Between the Defender and the Attacker for Trigger-based Black-box Model Watermarking

Authors: Chaoyue Huang, Hanzhou Wu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.01194
Pdf URL: https://arxiv.org/pdf/2501.01194
Copy Paste: [[2501.01194]] A Game Between the Defender and the Attacker for Trigger-based Black-box Model Watermarking(https://arxiv.org/abs/2501.01194)
Keywords: protect, attack, watermark
Abstract: Watermarking deep neural network (DNN) models has attracted a great deal of attention and interest in recent years because of the increasing demand to protect the intellectual property of DNN models. Many practical algorithms have been proposed by covertly embedding a secret watermark into a given DNN model through either parametric/structural modulation or backdooring against intellectual property infringement from the attacker while preserving the model performance on the original task. Despite the performance of these approaches, the lack of basic research restricts the algorithmic design to either a trial-based method or a data-driven technique. This has motivated the authors in this paper to introduce a game between the model attacker and the model defender for trigger-based black-box model watermarking. For each of the two players, we construct the payoff function and determine the optimal response, which enriches the theoretical foundation of model watermarking and may inspire us to develop novel schemes in the future.

Title: LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge

Authors: Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk Kim, Seungho Nam, Sunghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01197
Pdf URL: https://arxiv.org/pdf/2501.01197
Copy Paste: [[2501.01197]] LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge(https://arxiv.org/abs/2501.01197)
Keywords: generative
Abstract: Layers have become indispensable tools for professional artists, allowing them to build a hierarchical structure that enables independent control over individual visual elements. In this paper, we propose LayeringDiff, a novel pipeline for the synthesis of layered images, which begins by generating a composite image using an off-the-shelf image generative model, followed by disassembling the image into its constituent foreground and background layers. By extracting layers from a composite image, rather than generating them from scratch, LayeringDiff bypasses the need for large-scale training to develop generative capabilities for individual layers. Furthermore, by utilizing a pretrained off-the-shelf generative model, our method can produce diverse contents and object scales in synthesized layers. For effective layer decomposition, we adapt a large-scale pretrained generative prior to estimate foreground and background layers. We also propose high-frequency alignment modules to refine the fine-details of the estimated layers. Our comprehensive experiments demonstrate that our approach effectively synthesizes layered images and supports various practical applications.

Title: Empirical Analysis of Nature-Inspired Algorithms for Autism Spectrum Disorder Detection Using 3D Video Dataset

Authors: Aneesh Panchal, Kainat Khan, Rahul Katarya
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2501.01202
Pdf URL: https://arxiv.org/pdf/2501.01202
Copy Paste: [[2501.01202]] Empirical Analysis of Nature-Inspired Algorithms for Autism Spectrum Disorder Detection Using 3D Video Dataset(https://arxiv.org/abs/2501.01202)
Keywords: robust, extraction
Abstract: Autism Spectrum Disorder (ASD) is a chronic neurodevelopmental disorder whose symptoms include repetitive behaviour and a lack of social and communication skills. Even though these symptoms can be seen very clearly in social settings, a large number of individuals with ASD remain undiagnosed. In this paper, we develop a methodology for the detection of ASD from a 3-dimensional walking video dataset, utilizing supervised machine learning (ML) classification algorithms and nature-inspired optimization algorithms for feature extraction from the dataset. The proposed methodology involves the classification of ASD using a supervised ML classification algorithm and extracting important and relevant features from the dataset using nature-inspired optimization algorithms. We also include ranking coefficients to find the initial leading particle. This selection of the particle significantly reduces the computation time and hence improves the total efficiency and accuracy of ASD detection. To evaluate the efficiency of the proposed methodology, we deployed various combinations of classification algorithms and nature-inspired algorithms, resulting in an outstanding classification accuracy of 100% using the random forest classification algorithm and the gravitational search algorithm for feature selection. The application of the proposed methodology with different datasets would enhance the robustness and generalizability of the proposed methodology. Owing to its high accuracy and lower total computation time, the proposed methodology will offer a significant contribution to the medical and academic fields, providing a foundation for future research and advancements in ASD diagnosis.

Title: Real-time Cross-modal Cybersickness Prediction in Virtual Reality

Authors: Yitong Zhu, Tangyao Li, Yuyang Wang
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2501.01212
Pdf URL: https://arxiv.org/pdf/2501.01212
Copy Paste: [[2501.01212]] Real-time Cross-modal Cybersickness Prediction in Virtual Reality(https://arxiv.org/abs/2501.01212)
Keywords: extraction, transformer
Abstract: Cybersickness remains a significant barrier to the widespread adoption of immersive virtual reality (VR) experiences, as it can greatly disrupt user engagement and comfort. Research has shown that cybersickness is significantly reflected in head and eye tracking data, along with other physiological data (e.g., TMP, EDA, and BMP). Despite the application of deep learning techniques such as CNNs and LSTMs, these models often struggle to capture the complex interactions between multiple data modalities and lack the capacity for real-time inference, limiting their practical application. Addressing this gap, we propose a lightweight model that leverages a transformer-based encoder with sparse self-attention to process bio-signal features and a PP-TSN network for video feature extraction. These features are then integrated via a cross-modal fusion module, creating a video-aware bio-signal representation that supports cybersickness prediction based on both visual and bio-signal inputs. Our model, trained with a lightweight framework, was validated on a public dataset containing eye and head tracking data, physiological data, and VR video, and demonstrated state-of-the-art performance in cybersickness prediction, achieving a high accuracy of 93.13% using only VR video inputs. These findings suggest that our approach not only enables effective, real-time cybersickness prediction but also addresses the longstanding issue of modality interaction in VR environments. This advancement provides a foundation for future research on multimodal data integration in VR, potentially leading to more personalized, comfortable and widely accessible VR experiences.

Title: TabTreeFormer: Tree Augmented Tabular Data Generation using Transformers

Authors: Jiayu Li, Bingyin Zhao, Zilong Zhao, Kevin Yee, Uzair Javaid, Yingjie Lao, Biplab Sikdar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01216
Pdf URL: https://arxiv.org/pdf/2501.01216
Copy Paste: [[2501.01216]] TabTreeFormer: Tree Augmented Tabular Data Generation using Transformers(https://arxiv.org/abs/2501.01216)
Keywords: privacy, transformer, generative
Abstract: Transformers have achieved remarkable success in tabular data generation. However, they lack domain-specific inductive biases which are critical to preserving the intrinsic characteristics of tabular data. Meanwhile, they suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that incorporates a tree-based model that retains tabular-specific inductive biases of non-smooth and potentially low-correlated patterns due to its discreteness and non-rotational invariance, and hence enhances the fidelity and utility of synthetic data. In addition, we devise a dual-quantization tokenizer to capture the multimodal continuous distribution and further facilitate the learning of numerical value distribution. Moreover, our proposed tokenizer reduces the vocabulary size and sequence length due to the limited dimension-wise semantic meaning and training set size of tabular data, rendering a significant model size shrink without sacrificing the capability of the transformer model. We evaluate TabTreeFormer on 10 datasets against multiple generative models on various metrics; our experimental results show that TabTreeFormer achieves superior fidelity, utility, privacy, and efficiency. Our best model yields a 40% utility improvement with 1/16 of the baseline model size.

Title: Classification of Operational Records in Aviation Using Deep Learning Approaches

Authors: Aziida Nanyonga, Graham Wild
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01222
Pdf URL: https://arxiv.org/pdf/2501.01222
Copy Paste: [[2501.01222]] Classification of Operational Records in Aviation Using Deep Learning Approaches(https://arxiv.org/abs/2501.01222)
Keywords: robust
Abstract: Ensuring safety in the aviation industry is critical; even minor anomalies can lead to severe consequences. This study evaluates the performance of four deep learning (DL) models: Bidirectional Long Short-Term Memory (BLSTM), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Simple Recurrent Neural Networks (sRNN), on a multi-class classification task involving Commercial, Military, and Private categories using the Socrata aviation dataset of 4,864 records. The models were assessed using a classification report, confusion matrix analysis, accuracy metrics, and validation loss and accuracy curves. Among the models, BLSTM achieved the highest overall accuracy of 72%, demonstrating superior performance in stability and balanced classification, while LSTM followed closely with 71%, excelling in recall for the Commercial class. CNN and sRNN exhibited lower accuracies of 67% and 69%, respectively, with significant misclassifications in the Private class. While the results highlight the strengths of BLSTM and LSTM in handling sequential dependencies and complex classification tasks, all models faced challenges with class imbalance, particularly in predicting the Military and Private categories. Addressing these limitations through data augmentation, advanced feature engineering, and ensemble learning techniques could enhance classification accuracy and robustness. This study underscores the importance of selecting appropriate architectures for domain-specific tasks.

Title: Conditional Consistency Guided Image Translation and Enhancement

Authors: A. V. Subramanyam, Amil Bhagat, Milind Jain
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01223
Pdf URL: https://arxiv.org/pdf/2501.01223
Copy Paste: [[2501.01223]] Conditional Consistency Guided Image Translation and Enhancement(https://arxiv.org/abs/2501.01223)
Keywords: diffusion, generative
Abstract: Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement, remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets, demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at this https URL.

Title: Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent

Authors: Yongxian Wei, Anke Tang, Li Shen, Feng Xiong, Chun Yuan, Xiaochun Cao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01230
Pdf URL: https://arxiv.org/pdf/2501.01230
Copy Paste: [[2501.01230]] Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent(https://arxiv.org/abs/2501.01230)
Keywords: data-free
Abstract: Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental requirement of model merging: ensuring the merged model performs comparably to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($\textit{i.e.}$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $\textit{data-free}$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $\textit{shared subspace}$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
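
Illustrative sketch (not from the paper): the abstract refers to task vectors, merging coefficients, and projecting gradients within a shared subspace, without giving formulas. The snippet shows the generic building blocks those terms usually denote: a task vector as the difference between fine-tuned and pretrained weights, a coefficient-weighted merge, and an orthogonal projection onto a given orthonormal basis. How the authors construct the shared subspace or choose coefficients is not stated in the abstract; the basis, weights, and coefficients below are hypothetical toy values.

```python
def task_vector(finetuned, pretrained):
    """Task vector = fine-tuned weights minus pretrained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def merge(pretrained, task_vectors, coeffs):
    """Merged weights = pretrained + sum_t coeff_t * task_vector_t."""
    merged = list(pretrained)
    for tv, c in zip(task_vectors, coeffs):
        merged = [m + c * t for m, t in zip(merged, tv)]
    return merged


def project(grad, basis):
    """Orthogonal projection of grad onto span(basis); basis rows must be orthonormal."""
    out = [0.0] * len(grad)
    for b in basis:
        coef = sum(g * bi for g, bi in zip(grad, b))
        out = [o + coef * bi for o, bi in zip(out, b)]
    return out


# Toy 3-parameter "models" (hypothetical values).
theta0 = [0.0, 1.0, 2.0]   # pretrained
theta_a = [1.0, 1.0, 2.0]  # fine-tuned on task A
theta_b = [0.0, 3.0, 2.0]  # fine-tuned on task B

tvs = [task_vector(theta_a, theta0), task_vector(theta_b, theta0)]
merged = merge(theta0, tvs, coeffs=[0.5, 0.5])

# Assumed orthonormal basis for a 2-D subspace of parameter space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g_proj = project([3.0, 4.0, 5.0], basis)
```

Here the projection keeps the first two gradient components and zeroes the third, so an update along g_proj cannot move the parameters outside the chosen subspace.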

Title: SVFR: A Unified Framework for Generalized Video Face Restoration

Authors: Zhiyao Wang, Xu Chen, Chengming Xu, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Chengjie Wang, Yuqi Liu, Yiyi Zhou, Rongrong Ji
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.01235
Pdf URL: https://arxiv.org/pdf/2501.01235
Copy Paste: [[2501.01235]] SVFR: A Unified Framework for Generalized Video Face Restoration(https://arxiv.org/abs/2501.01235)
Keywords: diffusion, generative
Abstract: Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed Stable Video Face Restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce facial prior learning and self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration.

Title: Automated Self-Refinement and Self-Correction for LLM-based Product Attribute Value Extraction

Authors: Alexander Brinkmann, Christian Bizer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01237
Pdf URL: https://arxiv.org/pdf/2501.01237
Copy Paste: [[2501.01237]] Automated Self-Refinement and Self-Correction for LLM-based Product Attribute Value Extraction(https://arxiv.org/abs/2501.01237)
Keywords: extraction, large language model
Abstract: Structured product data, in the form of attribute-value pairs, is essential for e-commerce platforms to support features such as faceted product search and attribute-based product comparison. However, vendors often provide unstructured product descriptions, making attribute value extraction necessary to ensure data consistency and usability. Large language models (LLMs) have demonstrated their potential for product attribute value extraction in few-shot scenarios. Recent research has shown that self-refinement techniques can improve the performance of LLMs on tasks such as code generation and text-to-SQL translation. For other tasks, the application of these techniques has resulted in increased costs due to processing additional tokens, without achieving any improvement in performance. This paper investigates applying two self-refinement techniques, error-based prompt rewriting and self-correction, to the product attribute value extraction task. The self-refinement techniques are evaluated across zero-shot, few-shot in-context learning, and fine-tuning scenarios using GPT-4o. The experiments show that both self-refinement techniques have only a marginal impact on the model's performance across the different scenarios, while significantly increasing processing costs. For scenarios with training data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the number of product descriptions increases.

Title: EHCTNet: Enhanced Hybrid of CNN and Transformer Network for Remote Sensing Image Change Detection

Authors: Junjie Yang, Haibo Wan, Zhihai Shang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01238
Pdf URL: https://arxiv.org/pdf/2501.01238
Copy Paste: [[2501.01238]] EHCTNet: Enhanced Hybrid of CNN and Transformer Network for Remote Sensing Image Change Detection(https://arxiv.org/abs/2501.01238)
Keywords: extraction, transformer
Abstract: Remote sensing (RS) change detection incurs a high cost because of false negatives, which are more costly than false positives. Existing frameworks, struggling to improve the Precision metric to reduce the cost of false positives, still have limitations in focusing on the change of interest, which leads to missed detections and discontinuity issues. This work tackles these issues by enhancing feature learning capabilities and integrating the frequency components of feature information, with a strategy to incrementally boost the Recall value. We propose an enhanced hybrid of CNN and Transformer network (EHCTNet) for effectively mining the change information of interest. Firstly, a dual-branch feature extraction module is used to extract the multi-scale features of RS images. Secondly, the frequency component of these features is exploited by a refined module I. Thirdly, an enhanced token mining module based on the Kolmogorov-Arnold Network is utilized to derive semantic information. Finally, the semantic change information's frequency component, beneficial for final detection, is mined from the refined module II. Extensive experiments validate the effectiveness of EHCTNet in comprehending complex changes of interest. The visualization outcomes show that EHCTNet detects more intact and continuous changed areas and perceives more accurate neighboring distinction than state-of-the-art models.

Title: Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Authors: Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.01243
Pdf URL: https://arxiv.org/pdf/2501.01243
Copy Paste: [[2501.01243]] Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants(https://arxiv.org/abs/2501.01243)
Keywords: large language model
Abstract: Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.

Title: SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

Authors: Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01245
Pdf URL: https://arxiv.org/pdf/2501.01245
Copy Paste: [[2501.01245]] SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization(https://arxiv.org/abs/2501.01245)
Keywords: large language model
Abstract: Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., "salto backward tucked with 1 turn"). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model's predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.

Title: Large Language Model-Enhanced Symbolic Reasoning for Knowledge Base Completion

Authors: Qiyuan He, Jianfei Yu, Wenya Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01246
Pdf URL: https://arxiv.org/pdf/2501.01246
Copy Paste: [[2501.01246]] Large Language Model-Enhanced Symbolic Reasoning for Knowledge Base Completion(https://arxiv.org/abs/2501.01246)
Keywords: robust, large language model
Abstract: Integrating large language models (LLMs) with rule-based reasoning offers a powerful solution for improving the flexibility and reliability of Knowledge Base Completion (KBC). Traditional rule-based KBC methods offer verifiable reasoning yet lack flexibility, while LLMs provide strong semantic understanding yet suffer from hallucinations. With the aim of combining LLMs' understanding capability with the logic and rigor of rule-based approaches, we propose a novel framework consisting of a Subgraph Extractor, an LLM Proposer, and a Rule Reasoner. The Subgraph Extractor first samples subgraphs from the KB. Then, the LLM uses these subgraphs to propose diverse and meaningful rules that are helpful for inferring missing facts. To effectively avoid hallucination in LLMs' generations, these proposed rules are further refined by a Rule Reasoner to pinpoint the most significant rules in the KB for Knowledge Base Completion. Our approach offers several key benefits: the utilization of LLMs to enhance the richness and diversity of the proposed rules and the integration with rule-based reasoning to improve reliability. Our method also demonstrates strong performance across diverse KB datasets, highlighting the robustness and generalizability of the proposed framework.

Title: Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?

Authors: Manuel Weber, Moritz Huber, Maximilian Auch, Alexander Döschl, Max-Emanuel Keller, Peter Mandl
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01256
Pdf URL: https://arxiv.org/pdf/2501.01256
Copy Paste: [[2501.01256]] Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?(https://arxiv.org/abs/2501.01256)
Keywords: large language model
Abstract: In recent years, toxic content and hate speech have become widespread phenomena on the internet. Moderators of online newspapers and forums are now required, partly due to legal regulations, to carefully review and, if necessary, delete reader comments. This is a labor-intensive process. Some providers of large language models already offer solutions for automated hate speech detection or the identification of toxic content. These include GPT-4o from OpenAI, Jigsaw's (Google) Perspective API, and OpenAI's Moderation API. Based on the selected German test dataset HOCON34k, which was specifically created for developing tools to detect hate speech in reader comments of online newspapers, these solutions are compared with each other and against the HOCON34k baseline. The test dataset contains 1,592 annotated text samples. For GPT-4o, three different prompting strategies are used: Zero-Shot, One-Shot, and Few-Shot. The results of the experiments demonstrate that GPT-4o outperforms both the Perspective API and the Moderation API, and exceeds the HOCON34k baseline by approximately 5 percentage points, as measured by a combined metric of MCC and F2-score.
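
Note: The abstract reports results with a combined metric of MCC and F2-score but does not say how the two are combined. As a minimal sketch, the two underlying metrics can be computed from binary confusion-matrix counts using their standard definitions. The function names and the example counts below are illustrative assumptions, not values from the paper.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(round(mcc(40, 10, 20, 30), 4))
    print(round(f_beta(40, 10, 20), 4))
```

F2 penalizes missed hate-speech comments (false negatives) more than false alarms, while MCC also accounts for true negatives, which matters when the classes are imbalanced.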

Title: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01257
Pdf URL: https://arxiv.org/pdf/2501.01257
Copy Paste: [[2501.01257]] CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings(https://arxiv.org/abs/2501.01257)
Keywords: large language model
Abstract: With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. The CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.
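
Note: The abstract does not spell out CodeElo's rating calculation. For orientation only, the sketch below shows the classical pairwise Elo model (expected score and rating update). The K-factor of 32 and the function names are assumptions made for illustration, and this is not the benchmark's actual procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classical Elo: probability that A outperforms B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float,
                  score_a: float, k: float = 32.0) -> float:
    """Move A's rating toward the observed result (1 = win, 0.5 = draw, 0 = loss).

    k is an assumed K-factor; real platforms choose their own.
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


if __name__ == "__main__":
    # Equal ratings give an expected score of 0.5, so a win adds k / 2 points.
    print(expected_score(1500, 1500))
    print(update_rating(1500, 1500, 1.0))
```

Under this model a 400-point rating gap corresponds to roughly a 10-to-1 expected advantage, which gives a feel for the distance between the reported ratings of 1578 and 1261.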

Title: Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging

Authors: Mengjie Qin, Yuchao Feng, Zongliang Wu, Yulun Zhang, Xin Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01262
Pdf URL: https://arxiv.org/pdf/2501.01262
Copy Paste: [[2501.01262]] Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging(https://arxiv.org/abs/2501.01262)
Keywords: transformer
Abstract: In the coded aperture snapshot spectral imaging system, Deep Unfolding Networks (DUNs) have made impressive progress in recovering 3D hyperspectral images (HSIs) from a single 2D measurement. However, the inherent nonlinear and ill-posed characteristics of HSI reconstruction still pose challenges to existing methods in terms of accuracy and stability. To address this issue, we propose a Mamba-inspired Joint Unfolding Network (MiJUN), which integrates physics-embedded DUNs with learning-based HSI imaging. Firstly, leveraging the concept of trapezoid discretization to expand the representation space of unfolding networks, we introduce an accelerated unfolding network scheme. This approach can be interpreted as a generalized accelerated half-quadratic splitting with a second-order differential equation, which reduces the reliance on initial optimization stages and addresses challenges related to long-range interactions. Crucially, within the Mamba framework, we restructure the Mamba-inspired global-to-local attention mechanism by incorporating a selective state space model and an attention mechanism. This effectively reinterprets Mamba as a variant of the Transformer architecture, improving its adaptability and efficiency. Furthermore, we refine the scanning strategy with Mamba by integrating the tensor mode-$k$ unfolding into the Mamba network. This approach emphasizes the low-rank properties of tensors along various modes, while conveniently facilitating 12 scanning directions. Numerical and visual comparisons on both simulation and real datasets demonstrate the superiority of our proposed MiJUN, which achieves overwhelming detail representation.

Title: Stealthy Backdoor Attack to Real-world Models in Android Apps

Authors: Jiali Wei, Ming Fan, Xicheng Zhang, Wenjing Jiao, Haijun Wang, Ting Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01263
Pdf URL: https://arxiv.org/pdf/2501.01263
Copy Paste: [[2501.01263]] Stealthy Backdoor Attack to Real-world Models in Android Apps(https://arxiv.org/abs/2501.01263)
Keywords: security, attack, robust, steal
Abstract: Powered by their superior performance, deep neural networks (DNNs) have found widespread applications across various domains. Many deep learning (DL) models are now embedded in mobile apps, making them more accessible to end users through on-device DL. However, deploying on-device DL to users' smartphones simultaneously introduces several security threats. One primary threat is backdoor attacks. Extensive research has explored backdoor attacks for several years and has proposed numerous attack approaches. However, few studies have investigated backdoor attacks on DL models deployed in the real world, or they have shown obvious deficiencies in effectiveness and stealthiness. In this work, we explore more effective and stealthy backdoor attacks on real-world DL models extracted from mobile apps. Our main justification is that imperceptible and sample-specific backdoor triggers generated by DNN-based steganography can enhance the efficacy of backdoor attacks on real-world models. We first confirm the effectiveness of steganography-based backdoor attacks on four state-of-the-art DNN models. Subsequently, we systematically evaluate and analyze the stealthiness of the attacks to ensure they are difficult to perceive. Finally, we implement the backdoor attacks on real-world models and compare our approach with three baseline methods. We collect 38,387 mobile apps, extract 89 DL models from them, and analyze these models to obtain the prerequisite model information for the attacks. After identifying the target models, our approach achieves an average of 12.50% higher attack success rate than DeepPayload while better maintaining the normal performance of the models. Extensive experimental results demonstrate that our method enables more effective, robust, and stealthy backdoor attacks on real-world models.

Title: ProgCo: Program Helps Self-Correction of Large Language Models

Authors: Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01264
Pdf URL: https://arxiv.org/pdf/2501.01264
Copy Paste: [[2501.01264]] ProgCo: Program Helps Self-Correction of Large Language Models(https://arxiv.org/abs/2501.01264)
Keywords: large language model
Abstract: Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and can further enhance performance when combined with real program tools.

Title: Does a Large Language Model Really Speak in Human-Like Language?

Authors: Mose Park, Yunjin Choi, Jong-June Jeon
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2501.01273
Pdf URL: https://arxiv.org/pdf/2501.01273
Copy Paste: [[2501.01273]] Does a Large Language Model Really Speak in Human-Like Language?(https://arxiv.org/abs/2501.01273)
Keywords: large language model
Abstract: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text. This study compares the latent community structures of LLM-generated text and human-written text within a hypothesis testing procedure. Specifically, we analyze three text sets: original human-written texts ($\mathcal{O}$), their LLM-paraphrased versions ($\mathcal{G}$), and a twice-paraphrased set ($\mathcal{S}$) derived from $\mathcal{G}$. Our analysis addresses two key questions: (1) Is the difference in latent community structures between $\mathcal{O}$ and $\mathcal{G}$ the same as that between $\mathcal{G}$ and $\mathcal{S}$? (2) Does $\mathcal{G}$ become more similar to $\mathcal{O}$ as the LLM parameter controlling text variability is adjusted? The first question is based on the assumption that if LLM-generated text truly resembles human language, then the gap between the pair ($\mathcal{O}$, $\mathcal{G}$) should be similar to that between the pair ($\mathcal{G}$, $\mathcal{S}$), as both pairs consist of an original text and its paraphrase. The second question examines whether the degree of similarity between LLM-generated and human text varies with changes in the breadth of text generation. To address these questions, we propose a statistical hypothesis testing framework that leverages the fact that each text has corresponding parts across all datasets due to their paraphrasing relationship. This relationship enables the mapping of one dataset's relative position to another, allowing two datasets to be mapped to a third dataset. As a result, both mapped datasets can be quantified with respect to the space characterized by the third dataset, facilitating a direct comparison between them. Our results indicate that GPT-generated text remains distinct from human-authored text.

Title: HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking

Authors: Leandro Di Bella, Yangxintong Lyu, Bruno Cornelis, Adrian Munteanu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.01275
Pdf URL: https://arxiv.org/pdf/2501.01275
Copy Paste: [[2501.01275]] HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking(https://arxiv.org/abs/2501.01275)
Keywords: robust
Abstract: The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.08% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency. The code will be publicly available at the time of publishing: this https URL.
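
Note: According to the abstract, HybridTrack learns the transition residual and the Kalman gain from data. For contrast, the sketch below shows one step of the classical scalar Kalman filter, where the gain is derived in closed form from hand-set noise variances. The constant-state motion model, the default values of `q` and `r`, and the function name are illustrative assumptions and are not taken from the paper.

```python
def kalman_step(x: float, p: float, z: float,
                q: float = 0.01, r: float = 1.0) -> tuple[float, float, float]:
    """One predict/update cycle of a 1-D Kalman filter.

    x, p : prior state estimate and its variance
    z    : new measurement
    q, r : process and measurement noise variances (hand-tuned here)
    Returns the posterior estimate, its variance, and the gain used.
    """
    # Predict: constant-state motion model, uncertainty grows by q.
    x_pred = x
    p_pred = p + q
    # Update: closed-form Kalman gain balances prediction and measurement.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, k


if __name__ == "__main__":
    x, p = 0.0, 1.0
    for z in (1.2, 0.9, 1.1):  # hypothetical noisy measurements
        x, p, k = kalman_step(x, p, z)
        print(f"estimate={x:.3f} variance={p:.3f} gain={k:.3f}")
```

The hand-set `q` and `r` are the kind of manual motion and noise modeling that the abstract says a data-driven gain is meant to eliminate.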

Title: ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01290
Pdf URL: https://arxiv.org/pdf/2501.01290
Copy Paste: [[2501.01290]] ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark(https://arxiv.org/abs/2501.01290)
Keywords: robust
Abstract: Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome-supervised reward models (ORMs) with process-supervised reward models (PRMs) to assess their ability to improve complex tool-use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving a 19% and 11% improvement in rank@1 accuracy for ranking base and fine-tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi-step tool-use tasks.

Title: Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case

Authors: Kaushik Roy, Harshul Surana, Darssan Eswaramoorthi, Yuxin Zi, Vedant Palit, Ritvik Garimella, Amit Sheth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01305
Pdf URL: https://arxiv.org/pdf/2501.01305
Copy Paste: [[2501.01305]] Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case(https://arxiv.org/abs/2501.01305)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we utilize the Mentalllama and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as llama-3.1-8b and mixtral-8x7b.

Title: Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking

Authors: Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01306
Pdf URL: https://arxiv.org/pdf/2501.01306
Copy Paste: [[2501.01306]] Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking(https://arxiv.org/abs/2501.01306)
Keywords: large language model
Abstract: Large language models (LLMs) demonstrate exceptional capabilities, yet still face the hallucination issue. Typical text generation approaches adopt an auto-regressive generation without deliberate reasoning, which often results in untrustworthy and factually inaccurate responses. In this paper, we propose HaluSearch, a novel framework that incorporates tree search-based algorithms (e.g. MCTS) to enable an explicit slow thinking generation process for mitigating hallucinations of LLMs during inference. Specifically, HaluSearch frames text generation as a step-by-step reasoning process, using a self-evaluation reward model to score each generation step and guide the tree search towards the most reliable generation pathway for fully exploiting the internal knowledge of LLMs. To balance efficiency and quality, we introduce a hierarchical thinking system switch mechanism inspired by the dual process theory in cognitive science, which dynamically alternates between fast and slow thinking modes at both the instance and step levels, adapting to the complexity of questions and reasoning states. We conduct extensive experiments on both English and Chinese datasets and the results show that our approach significantly outperforms baseline approaches.

Title: Multi-Head Explainer: A General Framework to Improve Explainability in CNNs and Transformers

Authors: Bohang Sun, Pietro Liò
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01311
Pdf URL: https://arxiv.org/pdf/2501.01311
Copy Paste: [[2501.01311]] Multi-Head Explainer: A General Framework to Improve Explainability in CNNs and Transformers(https://arxiv.org/abs/2501.01311)
Keywords: explainability, transformer
Abstract: In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and modular framework that enhances both the explainability and accuracy of Convolutional Neural Networks (CNNs) and Transformer-based models. MHEX consists of three core components: an Attention Gate that dynamically highlights task-relevant features, Deep Supervision that guides early layers to capture fine-grained details pertinent to the target class, and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. Our approach demonstrates superior compatibility, enabling effortless integration into existing residual networks like ResNet and Transformer architectures such as BERT with minimal modifications. Extensive experiments on benchmark datasets in medical imaging and text classification show that MHEX not only improves classification accuracy but also produces highly interpretable and detailed saliency scores.

Title: SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

Authors: Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01320
Pdf URL: https://arxiv.org/pdf/2501.01320
Copy Paste: [[2501.01320]] SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration(https://arxiv.org/abs/2501.01320)
Keywords: diffusion, transformer
Abstract: Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.

Title: Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension

Authors: Yanbo Fang, Ruixiang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01332
Pdf URL: https://arxiv.org/pdf/2501.01332
Copy Paste: [[2501.01332]] Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension(https://arxiv.org/abs/2501.01332)
Keywords: large language model
Abstract: Understanding how large language models (LLMs) acquire, retain, and apply knowledge remains an open challenge. This paper introduces a novel framework, K-(CSA)^2, which categorizes LLM knowledge along two dimensions: correctness and confidence. The framework defines six categories of knowledge, ranging from highly confident correctness to confidently held misconceptions, enabling a nuanced evaluation of model comprehension beyond binary accuracy. Using this framework, we demonstrate how techniques like chain-of-thought prompting and reinforcement learning with human feedback fundamentally alter the knowledge structures of internal (pre-trained) and external (context-dependent) knowledge in LLMs. CoT particularly enhances base model performance and shows synergistic benefits when applied to aligned LLMs. Moreover, our layer-wise analysis reveals that higher layers in LLMs encode more high-confidence knowledge, while low-confidence knowledge tends to emerge in middle-to-lower layers.

Title: Analysis of Security in OS-Level Virtualization

Authors: Krishna Sai Ketha, Guanqun Song, Ting Zhu
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2501.01334
Pdf URL: https://arxiv.org/pdf/2501.01334
Copy Paste: [[2501.01334]] Analysis of Security in OS-Level Virtualization(https://arxiv.org/abs/2501.01334)
Keywords: security, attack
Abstract: Virtualization is a technique that allows multiple instances, typically running different guest operating systems, to run on top of a single piece of physical hardware. A hypervisor, a layer of software running on top of the host operating system, typically runs and manages these different guest operating systems. Rather than running different services on different servers for reliability and security reasons, companies started to employ virtualization to run these services within a single server. This approach proves beneficial to the companies as it provides much better reliability, stronger isolation, improved security and resource utilization compared to running services on multiple servers. Although hypervisor-based virtualization offers better resource utilization and stronger isolation, it also suffers from high overhead as the host operating system has to maintain different guest operating systems. To tackle this issue, another form of virtualization known as Operating System-level virtualization has emerged. This virtualization provides light-weight, minimal and efficient virtualization, as the different instances are run on top of the same host operating system, sharing the resources of the host operating system. However, because the instances share the same host operating system, their isolation is weakened. In this paper, we will first establish the basic concepts of virtualization and point out the differences between hypervisor-based virtualization and operating system-level virtualization. Next, we will discuss the container creation life-cycle, which helps in forming a threat model for container systems and allows different potential attack vectors within these systems to be mapped. Finally, we will discuss a case study, which further looks at the isolation provided by containers.

Title: CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models

Authors: Johan Wahréus, Ahmed Mohamed Hussain, Panos Papadimitratos
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01335
Pdf URL: https://arxiv.org/pdf/2501.01335
Copy Paste: [[2501.01335]] CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models(https://arxiv.org/abs/2501.01335)
Keywords: security, attack, generative, large language model
Abstract: Numerous studies have investigated methods for jailbreaking Large Language Models (LLMs) to generate harmful content. Typically, these methods are evaluated using datasets of malicious prompts designed to bypass security policies established by LLM providers. However, the generally broad scope and open-ended nature of existing datasets can complicate the assessment of jailbreaking effectiveness, particularly in specific domains, notably cybersecurity. To address this issue, we present and publicly release CySecBench, a comprehensive dataset containing 12662 prompts specifically designed to evaluate jailbreaking techniques in the cybersecurity domain. The dataset is organized into 10 distinct attack-type categories, featuring close-ended prompts to enable a more consistent and accurate assessment of jailbreaking attempts. Furthermore, we detail our methodology for dataset generation and filtration, which can be adapted to create similar datasets in other domains. To demonstrate the utility of CySecBench, we propose and evaluate a jailbreaking approach based on prompt obfuscation. Our experimental results show that this method successfully elicits harmful content from commercial black-box LLMs, achieving Success Rates (SRs) of 65% with ChatGPT and 88% with Gemini; in contrast, Claude demonstrated greater resilience with a jailbreaking SR of 17%. Compared to existing benchmark approaches, our method shows superior performance, highlighting the value of domain-specific evaluation datasets for assessing LLM security measures. Moreover, when evaluated using prompts from a widely used dataset (i.e., AdvBench), it achieved an SR of 78.5%, higher than the state-of-the-art methods.

Title: Aligning Large Language Models for Faithful Integrity Against Opposing Argument

Authors: Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.01336
Pdf URL: https://arxiv.org/pdf/2501.01336
Copy Paste: [[2501.01336]] Aligning Large Language Models for Faithful Integrity Against Opposing Argument(https://arxiv.org/abs/2501.01336)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimates the model's confidence in the question based on the internal states during decoding and in the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via this https URL
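
Note: The abstract names Direct Preference Optimization (DPO) as the alignment step but gives no training details. As a rough sketch, the standard per-pair DPO loss can be written from sequence log-probabilities under the policy and a frozen reference model. The value beta = 0.1, the argument names, and the example numbers are assumptions for illustration, not settings from the paper.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is a summed log-probability of a full response:
    logp_* under the policy being trained, ref_logp_* under the frozen reference.
    """
    # How much more the policy favours each response than the reference does.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy has shifted toward the chosen answer.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

The loss equals log 2 when the policy and reference agree, and falls below it as the policy raises the preferred response relative to the rejected one.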

Title: Machine Learning for Modeling Wireless Radio Metrics with Crowdsourced Data and Local Environment Features

Authors: Yifeng Qiu, Alexis Bose
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01344
Pdf URL: https://arxiv.org/pdf/2501.01344
Copy Paste: [[2501.01344]] Machine Learning for Modeling Wireless Radio Metrics with Crowdsourced Data and Local Environment Features(https://arxiv.org/abs/2501.01344)
Keywords: robust
Abstract: This paper presents a suite of machine learning models, CRC-ML-Radio Metrics, designed for modeling RSRP, RSRQ, and RSSI wireless radio metrics in 4G environments. These models utilize crowdsourced data with local environmental features to enhance prediction accuracy across both indoor at elevation and outdoor urban settings. They achieve RMSE performance of 9.76 to 11.69 dB for RSRP, 2.90 to 3.23 dB for RSRQ, and 9.50 to 10.36 dB for RSSI, evaluated on over 300,000 data points in the Toronto, Montreal, and Vancouver areas. These results demonstrate the robustness and adaptability of the models, supporting precise network planning and quality of service optimization in complex Canadian urban environments.
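
Note: The reported figures are root-mean-square errors in dB. As a minimal sketch of how such a figure is computed from predicted and measured values (the function name and the sample numbers are hypothetical, not data from the paper):

```python
import math

def rmse(predicted, measured):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical RSRP values in dBm; the resulting RMSE is in dB.
    predicted_rsrp = [-90.0, -100.0]
    measured_rsrp = [-93.0, -104.0]
    print(rmse(predicted_rsrp, measured_rsrp))
```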

Title: Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability

Authors: Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Lu Cheng, Mengnan Du
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.01346
Pdf URL: https://arxiv.org/pdf/2501.01346
Copy Paste: [[2501.01346]] Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability(https://arxiv.org/abs/2501.01346)
Keywords: explainability
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and linguistic representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.

Title: Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement

Authors: Z. Zhang, B. Liu, J. Bao, L. Chen, S. Zhu, J. Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01368
Pdf URL: https://arxiv.org/pdf/2501.01368
Copy Paste: [[2501.01368]] Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement(https://arxiv.org/abs/2501.01368)
Keywords: diffusion
Abstract: Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propose a novel yet generic test-time controllable generation method that aims at natural text prompts and complex conditions. Specifically, we decouple spatial conditions into semantic and geometric conditions and then enforce their consistency during the image-generation process individually. As for the former, we target bridging the gap between the semantic condition and text prompts, as well as the gap between this condition and the attention map from diffusion models. To achieve this, we propose to first complete the prompt w.r.t. the semantic condition, and then remove the negative impact of distracting prompt words by measuring their statistics in attention maps as well as distances in word space w.r.t. this condition. To further cope with complex geometric conditions, we introduce a geometric transform module, in which Regions of Interest (RoIs) are identified in attention maps and further used to translate category-wise latents w.r.t. the geometric condition. More importantly, we propose a diffusion-based latents-refill method to explicitly remove the impact of latents at the RoI, reducing the artifacts on generated images. Experiments on the COCO-Stuff dataset show a 30% relative boost compared to SOTA training-free methods on layout consistency evaluation metrics.

Title: Iris Recognition for Infants

Authors: Rasel Ahmed Bhuiyan, Mateusz Trokielewicz, Piotr Maciejewicz, Sherri Bucher, Adam Czajka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01375
Pdf URL: https://arxiv.org/pdf/2501.01375
Copy Paste: [[2501.01375]] Iris Recognition for Infants(https://arxiv.org/abs/2501.01375)
Keywords: privacy, biometric, segmentation
Abstract: Non-invasive, efficient, physical token-less, accurate and stable identification methods for newborns may prevent baby swapping at birth, limit baby abductions and improve post-natal health monitoring across geographies, within the context of both the formal (i.e., hospitals) and informal (i.e., humanitarian and fragile settings) health sectors. This paper explores the feasibility of applying iris recognition to build biometric identifiers for 4-6 week old infants. We (a) collected near infrared (NIR) iris images from 17 infants using a specially-designed NIR iris sensor; (b) evaluated six iris recognition methods to assess readiness of the state-of-the-art iris recognition to be applied to newborns and infants; (c) proposed a new segmentation model that correctly detects iris texture within infant iris images, and coupled it with several iris texture encoding approaches to offer, to the best of our knowledge, the first fully-operational infant iris recognition system; and, (d) trained a StyleGAN-based model to synthesize iris images mimicking samples acquired from infants to deliver to the research community privacy-safe infant iris images. The proposed system, incorporating the specially-designed iris sensor and segmenter, and applied to the collected infant iris samples, achieved Equal Error Rate (EER) of 3% and Area Under ROC Curve (AUC) of 99%, compared to EER ≥ 20% and AUC ≤ 88% obtained for state-of-the-art adult iris recognition systems. This suggests that it may be feasible to design methods that successfully extract biometric features from infant irises.
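
Note: The Equal Error Rate (EER) quoted above is the operating point at which the false match rate equals the false non-match rate. The following is a minimal sketch of an EER estimate by threshold sweep. It assumes similarity scores where higher means a better match; the function name and the sample scores are hypothetical and unrelated to the paper's data.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes higher score = more similar. A comparison is accepted when
    its score is at or above the threshold.
    Returns (eer_estimate, threshold_at_eer).
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False non-match rate: genuine pairs rejected at this threshold.
        fnmr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False match rate: impostor pairs accepted at this threshold.
        fmr = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(fnmr - fmr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates where they are closest.
            best = (gap, (fnmr + fmr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.7, 0.4]    # same-eye comparisons (hypothetical)
    impostor = [0.1, 0.2, 0.3, 0.6]   # different-eye comparisons (hypothetical)
    print(equal_error_rate(genuine, impostor))
```

With only a handful of comparisons the two error curves may never cross exactly, which is why the sketch returns the midpoint at the closest threshold rather than an interpolated value.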

Title: OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

Authors: Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, Linjun Li, Yu Chen, Tao Jin, Zhou Zhao
Subjects: cs.CL, cs.HC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.01384
Pdf URL: https://arxiv.org/pdf/2501.01384
Copy Paste: [[2501.01384]] OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios(https://arxiv.org/abs/2501.01384)
Keywords: large language model
Abstract: With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at this https URL.

Title: A Unified Hyperparameter Optimization Pipeline for Transformer-Based Time Series Forecasting Models

Authors: Jingjing Xu, Caesar Wu, Yuan-Fang Li, Grégoire Danoy, Pascal Bouvry
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01394
Pdf URL: https://arxiv.org/pdf/2501.01394
Copy Paste: [[2501.01394]] A Unified Hyperparameter Optimization Pipeline for Transformer-Based Time Series Forecasting Models(https://arxiv.org/abs/2501.01394)
Keywords: transformer
Abstract: Transformer-based models for time series forecasting (TSF) have attracted significant attention in recent years due to their effectiveness and versatility. However, these models often require extensive hyperparameter optimization (HPO) to achieve the best possible performance, and a unified pipeline for HPO in transformer-based TSF remains lacking. In this paper, we present one such pipeline and conduct extensive experiments on several state-of-the-art (SOTA) transformer-based TSF models. These experiments are conducted on standard benchmark datasets to evaluate and compare the performance of different models, generating practical insights and examples. Our pipeline is generalizable beyond transformer-based architectures and can be applied to other SOTA models, such as Mamba and TimeMixer, as demonstrated in our experiments. The goal of this work is to provide valuable guidance to both industry practitioners and academic researchers in efficiently identifying optimal hyperparameters suited to their specific domain applications. The code and complete experimental results are available on GitHub.

Title: Best Transition Matrix Esitimation or Best Label Noise Robustness Classifier? Two Possible Methods to Enhance the Performance of T-revision

Authors: Haixu Liu, Zerui Tao, Naihui Zhang, Sixing Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01402
Pdf URL: https://arxiv.org/pdf/2501.01402
Copy Paste: [[2501.01402]] Best Transition Matrix Esitimation or Best Label Noise Robustness Classifier? Two Possible Methods to Enhance the Performance of T-revision(https://arxiv.org/abs/2501.01402)
Keywords: robust
Abstract: Label noise refers to incorrect labels in a dataset caused by human errors or collection defects, which is common in real-world applications and can significantly reduce the accuracy of models. This report explores how to estimate noise transition matrices and construct deep learning classifiers that are robust against label noise. In cases where the transition matrix is known, we apply forward correction and importance reweighting methods to correct the impact of label noise using the transition matrix. When the transition matrix is unknown or inaccurate, we use the anchor point assumption and T-Revision series methods to estimate or correct the noise matrix. In this study, we further improved the T-Revision method by developing T-Revision-Alpha and T-Revision-Softmax to enhance stability and robustness. Additionally, we designed and implemented two baseline classifiers, a Multi-Layer Perceptron (MLP) and ResNet-18, based on the cross-entropy loss function. We compared the performance of these methods on predicting clean labels and estimating transition matrices using the FashionMNIST dataset with known noise transition matrices. For the CIFAR-10 dataset, where the noise transition matrix is unknown, we estimated the noise matrix and evaluated the ability of the methods to predict clean labels.
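
Note: Forward correction, mentioned above for the known-matrix case, is commonly written as a cross-entropy on noisy-label probabilities obtained by pushing the model's clean-class probabilities through the transition matrix. The following is a minimal sketch under that common formulation, assuming T[i][j] = P(noisy label j | clean label i). The function names and the example matrix are illustrative assumptions, not the report's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_corrected_loss(logits, noisy_label, T):
    """Cross-entropy against the observed noisy label after forward correction.

    T[i][j] is assumed to be P(noisy = j | clean = i), so the predicted
    noisy-label distribution is T transposed times the clean posterior.
    """
    clean_probs = softmax(logits)
    n = len(clean_probs)
    noisy_probs = [sum(T[i][j] * clean_probs[i] for i in range(n))
                   for j in range(n)]
    return -math.log(noisy_probs[noisy_label])


if __name__ == "__main__":
    # Hypothetical symmetric noise: each label flips with probability 0.2.
    T = [[0.8, 0.2],
         [0.2, 0.8]]
    print(forward_corrected_loss([2.0, 0.0], noisy_label=0, T=T))
```

With an identity transition matrix the loss reduces to ordinary cross-entropy, which is a quick sanity check for any implementation.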

Title: nnY-Net: Swin-NeXt with Cross-Attention for 3D Medical Images Segmentation

Authors: Haixu Liu, Zerui Tao, Wenzhen Dong, Qiuzhuang Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01406
Pdf URL: https://arxiv.org/pdf/2501.01406
Copy Paste: [[2501.01406]] nnY-Net: Swin-NeXt with Cross-Attention for 3D Medical Images Segmentation(https://arxiv.org/abs/2501.01406)
Keywords: transformer, segmentation
Abstract: This paper provides a novel 3D medical image segmentation model structure called nnY-Net. This name comes from the fact that our model adds a cross-attention module at the bottom of the U-net structure to form a Y structure. We integrate the advantages of the two latest SOTA models, MedNeXt and SwinUNETR, and use Swin Transformer as the encoder and ConvNeXt as the decoder to innovatively design the Swin-NeXt structure. Our model uses the lowest-level feature map of the encoder as Key and Value and uses patient features such as pathology and treatment information as Query to calculate the attention weights in a Cross Attention module. Moreover, we simplify some pre- and post-processing as well as data enhancement methods in 3D image segmentation based on the dynUnet and nnU-net frameworks. We integrate our proposed Swin-NeXt with Cross-Attention framework into this framework. Last, we construct a DiceFocalCELoss to improve the training efficiency for the uneven data convergence of voxel classification.
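
Note: The abstract describes a cross-attention step in which patient features act as the Query and the encoder's lowest-level feature map supplies Key and Value. The following is a minimal sketch of scaled dot-product attention for one query vector, assuming the feature map has been flattened into a list of token vectors. The learned projection layers are omitted, and all names and numbers are hypothetical rather than taken from nnY-Net.

```python
import math

def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query  : list[float], e.g. an embedded patient-feature vector
    keys   : list[list[float]], one vector per feature-map token
    values : list[list[float]], one vector per feature-map token
    Returns (output_vector, attention_weights).
    """
    d = len(query)
    # Similarity between the query and every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax over tokens, shifted by the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values))
              for c in range(dim_v)]
    return output, weights


if __name__ == "__main__":
    patient_query = [1.0, 0.0]              # hypothetical patient embedding
    token_keys = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical feature-map tokens
    token_values = [[2.0], [4.0]]
    print(cross_attention(patient_query, token_keys, token_values))
```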

Title: Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension

Authors: Yaxian Wang, Henghui Ding, Shuting He, Xudong Jiang, Bifan Wei, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01416
Pdf URL: https://arxiv.org/pdf/2501.01416
Copy Paste: [[2501.01416]] Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension(https://arxiv.org/abs/2501.01416)
Keywords: robust, segmentation
Abstract: In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on single-target expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignments, including word-object, phrase-object, and text-image alignment. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling in multi-modal features with the same counting and pushing away those with different counting. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and also the other 4 tasks, including REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the remarkable superiority and generalizability of the proposed HieA2G.

Title: A Multi-task Supervised Compression Model for Split Computing

Authors: Yoshitomo Matsubara, Matteo Mendula, Marco Levorato
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.01420
Pdf URL: https://arxiv.org/pdf/2501.01420
Copy Paste: [[2501.01420]] A Multi-task Supervised Compression Model for Split Computing(https://arxiv.org/abs/2501.01420)
Keywords: segmentation
Abstract: Split computing (≠ split learning) is a promising approach to deep learning models for resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. The application of existing methods to multitask problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperformed or rivaled strong lightweight baseline models in terms of predictive performance for ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end-to-end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.

Title: R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Authors: Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01421
Pdf URL: https://arxiv.org/pdf/2501.01421
Copy Paste: [[2501.01421]] R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization(https://arxiv.org/abs/2501.01421)
Keywords: robust, extraction
Abstract: Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, SCR remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art results on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10× more accurate than previous SCR methods with similar map sizes and require at least 5× smaller map sizes than any other SCR method while still delivering superior accuracy. Code will be available at: this https URL.

Title: Multi-Modal Video Feature Extraction for Popularity Prediction

Authors: Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01422
Pdf URL: https://arxiv.org/pdf/2501.01422
Copy Paste: [[2501.01422]] Multi-Modal Video Feature Extraction for Popularity Prediction(https://arxiv.org/abs/2501.01422)
Keywords: extraction
Abstract: This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video modality features. Meanwhile, the cleaned video captions are incorporated into a carefully designed prompt framework, along with the video, as input for video-to-text generation models, which generate detailed text-based video content understanding. These texts are then encoded into vectors using a pre-trained BERT model. Based on the six sets of vectors mentioned above, a neural network is trained for each of the four prediction metrics. Moreover, the study conducts data mining and feature engineering based on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained, and the most stable model, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.

Title: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Authors: Jingfeng Yao, Xinggang Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01423
Pdf URL: https://arxiv.org/pdf/2501.01423
Copy Paste: [[2501.01423]] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models(https://arxiv.org/abs/2501.01423)
Keywords: diffusion, transformer
Abstract: Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs--representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: this https URL.

Title: Object-level Visual Prompts for Compositional Image Generation

Authors: Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2501.01424
Pdf URL: https://arxiv.org/pdf/2501.01424
Copy Paste: [[2501.01424]] Object-level Visual Prompts for Compositional Image Generation(https://arxiv.org/abs/2501.01424)
Keywords: diffusion
Abstract: We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.

Title: Unifying Specialized Visual Encoders for Video Language Models

Authors: Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01426
Pdf URL: https://arxiv.org/pdf/2501.01426
Copy Paste: [[2501.01426]] Unifying Specialized Visual Encoders for Video Language Models(https://arxiv.org/abs/2501.01426)
Keywords: large language model
Abstract: The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.

Title: VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01427
Pdf URL: https://arxiv.org/pdf/2501.01427
Copy Paste: [[2501.01427]] VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control(https://arxiv.org/abs/2501.01427)
Keywords: diffusion
Abstract: Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance and meanwhile support fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweight reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.