2024-12-03

Title: A Supercomputing Based Distributed Cloud Marketplace

Authors: Minjun Kim, Sina Falaki
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00016
Pdf URL: https://arxiv.org/pdf/2412.00016
Copy Paste: [[2412.00016]] A Supercomputing Based Distributed Cloud Marketplace(https://arxiv.org/abs/2412.00016)
Keywords: secure, security, privacy, attack
Abstract: The once mythological 51% attack has moved beyond the hypothetical and now poses a legitimate widespread threat to blockchain technology. Current blockchains provide inferior throughput capacity when compared to that of centralized systems, creating an obvious vulnerability which allows the 51% attack to occur within decentralized systems. Despite recent advancements in blockchain which introduce interesting models that achieve high throughputs with enhanced security and privacy, no current networks have evolved to deploy the optimal solution of combining scalability, security, and distributed systems to create a legitimate supercomputing enterprise-grade developer sandbox. In this paper, we introduce an infinitely scalable, secure, and high throughput blockchain capable of amassing supercomputer speeds with off-the-shelf hardware, LuluChain. LuluChain simplifies the blockchain model to obtain greater functionality, speed, scalability, privacy, and flexibility, that works to combat the inflated pricing models set by the oligopolistic cloud computing market as it requires minimal computational work. By eliminating the need for timestamp synchronization and majority agreement among all participants, LuluChain opens the door to reliable trust, low-cost instant transactions, and flexible instant smart contracts. The supercomputing, high throughput distributed system is the ideal foundation for an essential distributed cloud marketplace.

Title: TransFair: Transferring Fairness from Ocular Disease Classification to Progression Prediction

Authors: Leila Gheisi, Henry Chu, Raju Gottumukkala, Xingquan Zhu, Mengyu Wang, Min Shi
Subjects: cs.LG, cs.AI, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2412.00051
Pdf URL: https://arxiv.org/pdf/2412.00051
Copy Paste: [[2412.00051]] TransFair: Transferring Fairness from Ocular Disease Classification to Progression Prediction(https://arxiv.org/abs/2412.00051)
Keywords: robust, fair
Abstract: The use of artificial intelligence (AI) in automated disease classification significantly reduces healthcare costs and improves the accessibility of services. However, this transformation has given rise to concerns about the fairness of AI, which disproportionately affects certain groups, particularly patients from underprivileged populations. Recently, a number of methods and large-scale datasets have been proposed to address group performance disparities. Although these methods have shown effectiveness in disease classification tasks, they may fall short in ensuring fair prediction of disease progression, mainly because of limited longitudinal data with diverse demographics available for training a robust and equitable prediction model. In this paper, we introduce TransFair to enhance demographic fairness in progression prediction for ocular diseases. TransFair aims to transfer a fairness-enhanced disease classification model to the task of progression prediction with fairness preserved. Specifically, we train a fair EfficientNet, termed FairEN, equipped with a fairness-aware attention mechanism using extensive data for ocular disease classification. Subsequently, this fair classification model is adapted to a fair progression prediction model through knowledge distillation, which aims to minimize the latent feature distances between the classification and progression prediction models. We evaluate FairEN and TransFair for fairness-enhanced ocular disease classification and progression prediction using both two-dimensional (2D) and 3D retinal images. Extensive experiments and comparisons with models with and without considering fairness learning show that TransFair effectively enhances demographic equity in predicting ocular disease progression.
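
The distillation step described above (minimizing latent feature distances between the classification and progression prediction models) can be pictured with a small sketch. The function name and the use of a mean squared distance are assumptions made for illustration; the abstract does not say which distance TransFair uses.

```python
def feature_distillation_loss(teacher_feats, student_feats):
    """Mean squared distance between paired latent feature vectors.

    teacher_feats / student_feats: lists of equal-length float lists,
    one vector per sample. The name and the squared-distance choice are
    hypothetical, not taken from the paper.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("batch sizes differ")
    total = 0.0
    count = 0
    for t_vec, s_vec in zip(teacher_feats, student_feats):
        if len(t_vec) != len(s_vec):
            raise ValueError("feature dimensions differ")
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    # An empty batch contributes no loss.
    return total / count if count else 0.0
```

In a training loop, the teacher features would come from the frozen classification model and the student features from the progression model, with this term added to the task loss.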

Title: LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting

Authors: Lingzheng Zhang, Lifeng Shen, Yimin Zheng, Shiyuan Piao, Ziyue Li, Fugee Tsung
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00053
Pdf URL: https://arxiv.org/pdf/2412.00053
Copy Paste: [[2412.00053]] LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting(https://arxiv.org/abs/2412.00053)
Keywords: large language model
Abstract: Recent research has shown that large language models (LLMs) can be effectively used for real-world time series forecasting due to their strong natural language understanding capabilities. However, aligning time series into semantic spaces of LLMs comes with high computational costs and inference complexity, particularly for long-range time series generation. Building on recent advancements in using linear models for time series, this paper introduces an LLM-enhanced mixture of linear experts for precise and efficient time series forecasting. This approach involves developing a mixture of linear experts with multiple lookback lengths and a new multimodal fusion mechanism. The use of a mixture of linear experts is efficient due to its simplicity, while the multimodal fusion mechanism adaptively combines multiple linear experts based on the learned features of the text modality from pre-trained large language models. In experiments, we rethink the need to align time series to LLMs by existing time-series large language models and further discuss their efficiency and effectiveness in time series forecasting. Our experimental results show that the proposed LeMoLE model presents lower prediction errors and higher computational efficiency than existing LLM models.
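
A minimal sketch of the idea of mixing linear experts with different lookback lengths follows. It assumes one-step forecasts and a softmax gate; the gate scores are plain inputs here, whereas the abstract describes deriving the combination from text features of a pre-trained LLM. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def linear_expert(series, weights, bias=0.0):
    """One-step forecast from the last len(weights) points of the series."""
    if len(series) < len(weights):
        raise ValueError("series shorter than the expert's lookback")
    window = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, window)) + bias


def mixture_forecast(series, experts, gate_scores):
    """Gate-weighted sum of experts with different lookback lengths.

    experts: list of weight lists; len(weights) is that expert's lookback.
    gate_scores: one raw score per expert (stand-in for text-conditioned
    gating, which is not modeled here).
    """
    if len(experts) != len(gate_scores):
        raise ValueError("need one gate score per expert")
    gates = softmax(gate_scores)
    preds = [linear_expert(series, w) for w in experts]
    return sum(g * p for g, p in zip(gates, preds))
```

For example, with experts of lookback 1 and 2 and equal gate scores, the forecast is the plain average of the two expert outputs.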

Title: MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Authors: Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, Xiaoguang Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00060
Pdf URL: https://arxiv.org/pdf/2412.00060
Copy Paste: [[2412.00060]] MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image(https://arxiv.org/abs/2412.00060)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLM performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.

Title: Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Authors: Zhuofan Wen, Shangtong Gui, Yang Feng
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00061
Pdf URL: https://arxiv.org/pdf/2412.00061
Copy Paste: [[2412.00061]] Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration(https://arxiv.org/abs/2412.00061)
Keywords: large language model
Abstract: Inference acceleration of large language models (LLMs) has been put forward in many application scenarios and speculative decoding has shown its advantage in addressing inference acceleration. Speculative decoding usually introduces a draft model to assist the base LLM where the draft model produces drafts and the base LLM verifies the draft for acceptance or rejection. In this framework, the final inference speed is decided by the decoding speed of the draft model and the acceptance rate of the draft provided by the draft model. Currently the widely used draft models usually generate draft tokens for the next several positions in a non-autoregressive way without considering the correlations between draft tokens. Therefore, it has a high decoding speed but an unsatisfactory acceptance rate. In this paper, we focus on how to improve the performance of the draft model and aim to accelerate inference via a high acceptance rate. To this end, we propose a CTC-based draft model which strengthens the correlations between draft tokens during the draft phase, thereby generating higher-quality draft candidate sequences. Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed.
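
The draft-then-verify loop described above can be sketched as follows, assuming greedy decoding. This is not the CTC-based draft model from the paper: the draft and base models are toy stand-ins, and verification is done token by token for clarity, whereas real systems typically verify all draft positions in a single forward pass. All names are hypothetical.

```python
def toy_target(prefix):
    """Toy base model: the next token is (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix, k):
    """Toy draft model: mimics the base model but wrongly emits 0 instead of 5."""
    cur = list(prefix)
    tokens = []
    for _ in range(k):
        nxt = (cur[-1] + 1) % 10
        if nxt == 5:
            nxt = 0  # deliberate error so that a rejection occurs
        tokens.append(nxt)
        cur.append(nxt)
    return tokens


def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One draft-then-verify step; returns (new_prefix, n_accepted)."""
    out = list(prefix)
    accepted = 0
    for tok in draft_fn(prefix, k):
        expected = target_fn(out)
        if tok != expected:
            # First mismatch: keep the base model's token and stop.
            out.append(expected)
            return out, accepted
        out.append(tok)
        accepted += 1
    # Every draft token was accepted: the base model adds one more token.
    out.append(target_fn(out))
    return out, accepted
```

The ratio of accepted tokens to drafted tokens is the acceptance rate that the abstract identifies as one of the two factors governing the final speedup.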

Title: Deep Learning-Based Electricity Price Forecast for Virtual Bidding in Wholesale Electricity Market

Authors: Xuesong Wang, Sharaf K. Magableh, Oraib Dawaghreh, Caisheng Wang, Jiaxuan Gong, Zhongyang Zhao, Michael H. Liao
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.00062
Pdf URL: https://arxiv.org/pdf/2412.00062
Copy Paste: [[2412.00062]] Deep Learning-Based Electricity Price Forecast for Virtual Bidding in Wholesale Electricity Market(https://arxiv.org/abs/2412.00062)
Keywords: transformer
Abstract: Virtual bidding plays an important role in two-settlement electric power markets, as it can reduce discrepancies between day-ahead and real-time markets. Renewable energy penetration increases volatility in electricity prices, making accurate forecasting critical for virtual bidders, reducing uncertainty and maximizing profits. This study presents a Transformer-based deep learning model to forecast the price spread between real-time and day-ahead electricity prices in the ERCOT (Electric Reliability Council of Texas) market. The proposed model leverages various time-series features, including load forecasts, solar and wind generation forecasts, and temporal attributes. The model is trained under realistic constraints and validated using a walk-forward approach by updating the model every week. Based on the price spread prediction results, several trading strategies are proposed and the most effective strategy for maximizing cumulative profit under realistic market conditions is identified through backtesting. The results show that the strategy of trading only at the peak hour with a precision score of over 50% produces nearly consistent profit over the test period. The proposed method underscores the importance of an accurate electricity price forecasting model and introduces a new method of evaluating the price forecast model from a virtual bidder's perspective, providing valuable insights for future research.
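
The walk-forward validation mentioned above can be sketched as an index generator. This version assumes an expanding training window and hourly samples (so a weekly update would use step=168); the abstract does not state whether the window expands or slides, and the function name is hypothetical.

```python
def walk_forward_splits(n_samples, initial_train, step):
    """Yield (train_range, test_range) pairs for walk-forward validation.

    The training window always starts at index 0 and grows by `step`
    samples after each round (expanding window, an assumption).
    """
    if step <= 0 or initial_train <= 0:
        raise ValueError("initial_train and step must be positive")
    start = initial_train
    while start < n_samples:
        end = min(start + step, n_samples)
        # Train on everything before `start`, test on the next block.
        yield range(0, start), range(start, end)
        start = end
```

Each round retrains (or updates) the model on the training range and evaluates only on the following block, so no test sample is ever seen before it is predicted.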

Title: DiffGuard: Text-Based Safety Checker for Diffusion Models

Authors: Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00064
Pdf URL: https://arxiv.org/pdf/2412.00064
Copy Paste: [[2412.00064]] DiffGuard: Text-Based Safety Checker for Diffusion Models(https://arxiv.org/abs/2412.00064)
Keywords: protect, diffusion
Abstract: Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper first reveals their limitations and then presents a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving a performance that surpasses the best existing filters by over 14%.

Title: Targeted Therapy in Data Removal: Object Unlearning Based on Scene Graphs

Authors: Chenhan Zhang, Benjamin Zi Hao Zhao, Hassan Asghar, Dali Kaafar
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00067
Pdf URL: https://arxiv.org/pdf/2412.00067
Copy Paste: [[2412.00067]] Targeted Therapy in Data Removal: Object Unlearning Based on Scene Graphs(https://arxiv.org/abs/2412.00067)
Keywords: privacy
Abstract: Users may inadvertently upload personally identifiable information (PII) to Machine Learning as a Service (MLaaS) providers. When users no longer want their PII on these services, regulations like GDPR and COPPA mandate a right to forget for these users. As such, these services seek efficient methods to remove the influence of specific data points, which has motivated the introduction of machine unlearning. Traditionally, unlearning is performed with the removal of entire data samples (sample unlearning) or whole features across the dataset (feature unlearning). However, these approaches are not equipped to handle the more granular and challenging task of unlearning specific objects within a sample. To address this gap, we propose a scene graph-based object unlearning framework. This framework utilizes scene graphs, rich in semantic representation, to transparently translate unlearning requests into actionable steps. The result is the preservation of the overall semantic integrity of the generated image, bar the unlearned object. Further, we manage high computational overheads with influence functions to approximate the unlearning process. For validation, we evaluate the unlearned object's fidelity in outputs under the tasks of image reconstruction and image synthesis. Our proposed framework demonstrates improved object unlearning outcomes, with the preservation of unrequested samples in contrast to sample and feature unlearning methods. This work addresses critical privacy issues by increasing the granularity of targeted machine unlearning through forgetting specific object-level details without sacrificing the utility of the whole data sample or dataset feature.

Title: Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Authors: Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00069
Pdf URL: https://arxiv.org/pdf/2412.00069
Copy Paste: [[2412.00069]] Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning(https://arxiv.org/abs/2412.00069)
Keywords: large language model
Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not relieve the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose Condense-MoE (CD-MoE) that, instead of dropping the entire MoE layer, condenses the big, sparse MoE layer into a small but dense layer with only a few experts that are activated for all tokens. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated. We demonstrate the effectiveness of our method across multiple MoE models such as DeepSeekMoE and QwenMoE on various benchmarks. Specifically, for the DeepSeekMoE-16B model, our approach maintains nearly 90% of the average accuracy while reducing memory usage by 30% and enhancing inference speed by 30%. Moreover, we show that with lightweight expert fine-tuning, the pruned model can achieve further improvements on specific tasks. Our code is available at this https URL.

Title: Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions

Authors: Justin Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00073
Pdf URL: https://arxiv.org/pdf/2412.00073
Copy Paste: [[2412.00073]] Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions(https://arxiv.org/abs/2412.00073)
Keywords: robust, diffusion, generative
Abstract: The rise of advanced AI models like Generative Adversarial Networks (GANs) and diffusion models such as Stable Diffusion has made the creation of highly realistic images accessible, posing risks of misuse in misinformation and manipulation. This study evaluates the effectiveness of convolutional neural networks (CNNs), as well as DenseNet architectures, for detecting AI-generated images. Using variations of the CIFAKE dataset, including images generated by different versions of Stable Diffusion, we analyze the impact of updates and modifications such as Gaussian blurring, prompt text changes, and Low-Rank Adaptation (LoRA) on detection accuracy. The findings highlight vulnerabilities in current detection methods and propose strategies to enhance the robustness and reliability of AI-image detection systems.

Title: Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness

Authors: Avinash Amballa, Durga Sandeep Saluru, Gayathri Akkinapalli, Abhishek Sureddy, Akshay Kumar Sureddy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00074
Pdf URL: https://arxiv.org/pdf/2412.00074
Copy Paste: [[2412.00074]] Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness(https://arxiv.org/abs/2412.00074)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation. However, these models can inadvertently generate unsafe or biased responses when prompted with problematic inputs, raising significant ethical and practical concerns for real-world deployment. This research addresses the critical challenge of developing language models that generate both helpful and harmless content, navigating the delicate balance between model performance and safety. We demonstrate that incorporating safety-related instructions during the instruction-tuning of pre-trained models significantly reduces toxic responses to unsafe prompts without compromising performance on helpfulness datasets. We found Direct Preference Optimization (DPO) to be particularly effective, outperforming both SIT and RAFT by leveraging both chosen and rejected responses for learning. Our approach increased safe responses from 40% to over 90% across various harmfulness benchmarks. In addition, we discuss a rigorous evaluation framework encompassing specialized metrics and diverse datasets for safety and helpfulness tasks ensuring a comprehensive assessment of the model's capabilities.
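
To make concrete how DPO uses both chosen and rejected responses, here is a sketch of the standard per-pair DPO loss, -log sigmoid(beta * margin), where the margin compares policy and reference log-probabilities. The function and argument names and the default beta are illustrative assumptions; the abstract does not report the hyperparameters used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    margin = (policy - reference) log-ratio of the chosen response
             minus the same log-ratio of the rejected response.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's, which is the sense in which both responses contribute to learning.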

Title: Dual Prototyping with Domain and Class Prototypes for Affective Brain-Computer Interface in Unseen Target Conditions

Authors: Guangli Li, Zhehao Zhou, Tuo Sun, Ping Tan, Li Zhang, Zhen Liang
Subjects: cs.LG, cs.AI, cs.HC, eess.SP
Abstract URL: https://arxiv.org/abs/2412.00082
Pdf URL: https://arxiv.org/pdf/2412.00082
Copy Paste: [[2412.00082]] Dual Prototyping with Domain and Class Prototypes for Affective Brain-Computer Interface in Unseen Target Conditions(https://arxiv.org/abs/2412.00082)
Keywords: robust
Abstract: EEG signals have emerged as a powerful tool in affective brain-computer interfaces, playing a crucial role in emotion recognition. However, current deep transfer learning-based methods for EEG recognition face challenges due to the reliance on both source and target data in model learning, which significantly affects model performance and generalization. To overcome this limitation, we propose a novel framework (PL-DCP) and introduce the concepts of feature disentanglement and prototype inference. The dual prototyping mechanism incorporates both domain and class prototypes: domain prototypes capture individual variations across subjects, while class prototypes represent the ideal class distributions within their respective domains. Importantly, the proposed PL-DCP framework operates exclusively with source data during training, meaning that target data remains completely unseen throughout the entire process. To address label noise, we employ a pairwise learning strategy that encodes proximity relationships between sample pairs, effectively reducing the influence of mislabeled data. Experimental validation on the SEED and SEED-IV datasets demonstrates that PL-DCP, despite not utilizing target data during training, achieves performance comparable to deep transfer learning methods that require both source and target data. This highlights the potential of PL-DCP as an effective and robust approach for EEG-based emotion recognition.

Title: Visual Error Patterns in Multi-Modal AI: A Statistical Approach

Authors: Ching-Yi Wang
Subjects: cs.LG, cs.AI, cs.CV, stat.AP
Abstract URL: https://arxiv.org/abs/2412.00083
Pdf URL: https://arxiv.org/pdf/2412.00083
Copy Paste: [[2412.00083]] Visual Error Patterns in Multi-Modal AI: A Statistical Approach(https://arxiv.org/abs/2412.00083)
Keywords: large language model
Abstract: Artificial Intelligence (AI) has achieved transformative success across a wide range of domains, revolutionizing fields such as healthcare, education, and human-computer interaction. However, the mechanisms driving AI's performance often remain opaque, particularly in the context of large language models (LLMs), which have advanced at an unprecedented pace in recent years. Multi-modal large language models (MLLMs) like GPT-4o exemplify this evolution, integrating text, audio, and visual inputs to enable interaction across diverse domains. Despite their remarkable capabilities, these models remain largely "black boxes," offering limited insight into how they process multi-modal information internally. This lack of transparency poses significant challenges, including systematic biases, flawed associations, and unintended behaviors, which require careful investigation. Understanding the decision-making processes of MLLMs is both beneficial and essential for mitigating these challenges and ensuring their reliable deployment in critical applications. GPT-4o was chosen as the focus of this study for its advanced multi-modal capabilities, which allow simultaneous processing of textual and visual information. These capabilities make it an ideal model for investigating the parallels and distinctions between machine-driven and human-driven visual perception. While GPT-4o performs effectively in tasks involving structured and complete data, its reliance on bottom-up processing, which involves a feature-by-feature analysis of sensory inputs, presents challenges when interpreting complex or ambiguous stimuli. This limitation contrasts with human vision, which is remarkably adept at resolving ambiguity and reconstructing incomplete information through high-level cognitive processes.

Title: Unpacking the Individual Components of Diffusion Policy

Authors: Xiu Yuan
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00084
Pdf URL: https://arxiv.org/pdf/2412.00084
Copy Paste: [[2412.00084]] Unpacking the Individual Components of Diffusion Policy(https://arxiv.org/abs/2412.00084)
Keywords: diffusion, transformer
Abstract: Imitation Learning presents a promising approach for learning generalizable and complex robotic skills. The recently proposed Diffusion Policy generates robot action sequences through a conditional denoising diffusion process, achieving state-of-the-art performance compared to other imitation learning methods. This paper summarizes five key components of Diffusion Policy: 1) observation sequence input; 2) action sequence execution; 3) receding horizon; 4) U-Net or Transformer network architecture; and 5) FiLM conditioning. By conducting experiments across ManiSkill and Adroit benchmarks, this study aims to elucidate the contribution of each component to the success of Diffusion Policy in various scenarios. We hope our findings will provide valuable insights for the application of Diffusion Policy in future research and industry.

Title: Residual Attention Single-Head Vision Transformer Network for Rolling Bearing Fault Diagnosis in Noisy Environments

Authors: Songjiang Lai, Tsun-Hin Cheung, Jiayi Zhao, Kaiwen Xue, Ka-Chun Fung, Kin-Man Lam
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00085
Pdf URL: https://arxiv.org/pdf/2412.00085
Copy Paste: [[2412.00085]] Residual Attention Single-Head Vision Transformer Network for Rolling Bearing Fault Diagnosis in Noisy Environments(https://arxiv.org/abs/2412.00085)
Keywords: robust, extraction, transformer
Abstract: Rolling bearings play a crucial role in industrial machinery, directly influencing equipment performance, durability, and safety. However, harsh operating conditions, such as high speeds and temperatures, often lead to bearing malfunctions, resulting in downtime, economic losses, and safety hazards. This paper proposes the Residual Attention Single-Head Vision Transformer Network (RA-SHViT-Net) for fault diagnosis in rolling bearings. Vibration signals are transformed from the time to frequency domain using the Fast Fourier Transform (FFT) before being processed by RA-SHViT-Net. The model employs the Single-Head Vision Transformer (SHViT) to capture local and global features, balancing computational efficiency and predictive accuracy. To enhance feature extraction, the Adaptive Hybrid Attention Block (AHAB) integrates channel and spatial attention mechanisms. The network architecture includes Depthwise Convolution, Single-Head Self-Attention, Residual Feed-Forward Networks (Res-FFN), and AHAB modules, ensuring robust feature representation and mitigating gradient vanishing issues. Evaluation on the Case Western Reserve University and Paderborn University datasets demonstrates the RA-SHViT-Net's superior accuracy and robustness in complex, noisy environments. Ablation studies further validate the contributions of individual components, establishing RA-SHViT-Net as an effective tool for early fault detection and classification, promoting efficient maintenance strategies in industrial settings. Keywords: rolling bearings, fault diagnosis, Vision Transformer, attention mechanism, noisy environments, Fast Fourier Transform (FFT)

Title: Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks

Authors: Zuguang Li, Shaohua Wu, Liang Li, Songge Zhang
Subjects: cs.LG, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2412.00090
Pdf URL: https://arxiv.org/pdf/2412.00090
Copy Paste: [[2412.00090]] Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks(https://arxiv.org/abs/2412.00090)
Keywords: large language model
Abstract: In this letter, we propose an energy-efficient split learning (SL) framework for fine-tuning large language models (LLMs) using geo-distributed personal data at the network edge, where LLMs are split and trained alternately across massive mobile devices and an edge server. Considering the device heterogeneity and channel dynamics in edge networks, a Cut lAyer and computing Resource Decision (CARD) algorithm is developed to minimize training delay and energy consumption. Simulation results demonstrate that, compared to the benchmarks, the proposed approach reduces the average training delay and the server's energy consumption by 70.8% and 53.1%, respectively.

Title: A Novel Approach to Image Steganography Using Generative Adversarial Networks

Authors: Waheed Rehman
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00094
Pdf URL: https://arxiv.org/pdf/2412.00094
Copy Paste: [[2412.00094]] A Novel Approach to Image Steganography Using Generative Adversarial Networks(https://arxiv.org/abs/2412.00094)
Keywords: secure, robust, generative
Abstract: The field of steganography has long been focused on developing methods to securely embed information within various digital media while ensuring imperceptibility and robustness. However, the growing sophistication of detection tools and the demand for increased data hiding capacity have revealed limitations in traditional techniques. In this paper, we propose a novel approach to image steganography that leverages the power of generative adversarial networks (GANs) to address these challenges. By employing a carefully designed GAN architecture, our method ensures the creation of stego-images that are visually indistinguishable from their original counterparts, effectively thwarting detection by advanced steganalysis tools. Additionally, the adversarial training paradigm optimizes the balance between embedding capacity, imperceptibility, and robustness, enabling more efficient and secure data hiding. We evaluate our proposed method through a series of experiments on benchmark datasets and compare its performance against baseline techniques, including least significant bit (LSB) substitution and discrete cosine transform (DCT)-based methods. Our results demonstrate significant improvements in metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and robustness against detection. This work not only contributes to the advancement of image steganography but also provides a foundation for exploring GAN-based approaches for secure digital communication.
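
For context on the baseline and metric named above, the sketch below shows least significant bit (LSB) substitution on a flat list of 8-bit pixel values and the usual PSNR computation. It illustrates the classical baseline only, not the GAN-based method; function names are hypothetical.

```python
import math


def embed_lsb(pixels, bits):
    """Write each bit into the least significant bit of successive pixels."""
    if len(bits) > len(pixels):
        raise ValueError("message longer than cover capacity")
    out = list(pixels)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | (b & 1)  # clear the LSB, then set it to b
    return out


def extract_lsb(pixels, n_bits):
    """Read back the first n_bits least significant bits."""
    return [p & 1 for p in pixels[:n_bits]]


def psnr(original, modified, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if len(original) != len(modified) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, modified)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Because LSB substitution changes each pixel by at most 1, its PSNR is high, but the change is statistically detectable, which is the kind of limitation the abstract attributes to traditional techniques.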

Title: Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

Authors: Zhyar Rzgar K Rostam, Gábor Kertész
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00098
Pdf URL: https://arxiv.org/pdf/2412.00098
Copy Paste: [[2412.00098]] Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study(https://arxiv.org/abs/2412.00098)
Keywords: transformer, large language model
Abstract: The exponential growth of online textual content across diverse domains has necessitated advanced methods for automated text classification. Large Language Models (LLMs) based on transformer architectures have shown significant success in this area, particularly in natural language processing (NLP) tasks. However, general-purpose LLMs often struggle with domain-specific content, such as scientific texts, due to unique challenges like specialized vocabulary and imbalanced data. In this study, we fine-tune four state-of-the-art LLMs (BERT, SciBERT, BioBERT, and BlueBERT) on three datasets derived from the WoS-46985 dataset to evaluate their performance in scientific text classification. Our experiments reveal that domain-specific models, particularly SciBERT, consistently outperform general-purpose models in both abstract-based and keyword-based classification tasks. Additionally, we compare our achieved results with those reported in the literature for deep learning models, further highlighting the advantages of LLMs, especially when utilized in specific domains. The findings emphasize the importance of domain-specific adaptations for LLMs to enhance their effectiveness in specialized text classification tasks.

Title: Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

Authors: Maitreya Patel, Song Wen, Dimitris N. Metaxas, Yezhou Yang
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00100
Pdf URL: https://arxiv.org/pdf/2412.00100
Copy Paste: [[2412.00100]] Steering Rectified Flow Models in the Vector Field for Controlled Image Generation(https://arxiv.org/abs/2412.00100)
Keywords: diffusion
Abstract: Diffusion models (DMs) excel in photorealism, image editing, and solving inverse problems, aided by classifier-free guidance and image inversion techniques. However, rectified flow models (RFMs) remain underexplored for these tasks. Existing DM-based methods often require additional training, lack generalization to pretrained latent models, underperform, and demand significant computational resources due to extensive backpropagation through ODE solvers and inversion processes. In this work, we first develop a theoretical and empirical understanding of the vector field dynamics of RFMs in efficiently guiding the denoising trajectory. Our findings reveal that we can navigate the vector field in a deterministic and gradient-free manner. Utilizing this property, we propose FlowChef, which leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. FlowChef is a unified framework for controlled image generation that, for the first time, simultaneously addresses classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. Finally, we perform extensive evaluations and show that FlowChef significantly outperforms baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. Project Page: this https URL.

Title: Multi-Label Contrastive Learning : A Comprehensive Study

Authors: Alexandre Audibert, Aurélien Gauffre, Massih-Reza Amini
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00101
Pdf URL: https://arxiv.org/pdf/2412.00101
Copy Paste: [[2412.00101]] Multi-Label Contrastive Learning : A Comprehensive Study(https://arxiv.org/abs/2412.00101)
Keywords: robust
Abstract: Multi-label classification, which involves assigning multiple labels to a single input, has emerged as a key area in both research and industry due to its wide-ranging applications. Designing effective loss functions is crucial for optimizing deep neural networks for this task, as they significantly influence model performance and efficiency. Traditional loss functions, which often maximize likelihood under the assumption of label independence, may struggle to capture complex label relationships. Recent research has turned to supervised contrastive learning, a method that aims to create a structured representation space by bringing similar instances closer together and pushing dissimilar ones apart. Although contrastive learning offers a promising approach, applying it to multi-label classification presents unique challenges, particularly in managing label interactions and data structure. In this paper, we conduct an in-depth study of contrastive learning loss for multi-label classification across diverse settings. These include datasets with both small and large numbers of labels, datasets with varying amounts of training data, and applications in both computer vision and natural language processing. Our empirical results indicate that the promising outcomes of contrastive learning are attributable not only to the consideration of label interactions but also to the robust optimization scheme of the contrastive loss. Furthermore, while the supervised contrastive loss function faces challenges with datasets containing a small number of labels and ranking-based metrics, it demonstrates excellent performance, particularly in terms of Macro-F1, on datasets with a large number of labels.
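
Illustrative note: the abstract reports results in terms of Macro-F1 without defining it. The following is a minimal sketch of the standard definition for multi-label predictions, where each sample carries a set of labels. The function names and toy data are assumptions for illustration, and treating a label with no true or predicted occurrences as F1 = 0 is one common convention rather than something the paper specifies.

```python
def f1_for_label(y_true, y_pred, label):
    """Binary F1 for one label, computed over all samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    denom = 2 * tp + fp + fn
    # Assumed convention: a label that never appears scores 0.
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)

# Toy data (hypothetical): three samples, three labels.
y_true = [{"cat", "dog"}, {"dog"}, {"bird"}]
y_pred = [{"cat"}, {"dog", "bird"}, {"bird"}]
score = macro_f1(y_true, y_pred, ["cat", "dog", "bird"])
```

Since every label contributes equally to the average, Macro-F1 is sensitive to performance on infrequent labels, which is relevant when the label set is large.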

Title: ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering?

Authors: Pragati Shuddhodhan Meshram, Swetha Karthikeyan, Bhavya, Suma Bhat
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00102
Pdf URL: https://arxiv.org/pdf/2412.00102
Copy Paste: [[2412.00102]] ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering?(https://arxiv.org/abs/2412.00102)
Keywords: large language model
Abstract: Multi-modal Large Language Models (MLLMs) are gaining significant attention for their ability to process multi-modal data, providing enhanced contextual understanding of complex problems. MLLMs have demonstrated exceptional capabilities in tasks such as Visual Question Answering (VQA); however, they often struggle with fundamental engineering problems, and there is a scarcity of specialized datasets for training on topics like digital electronics. To address this gap, we propose a benchmark dataset called ElectroVizQA specifically designed to evaluate MLLMs' performance on digital electronic circuit problems commonly found in undergraduate curricula. This dataset, the first of its kind tailored for the VQA task in digital electronics, comprises approximately 626 visual questions, offering a comprehensive overview of digital electronics topics. This paper rigorously assesses the extent to which MLLMs can understand and solve digital electronic circuit questions, providing insights into their capabilities and limitations within this specialized domain. By introducing this benchmark dataset, we aim to motivate further research and development in the application of MLLMs to engineering education, ultimately bridging the performance gap and enhancing the efficacy of these models in technical fields.

Title: Differential learning kinetics govern the transition from memorization to generalization during in-context learning

Authors: Alex Nguyen, Gautam Reddy
Subjects: cs.LG, cond-mat.dis-nn, cs.AI, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2412.00104
Pdf URL: https://arxiv.org/pdf/2412.00104
Copy Paste: [[2412.00104]] Differential learning kinetics govern the transition from memorization to generalization during in-context learning(https://arxiv.org/abs/2412.00104)
Keywords: transformer
Abstract: Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explain the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.

Title: Predicting Extubation Failure in Intensive Care: The Development of a Novel, End-to-End Actionable and Interpretable Prediction System

Authors: Akram Yoosoofsah
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2412.00105
Pdf URL: https://arxiv.org/pdf/2412.00105
Copy Paste: [[2412.00105]] Predicting Extubation Failure in Intensive Care: The Development of a Novel, End-to-End Actionable and Interpretable Prediction System(https://arxiv.org/abs/2412.00105)
Keywords: interpretability
Abstract: Predicting extubation failure in intensive care is challenging due to complex data and the severe consequences of inaccurate predictions. Machine learning shows promise in improving clinical decision-making but often fails to account for temporal patient trajectories and model interpretability, highlighting the need for innovative solutions. This study aimed to develop an actionable, interpretable prediction system for extubation failure using temporal modelling approaches such as Long Short-Term Memory (LSTM) and Temporal Convolutional Networks (TCN). A retrospective cohort study of 4,701 mechanically ventilated patients from the MIMIC-IV database was conducted. Data from the 6 hours before extubation, including static and dynamic features, were processed through novel techniques addressing data inconsistency and synthetic data challenges. Feature selection was guided by clinical relevance and literature benchmarks. Iterative experimentation involved training LSTM, TCN, and LightGBM models. Initial results showed a strong bias toward predicting extubation success, despite advanced hyperparameter tuning and static data inclusion. Data was stratified by sampling frequency to reduce synthetic data impacts, leading to a fused decision system with improved performance. However, all architectures yielded modest predictive power (AUC-ROC ~0.6; F1 <0.5) with no clear advantage in incorporating static data or additional features. Ablation analysis indicated minimal impact of individual features on model performance. This thesis highlights the challenges of synthetic data in extubation failure prediction and introduces strategies to mitigate bias, including clinician-informed preprocessing and novel feature subsetting. While performance was limited, the study provides a foundation for future work, emphasising the need for reliable, interpretable models to optimise ICU outcomes.

Title: Demographic Predictability in 3D CT Foundation Embeddings

Authors: Guangyao Zheng, Michael A. Jacobs, Vishwa S. Parekh
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00110
Pdf URL: https://arxiv.org/pdf/2412.00110
Copy Paste: [[2412.00110]] Demographic Predictability in 3D CT Foundation Embeddings(https://arxiv.org/abs/2412.00110)
Keywords: privacy, protect, fair
Abstract: Self-supervised foundation models have recently been successfully extended to encode three-dimensional (3D) computed tomography (CT) images, with excellent performance across several downstream tasks, such as intracranial hemorrhage detection and lung cancer risk forecasting. However, as self-supervised models learn from complex data distributions, questions arise concerning whether these embeddings capture demographic information, such as age, sex, or race. Using the National Lung Screening Trial (NLST) dataset, which contains 3D CT images and demographic data, we evaluated a range of classifiers: softmax regression, linear regression, linear support vector machine, random forest, and decision tree, to predict sex, race, and age of the patients in the images. Our results indicate that the embeddings effectively encoded age and sex information, with a linear regression model achieving a root mean square error (RMSE) of 3.8 years for age prediction and a softmax regression model attaining an AUC of 0.998 for sex classification. Race prediction was less effective, with an AUC of 0.878. These findings suggest a detailed exploration into the information encoded in self-supervised learning frameworks is needed to help ensure fair, responsible, and patient privacy-protected healthcare AI.

Title: SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Authors: Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00114
Pdf URL: https://arxiv.org/pdf/2412.00114
Copy Paste: [[2412.00114]] SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments(https://arxiv.org/abs/2412.00114)
Keywords: defense, attack, diffusion
Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.

Title: OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00115
Pdf URL: https://arxiv.org/pdf/2412.00115
Copy Paste: [[2412.00115]] OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation(https://arxiv.org/abs/2412.00115)
Keywords: diffusion, transformer
Abstract: Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. The source code and the dataset are available at: this https URL.

Title: Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Authors: Xuexiang Niu, Jinping Tang, Lei Wang, Ge Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00122
Pdf URL: https://arxiv.org/pdf/2412.00122
Copy Paste: [[2412.00122]] Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback(https://arxiv.org/abs/2412.00122)
Keywords: diffusion
Abstract: Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, due to the lack of focus in feedback content, especially regarding the object type and quantity, these techniques struggle to accurately match text and images when faced with specified prompts. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, including three stages. First, objects in images generated by the diffusion model are detected to obtain the object categories and quantities. Meanwhile, the confidence of category and quantity can be derived from the detection results and given prompts. Next, we define a novel matching score, based on the above confidence, to measure text-image alignment. It can guide the model for feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward function gradients to generate semantically related images. Unlike previous feedback methods that focus more on overall matching, we place more emphasis on the accuracy of entity categories and quantities. Besides, we construct a text-to-image dataset for studying compositional generation, including 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. In addition, our model can also serve as a metric for evaluating text-image alignment in other models. All code and the dataset are available at this https URL.

Title: Streamlined Federated Unlearning: Unite as One to Be Highly Efficient

Authors: Lei Zhou, Youwen Zhu, Qiao Xue, Ji Zhang, Pengfei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00126
Pdf URL: https://arxiv.org/pdf/2412.00126
Copy Paste: [[2412.00126]] Streamlined Federated Unlearning: Unite as One to Be Highly Efficient(https://arxiv.org/abs/2412.00126)
Keywords: privacy, attack, federate
Abstract: Recently, the enactment of "right to be forgotten" laws and regulations has imposed new privacy requirements on federated learning (FL). Researchers aim to remove the influence of certain data from the trained model without training from scratch through federated unlearning (FU). While current FU research has shown progress in enhancing unlearning efficiency, it often results in degraded model performance upon achieving the goal of data unlearning, necessitating additional steps to recover the performance of the unlearned model. Moreover, these approaches also suffer from many shortcomings such as high consumption of computational and storage resources. To this end, we propose a streamlined federated unlearning approach (SFU) aimed at effectively removing the influence of target data while preserving the model's performance on the retained data without degradation. We design a practical multi-teacher system that achieves both target data influence removal and model performance preservation by guiding the unlearned model through several distinct teacher models. SFU is both computationally and storage-efficient, highly flexible, and generalizable. We conducted extensive experiments on both image and text benchmark datasets. The results demonstrate that SFU significantly improves time and communication efficiency compared to the benchmark retraining method and significantly outperforms existing state-of-the-art (SOTA) methods. Additionally, we verified the effectiveness of SFU using the backdoor attack.

Title: Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Authors: Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00127
Pdf URL: https://arxiv.org/pdf/2412.00127
Copy Paste: [[2412.00127]] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads(https://arxiv.org/abs/2412.00127)
Keywords: diffusion, transformer
Abstract: We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved contents. Unlike prior work on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus -- by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can further embrace post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines including Show-o and Chameleon across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting the potential for handling intricate practical generation tasks.

Title: Scaling Particle Collision Data Analysis

Authors: Hengkui Wu, Panpan Chi, Yongfeng Zhu, Liujiang Liu, Shuyang Hu, Yuexin Wang, Chen Zhou, Qihao Wang, Yingsi Xin, Bruce Liu, Dahao Liang, Xinglong Jia, Manqi Ruan
Subjects: cs.LG, hep-ex, physics.data-an
Abstract URL: https://arxiv.org/abs/2412.00129
Pdf URL: https://arxiv.org/pdf/2412.00129
Copy Paste: [[2412.00129]] Scaling Particle Collision Data Analysis(https://arxiv.org/abs/2412.00129)
Keywords: large language model
Abstract: For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown enormous capabilities in handling general tasks; however, these models encounter difficulties in addressing real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high energy physics. This limitation is primarily due to BPE tokenization's inefficacy with numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. The project code is available at this https URL. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical categorization challenge in high-energy physics that distinguishes jets originating from various quarks or gluons. Our results indicate that BBT-Neutron achieves comparable performance to state-of-the-art task-specific JoI models. Furthermore, we examine the scaling behavior of BBT-Neutron's performance with increasing data volume, suggesting the potential for BBT-Neutron to serve as a foundational model for particle physics data analysis, with possible extensions to a broad spectrum of scientific computing applications for Big Science experiments, industrial manufacturing and spatial computing.

Title: Event-based Tracking of Any Point with Motion-Robust Correlation Features

Authors: Friedhelm Hamann, Daniel Gehrig, Filbert Febryanto, Kostas Daniilidis, Guillermo Gallego
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00133
Pdf URL: https://arxiv.org/pdf/2412.00133
Copy Paste: [[2412.00133]] Event-based Tracking of Any Point with Motion-Robust Correlation Features(https://arxiv.org/abs/2412.00133)
Keywords: robust
Abstract: Tracking any point (TAP) recently shifted the motion estimation paradigm from focusing on individual salient points with local templates to tracking arbitrary points with global image contexts. However, while research has mostly focused on driving the accuracy of models in nominal settings, addressing scenarios with difficult lighting conditions and high-speed motions remains out of reach due to the limitations of the sensor. This work addresses this challenge with the first event camera-based TAP method. It leverages the high temporal resolution and high dynamic range of event cameras for robust high-speed tracking, and the global contexts in TAP methods to handle asynchronous and sparse event measurements. We further extend the TAP framework to handle event feature variations induced by motion - thereby addressing an open challenge in purely event-based tracking - with a novel feature alignment loss which ensures the learning of motion-robust features. Our method is trained with data from a new data generation pipeline and systematically ablated across all design decisions. Our method shows strong cross-dataset generalization and performs 135% better on the average Jaccard metric than the baselines. Moreover, on an established feature tracking benchmark, it achieves a 19% improvement over the previous best event-only method and even surpasses the previous best events-and-frames method by 3.7%.

Title: FonTS: Text Rendering with Typography and Style Controls

Authors: Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, Xingxing Zou
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00136
Pdf URL: https://arxiv.org/pdf/2412.00136
Copy Paste: [[2412.00136]] FonTS: Text Rendering with Typography and Style Controls(https://arxiv.org/abs/2412.00136)
Keywords: diffusion, transformer
Abstract: Visual text images are prevalent in various applications, requiring careful font selection and typographic choices. Recent advances in Diffusion Transformer (DiT)-based text-to-image (T2I) models show promise in automating these processes. However, these methods still face challenges such as inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline to address these issues by enhancing controllability over typography and style in text rendering. We introduce Typography Control (TC) finetuning, an efficient parameter fine-tuning method, and enclosing typography control tokens (ETC-tokens), which enable precise word-level application of typographic features. To further enhance style control, we present a Style Control Adapter (SCA) that injects style information through image inputs independent of text prompts. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in Basic and Artistic Text Rendering (BTR and ATR) tasks. Our results mark a significant advancement in the precision and adaptability of T2I models, presenting new possibilities for creative applications and design-oriented tasks.

Title: EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval

Authors: Muhammad Huzaifa, Yova Kementchedjhieva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00139
Pdf URL: https://arxiv.org/pdf/2412.00139
Copy Paste: [[2412.00139]] EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval(https://arxiv.org/abs/2412.00139)
Keywords: robust
Abstract: Text-to-image retrieval is a critical task for managing diverse visual content, but common benchmarks for the task rely on small, single-domain datasets that fail to capture real-world complexity. Pre-trained vision-language models tend to perform well with easy negatives but struggle with hard negatives--visually similar yet incorrect images--especially in open-domain scenarios. To address this, we introduce Episodic Few-Shot Adaptation (EFSA), a novel test-time framework that adapts pre-trained models dynamically to a query's domain by fine-tuning on top-k retrieved candidates and synthetic captions generated for them. EFSA improves performance across diverse domains while preserving generalization, as shown in evaluations on queries from eight highly distinct visual domains and an open-domain retrieval pool of over one million images. Our work highlights the potential of episodic few-shot adaptation to enhance robustness in the critical and understudied task of open-domain text-to-image retrieval.

Title: Differentiable Topology Estimating from Curvatures for 3D Shapes

Authors: Yihao Luo
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00140
Pdf URL: https://arxiv.org/pdf/2412.00140
Copy Paste: [[2412.00140]] Differentiable Topology Estimating from Curvatures for 3D Shapes(https://arxiv.org/abs/2412.00140)
Keywords: robust
Abstract: In the field of data-driven 3D shape analysis and generation, the estimation of global topological features from localized representations such as point clouds, voxels, and neural implicit fields is a longstanding challenge. This paper introduces a novel, differentiable algorithm tailored to accurately estimate the global topology of 3D shapes, overcoming the limitations of traditional methods rooted in mesh reconstruction and topological data analysis. The proposed method ensures high accuracy, efficiency, and instant computation with GPU compatibility. It begins with an efficient calculation of the self-adjoint Weingarten map for point clouds and its adaptations for other modalities. The curvatures are then extracted, and their integration over tangent differentiable Voronoi elements is utilized to estimate key topological invariants, including the Euler number and Genus. Additionally, an auto-optimization mechanism is implemented to refine the local moving frames and area elements based on the integrity of topological invariants. Experimental results demonstrate the method's superior performance across various datasets. The robustness and differentiability of the algorithm ensure its seamless integration into deep learning frameworks, offering vast potential for downstream tasks in 3D shape analysis.
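
Illustrative note: the abstract says curvatures are integrated over area elements to estimate the Euler number and genus, but does not state the relation used. For a closed, orientable surface the classical Gauss-Bonnet theorem provides such a link, and a discretized version is sketched here. The symbols $K_i$ (Gaussian curvature at sample $i$) and $A_i$ (area of its Voronoi element) are notation assumed for illustration, and the paper's actual estimator may differ.

```latex
\begin{align}
  \int_{S} K \, dA &= 2\pi \, \chi(S), \\
  \chi(S) &= 2 - 2g, \\
  \chi(S) &\approx \frac{1}{2\pi} \sum_{i=1}^{N} K_i \, A_i ,
  \qquad
  g \approx 1 - \frac{1}{4\pi} \sum_{i=1}^{N} K_i \, A_i .
\end{align}
```

As a sanity check, a sphere of radius $r$ has $K = 1/r^2$ and area $4\pi r^2$, so the integral is $4\pi$, giving $\chi = 2$ and $g = 0$.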

Title: Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

Authors: Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00142
Pdf URL: https://arxiv.org/pdf/2412.00142
Copy Paste: [[2412.00142]] Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers(https://arxiv.org/abs/2412.00142)
Keywords: robust, extraction, generative
Abstract: Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.

Title: MPQ-Diff: Mixed Precision Quantization for Diffusion Models

Authors: Rocco Manz Maruzzelli, Basile Lewandowski, Lydia Y. Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00144
Pdf URL: https://arxiv.org/pdf/2412.00144
Copy Paste: [[2412.00144]] MPQ-Diff: Mixed Precision Quantization for Diffusion Models(https://arxiv.org/abs/2412.00144)
Keywords: diffusion
Abstract: Diffusion models (DMs) generate remarkably high-quality images via a stochastic denoising process, which unfortunately incurs high sampling time. Post-quantizing trained diffusion models to fixed bit-widths, e.g., 4 bits for weights and 8 bits for activations, has been shown to be effective in accelerating sampling while maintaining image quality. Motivated by the observation that the cross-layer dependency of DMs varies across layers and sampling steps, we propose a mixed-precision quantization scheme, MPQ-Diff, which allocates different bit-widths to the weights and activations of the layers. We advocate using the cross-layer correlation of a given layer, termed the network orthogonality metric, as a proxy to measure the relative importance of a layer per sampling step. We further adopt a uniform sampling scheme to avoid the excessive profiling overhead of estimating orthogonality across all time steps. We evaluate the proposed mixed-precision scheme on LSUN and ImageNet, showing a significant improvement in FID from 65.73 to 15.39 and from 52.66 to 14.93, respectively, compared to fixed-precision quantization.
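
Illustrative sketch (not the MPQ-Diff implementation): the idea of giving different layers different bit-widths can be shown with plain uniform quantization. The layer names, the weight values, and the hand-picked bit allocation below are hypothetical; the abstract's orthogonality-based importance metric is not reproduced here.

```python
def quantize_uniform(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant tensor needs no quantization grid.
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def max_abs_error(values, bits):
    """Largest absolute difference between the original and quantized values."""
    quantized_values = quantize_uniform(values, bits)
    return max(abs(v - q) for v, q in zip(values, quantized_values))


# Hypothetical per-layer weights and a hand-picked bit allocation (assumptions).
layers = {
    "conv_in": [-0.91, -0.20, 0.05, 0.33, 0.87],
    "attn_mid": [-0.40, -0.10, 0.00, 0.12, 0.45],
}
bit_allocation = {"conv_in": 8, "attn_mid": 4}

# Mixed precision: each layer is quantized with its own bit-width.
quantized = {
    name: quantize_uniform(weights, bit_allocation[name])
    for name, weights in layers.items()
}
```

With this scheme the rounding error of a layer is bounded by half its step size, so a layer assigned more bits gets a tighter bound; a mixed-precision method then has to decide which layers deserve the extra bits.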

Title: Knowledge-Augmented Explainable and Interpretable Learning for Anomaly Detection and Diagnosis

Authors: Martin Atzmueller, Tim Bohne, Patricia Windler
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00146
Pdf URL: https://arxiv.org/pdf/2412.00146
Copy Paste: [[2412.00146]] Knowledge-Augmented Explainable and Interpretable Learning for Anomaly Detection and Diagnosis(https://arxiv.org/abs/2412.00146)
Keywords: interpretability, explainability
Abstract: Knowledge-augmented learning enables the combination of knowledge-based and data-driven approaches. For anomaly detection and diagnosis, understandability is typically an important factor, especially in high-risk areas. Therefore, explainability and interpretability are also major criteria in such contexts. This chapter focuses on knowledge-augmented explainable and interpretable learning to enhance understandability, transparency and ultimately computational sensemaking. We exemplify different approaches and methods in the domains of anomaly detection and diagnosis - from comparatively simple interpretable methods towards more advanced neuro-symbolic approaches.

Title: Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

Authors: Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00150
Pdf URL: https://arxiv.org/pdf/2412.00150
Copy Paste: [[2412.00150]] Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise(https://arxiv.org/abs/2412.00150)
Keywords: robust
Abstract: Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, thus robustly classifying the training samples. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMnist, and OrgancMnist datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.

Title: DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00151
Pdf URL: https://arxiv.org/pdf/2412.00151
Copy Paste: [[2412.00151]] DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness(https://arxiv.org/abs/2412.00151)
Keywords: interpretability, large language model
Abstract: Document Visual Question Answering (VQA) requires models to interpret textual information within complex visual layouts and comprehend spatial relationships to answer questions based on document images. Existing approaches often lack interpretability and fail to precisely localize answers within the document, hindering users' ability to verify responses and understand the reasoning process. Moreover, standard metrics like Average Normalized Levenshtein Similarity (ANLS) focus on text accuracy but overlook spatial correctness. We introduce DLaVA, a novel method that enhances Multimodal Large Language Models (MLLMs) with answer localization capabilities for Document VQA. Our approach integrates image annotation directly into the MLLM pipeline, improving interpretability by enabling users to trace the model's reasoning. We present both OCR-dependent and OCR-free architectures, with the OCR-free approach eliminating the need for separate text recognition components, thus reducing complexity. To the best of our knowledge, DLaVA is the first approach to introduce answer localization within multimodal QA, marking a significant step forward in enhancing user trust and reducing the risk of AI hallucinations. Our contributions include enhancing interpretability and reliability by grounding responses in spatially annotated visual content, introducing answer localization in MLLMs, proposing a streamlined pipeline that combines an MLLM with a text detection module, and conducting comprehensive evaluations using both textual and spatial accuracy metrics, including Intersection over Union (IoU). Experimental results on standard datasets demonstrate that DLaVA achieves SOTA performance, significantly enhancing model transparency and reliability. Our approach sets a new benchmark for Document VQA, highlighting the critical importance of precise answer localization and model interpretability.
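
Illustrative sketch: Intersection over Union (IoU), the spatial metric named in the abstract, can be computed for two axis-aligned boxes as below. The `(x_min, y_min, x_max, y_max)` box layout and the function name are assumptions; the paper's own evaluation code may differ.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
reference = (1, 1, 3, 3)
overlap = box_iou(predicted, reference)  # intersection area 1, union area 7
```

A predicted answer box can then be counted as correctly localized when its IoU with the reference box exceeds a chosen threshold.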

Title: ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

Authors: Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00153
Pdf URL: https://arxiv.org/pdf/2412.00153
Copy Paste: [[2412.00153]] ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model(https://arxiv.org/abs/2412.00153)
Keywords: segmentation
Abstract: Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, integrating the prediction result from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.

Title: T-3DGS: Removing Transient Objects for 3D Scene Reconstruction

Authors: Vadim Pryadilshchikov, Alexander Markin, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00155
Pdf URL: https://arxiv.org/pdf/2412.00155
Copy Paste: [[2412.00155]] T-3DGS: Removing Transient Objects for 3D Scene Reconstruction(https://arxiv.org/abs/2412.00155)
Keywords: segmentation
Abstract: We propose a novel framework to remove transient objects from input videos for 3D scene reconstruction using Gaussian Splatting. Our framework consists of the following steps. In the first step, we propose an unsupervised training strategy for a classification network to distinguish between transient objects and static scene parts based on their different training behavior inside the 3D Gaussian Splatting reconstruction. In the second step, we improve the boundary quality and stability of the detected transients by combining our results from the first step with an off-the-shelf segmentation method. We also propose a simple and effective strategy to track objects in the input video forward and backward in time. Our results show an improvement over the current state of the art in existing sparsely captured datasets and significant improvements in a newly proposed densely captured (video) dataset. More results and code are available at this https URL.

Title: VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Authors: Taesung Kwon, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00156
Pdf URL: https://arxiv.org/pdf/2412.00156
Copy Paste: [[2412.00156]] VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models(https://arxiv.org/abs/2412.00156)
Keywords: diffusion
Abstract: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present batch-consistent inversion, an initialization technique that incorporates informative latents from the measurement frame. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a single NVIDIA 4090 GPU. Project page: this https URL.

Title: AerialGo: Walking-through City View Generation from Aerial Perspectives

Authors: Fuqiang Zhao, Yijing Guo, Siyuan Yang, Xi Chen, Luo Wang, Lan Xu, Yingliang Zhang, Yujiao Shi, Jingyi Yu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00157
Pdf URL: https://arxiv.org/pdf/2412.00157
Copy Paste: [[2412.00157]] AerialGo: Walking-through City View Generation from Aerial Perspectives(https://arxiv.org/abs/2412.00157)
Keywords: privacy, diffusion, generative
Abstract: High-quality 3D urban reconstruction is essential for applications in urban planning, navigation, and AR/VR. However, capturing detailed ground-level data across cities is labor-intensive and raises significant privacy concerns related to sensitive information, such as vehicle plates, faces, and other personal identifiers. To address these challenges, we propose AerialGo, a novel framework that generates realistic walking-through city views from aerial images, leveraging multi-view diffusion models to achieve scalable, photorealistic urban reconstructions without direct ground-level data collection. By conditioning ground-view synthesis on accessible aerial data, AerialGo bypasses the privacy risks inherent in ground-level imagery. To support model training, we introduce the AerialGo dataset, a large-scale dataset containing diverse aerial and ground-view images, paired with camera and depth information, designed to support generative urban reconstruction. Experiments show that AerialGo significantly enhances ground-level realism and structural coherence, providing a privacy-conscious, scalable solution for city-scale 3D modeling.

Title: STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Authors: Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, Tat-Seng Chua
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00161
Pdf URL: https://arxiv.org/pdf/2412.00161
Copy Paste: [[2412.00161]] STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training(https://arxiv.org/abs/2412.00161)
Keywords: large language model
Abstract: Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves. Specifically, we first induce Spatio-Temporal Scene Graph (STSG) representations of diverse videos to capture fine-grained, multi-granular video semantics. Then, the STSGs guide the derivation of multi-step reasoning Question-Answer (QA) data with Chain-of-Thought (CoT) rationales. Both answers and rationales are integrated into the training objective, aiming to enhance the model's reasoning abilities through supervision over explicit reasoning steps. Experimental results demonstrate the effectiveness of STEP across models of varying scales, with a significant 21.3% improvement in tasks requiring three or more reasoning steps. Furthermore, it achieves superior performance with a minimal amount of self-generated rationale-enriched training samples in both compositional reasoning and comprehensive understanding benchmarks, highlighting its broad applicability and vast potential.

Title: To Ensemble or Not: Assessing Majority Voting Strategies for Phishing Detection with Large Language Models

Authors: Fouad Trad, Ali Chehab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00166
Pdf URL: https://arxiv.org/pdf/2412.00166
Copy Paste: [[2412.00166]] To Ensemble or Not: Assessing Majority Voting Strategies for Phishing Detection with Large Language Models(https://arxiv.org/abs/2412.00166)
Keywords: large language model
Abstract: The effectiveness of Large Language Models (LLMs) significantly relies on the quality of the prompts they receive. However, even when processing identical prompts, LLMs can yield varying outcomes due to differences in their training processes. To leverage the collective intelligence of multiple LLMs and enhance their performance, this study investigates three majority voting strategies for text classification, focusing on phishing URL detection. The strategies are: (1) a prompt-based ensemble, which utilizes majority voting across the responses generated by a single LLM to various prompts; (2) a model-based ensemble, which entails aggregating responses from multiple LLMs to a single prompt; and (3) a hybrid ensemble, which combines the two methods by sending different prompts to multiple LLMs and then aggregating their responses. Our analysis shows that ensemble strategies are most suited in cases where individual components exhibit equivalent performance levels. However, when there is a significant discrepancy in individual performance, the effectiveness of the ensemble method may not exceed that of the highest-performing single LLM or prompt. In such instances, opting for ensemble techniques is not recommended.
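
Illustrative sketch: the three voting strategies listed in the abstract can be expressed in a few lines of Python. The stand-in "models" are plain functions rather than real LLM calls, the tie handling (returning `None`) is an assumption, and the hybrid variant here queries every model with every prompt, which is one possible reading of the description.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None when the top count is tied."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def prompt_ensemble(model, prompts, url):
    """One model, several prompts."""
    return majority_vote([model(prompt, url) for prompt in prompts])


def model_ensemble(models, prompt, url):
    """Several models, one prompt."""
    return majority_vote([model(prompt, url) for model in models])


def hybrid_ensemble(models, prompts, url):
    """Several models, several prompts (every pairing is queried)."""
    return majority_vote([model(p, url) for model in models for p in prompts])


# Hypothetical stand-ins for LLM classifiers (assumptions, not real APIs).
def keyword_model(prompt, url):
    return "phishing" if "login" in url else "legitimate"


def permissive_model(prompt, url):
    return "legitimate"


def prompt_sensitive_model(prompt, url):
    return "phishing" if "strict" in prompt else "legitimate"


models = [keyword_model, permissive_model, prompt_sensitive_model]
prompts = ["strict check", "quick check", "plain check"]
url = "http://example.test/login"
```

With these toy classifiers the three strategies can disagree on the same URL, which mirrors the abstract's point that an ensemble is only as useful as the balance of its components.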

Title: Origin-Destination Demand Prediction: An Urban Radiation and Attraction Perspective

Authors: Xuan Ma, Zepeng Bao, Ming Zhong, Yuanyuan Zhu, Chenliang Li, Jiawei Jiang, Qing Li, Tieyun Qian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00167
Pdf URL: https://arxiv.org/pdf/2412.00167
Copy Paste: [[2412.00167]] Origin-Destination Demand Prediction: An Urban Radiation and Attraction Perspective(https://arxiv.org/abs/2412.00167)
Keywords: explainability
Abstract: In recent years, origin-destination (OD) demand prediction has gained significant attention for its profound implications in urban development. Existing data-driven deep learning methods primarily focus on the spatial or temporal dependency between regions yet neglect regions' fundamental functional differences. Though knowledge-driven physical methods have characterised regions' functions by their radiation and attraction capacities, these functions are defined on numerical factors like population without considering regions' intrinsic nominal attributes, e.g., whether a region is a residential or an industrial district. Moreover, the complicated relationships between the two types of capacities, e.g., the radiation capacity of a residential district in the morning will be transformed into the attraction capacity in the evening, are entirely missing from physical methods. In this paper, we not only generalize the physical radiation and attraction capacities into the deep learning framework with the extended capability to fulfil regions' functions, but also present a new model that captures the relationships between the two types of capacities. Specifically, we first model regions' radiation and attraction capacities using a bilateral branch network, each equipped with regions' attribute representations. We then describe the transformation relationship of different capacities of the same region using a hypergraph-based parameter generation method. We finally unveil the competition relationship of different regions with the same attraction capacity through cluster-based adversarial learning. Extensive experiments on two datasets demonstrate the consistent improvements of our method over the state-of-the-art baselines, as well as the good explainability of regions' functions using their nominal attributes.

Title: Spatial Clustering of Molecular Localizations with Graph Neural Networks

Authors: Jesús Pineda, Sergi Masó-Orriols, Joan Bertran, Mattias Goksör, Giovanni Volpe, Carlo Manzo
Subjects: cs.LG, physics.bio-ph, physics.data-an, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.00173
Pdf URL: https://arxiv.org/pdf/2412.00173
Copy Paste: [[2412.00173]] Spatial Clustering of Molecular Localizations with Graph Neural Networks(https://arxiv.org/abs/2412.00173)
Keywords: robust
Abstract: Single-molecule localization microscopy generates point clouds corresponding to fluorophore localizations. Spatial cluster identification and analysis of these point clouds are crucial for extracting insights about molecular organization. However, this task becomes challenging in the presence of localization noise, high point density, or complex biological structures. Here, we introduce MIRO (Multimodal Integration through Relational Optimization), an algorithm that uses recurrent graph neural networks to transform the point clouds in order to improve clustering efficiency when applying conventional clustering techniques. We show that MIRO supports simultaneous processing of clusters of different shapes and at multiple scales, demonstrating improved performance across varied datasets. Our comprehensive evaluation demonstrates MIRO's transformative potential for single-molecule localization applications, showcasing its capability to revolutionize cluster analysis and provide accurate, reliable details of molecular architecture. In addition, MIRO's robust clustering capabilities hold promise for applications in various fields such as neuroscience, for the analysis of neural connectivity patterns, and environmental science, for studying spatial distributions of ecological data.

Title: Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning

Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Subjects: cs.CV, cs.LG, cs.SD, eess.AS, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00175
Pdf URL: https://arxiv.org/pdf/2412.00175
Copy Paste: [[2412.00175]] Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning(https://arxiv.org/abs/2412.00175)
Keywords: robust
Abstract: Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone, we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto such unwanted artifacts, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
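
Illustrative sketch: a leading-silence shortcut of the kind described above can be probed by measuring how long a waveform stays below an amplitude threshold before the first audible sample. The function name, the threshold value, and the toy waveforms are assumptions; the paper does not specify how the silence was measured.

```python
def leading_silence_seconds(samples, sample_rate, threshold=1e-3):
    """Duration of the initial run of samples whose magnitude stays at or below threshold."""
    silent = 0
    for sample in samples:
        if abs(sample) > threshold:
            break
        silent += 1
    return silent / sample_rate


# Toy waveforms at 16 kHz: one starts with 160 near-zero samples, one does not.
sample_rate = 16000
with_silence = [0.0] * 160 + [0.2, -0.3, 0.25]
without_silence = [0.2, -0.3, 0.25, 0.0, 0.0]

silence_a = leading_silence_seconds(with_silence, sample_rate)
silence_b = leading_silence_seconds(without_silence, sample_rate)
```

If a single scalar like this separates the classes of a dataset, a supervised detector trained on that dataset can learn the scalar instead of the manipulation.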

Title: Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Authors: Hui Ren, Joanna Materzynska, Rohit Gandikota, David Bau, Antonio Torralba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00176
Pdf URL: https://arxiv.org/pdf/2412.00176
Copy Paste: [[2412.00176]] Art-Free Generative Models: Art Creation Without Graphic Art Knowledge(https://arxiv.org/abs/2412.00176)
Keywords: generative
Abstract: We explore the question: "How much prior art knowledge is needed to create art?" To investigate this, we propose a text-to-image generation model trained without access to art-related content. We then introduce a simple yet effective method to learn an art adapter using only a few examples of selected artistic styles. Our experiments show that art generated using our method is perceived by users as comparable to art produced by models trained on large, art-rich datasets. Finally, through data attribution techniques, we illustrate how examples from both artistic and non-artistic datasets contributed to the creation of new artistic styles.

Title: LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Authors: Xiaoyan Xing, Konrad Groh, Sezer Karagolu, Theo Gevers, Anand Bhattad
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00177
Pdf URL: https://arxiv.org/pdf/2412.00177
Copy Paste: [[2412.00177]] LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting(https://arxiv.org/abs/2412.00177)
Keywords: diffusion, generative
Abstract: We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy from the StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.

Title: Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation

Authors: Michele De Vita, Vasileios Belagiannis
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00205
Pdf URL: https://arxiv.org/pdf/2412.00205
Copy Paste: [[2412.00205]] Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation(https://arxiv.org/abs/2412.00205)
Keywords: diffusion, generative
Abstract: Despite the remarkable progress in generative modelling, current diffusion models lack a quantitative approach to assess image quality. To address this limitation, we propose to estimate the pixel-wise aleatoric uncertainty during the sampling phase of diffusion models and utilise the uncertainty to improve the sample generation quality. The uncertainty is computed as the variance of the denoising scores with a perturbation scheme that is specifically designed for diffusion models. We then show that the aleatoric uncertainty estimates are related to the second-order derivative of the diffusion noise distribution. We evaluate our uncertainty estimation algorithm and the uncertainty-guided sampling on the ImageNet and CIFAR-10 datasets. In our comparisons with the related work, we demonstrate promising results in filtering out low quality samples. Furthermore, we show that our guided approach leads to better sample generation in terms of FID scores.
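
Illustrative sketch (not the paper's perturbation scheme): a per-pixel uncertainty map can be formed by perturbing the input several times, evaluating the score function on each perturbed copy, and taking the per-pixel variance of the results. The additive Gaussian perturbation, the flat-list image layout, and the toy score functions below are assumptions made for illustration.

```python
import random
import statistics


def pixelwise_score_variance(score_fn, image, sigma=0.05, num_perturbations=32, seed=0):
    """Per-pixel population variance of score_fn outputs over perturbed copies of image."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_perturbations):
        # Assumed perturbation: independent Gaussian noise added to every pixel.
        perturbed = [pixel + rng.gauss(0.0, sigma) for pixel in image]
        outputs.append(score_fn(perturbed))
    # zip(*outputs) groups the scores that belong to the same pixel.
    return [statistics.pvariance(column) for column in zip(*outputs)]


def toy_score(image):
    """Stand-in for a denoiser's score; later pixels react more strongly to the input."""
    return [-(index + 1) * pixel for index, pixel in enumerate(image)]


def constant_score(image):
    """Stand-in whose output ignores the input, so its variance is zero."""
    return [0.0 for _ in image]


image = [0.1, 0.5, 0.9]
uncertainty = pixelwise_score_variance(toy_score, image)
flat = pixelwise_score_variance(constant_score, image)
```

Pixels whose score changes strongly under small input perturbations receive a high variance, which is the quantity a sampler could then use to filter or guide samples.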

Title: Train Once for All: A Transitional Approach for Efficient Aspect Sentiment Triplet Extraction

Authors: Xinmeng Hou, Lingyue Fu, Chenhao Meng, Hai Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00208
Pdf URL: https://arxiv.org/pdf/2412.00208
Copy Paste: [[2412.00208]] Train Once for All: A Transitional Approach for Efficient Aspect Sentiment Triplet Extraction(https://arxiv.org/abs/2412.00208)
Keywords: robust, extraction
Abstract: Aspect-Opinion Pair Extraction (AOPE) and Aspect Sentiment Triplet Extraction (ASTE) have gained significant attention in natural language processing. However, most existing methods use a pipelined framework that extracts aspects/opinions and identifies their relations separately, which leads to error propagation and high time complexity. To address this problem, we propose a transition-based pipeline to mitigate token-level bias and capture position-aware aspect-opinion relations. With the use of a fused dataset and contrastive learning optimization, our model learns robust action patterns and can optimize separate subtasks jointly, often with linear-time complexity. The results show that our model achieves the best performance on both the ASTE and AOPE tasks, outperforming the state-of-the-art methods by at least 6.98% in the F1 measure. The code is available at this https URL.

Title: NüshuRescue: Revitalization of the endangered Nüshu Language with AI

Authors: Ivory Yang, Weicheng Ma, Soroush Vosoughi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00218
Pdf URL: https://arxiv.org/pdf/2412.00218
Copy Paste: [[2412.00218]] NüshuRescue: Revitalization of the endangered Nüshu Language with AI(https://arxiv.org/abs/2412.00218)
Keywords: large language model
Abstract: The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nüshu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NüshuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NüshuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu and only 35 short examples from NCGold, NüshuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nüshu. NüshuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.

Title: MATTER: Multi-stage Adaptive Thermal Trojan for Efficiency & Resilience degradation

Authors: Mehdi Elahi, Mohamed R. Elshamy, Abdel-Hameed Badawy, Mahdi Fazeli, Ahmad Patooghy
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2412.00226
Pdf URL: https://arxiv.org/pdf/2412.00226
Copy Paste: [[2412.00226]] MATTER: Multi-stage Adaptive Thermal Trojan for Efficiency & Resilience degradation(https://arxiv.org/abs/2412.00226)
Keywords: security, attack
Abstract: As mobile systems become more advanced, the security of System-on-Chips (SoCs) is increasingly threatened by thermal attacks. This research introduces a new attack method called the Multi-stage Adaptive Thermal Trojan for Efficiency and Resilience Degradation (MATTER). MATTER takes advantage of weaknesses in Dynamic Thermal Management (DTM) systems by manipulating temperature sensor interfaces, which leads to incorrect thermal sensing and disrupts the SoC's ability to manage heat effectively. Our experiments show that this attack can degrade DTM performance by as much as 73%, highlighting serious vulnerabilities in modern mobile devices. By exploiting the trust placed in temperature sensors, MATTER causes DTM systems to make poor decisions, i.e., failing to activate cooling when needed. This not only affects how well the system works but also threatens the lifespan of the hardware. This paper provides a thorough analysis of how MATTER works and emphasizes the need for stronger thermal management systems in SoCs.

Title: Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data

Authors: Udo Hahn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00230
Pdf URL: https://arxiv.org/pdf/2412.00230
Copy Paste: [[2412.00230]] Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data(https://arxiv.org/abs/2412.00230)
Keywords: privacy
Abstract: We survey clinical document corpora, with focus on German textual data. Due to rigid data privacy legislation in Germany, these resources, with only a few exceptions, are stored in safe clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing, where easy accessibility and reuse of data collections are common practice. Hence, alternative corpus designs have been examined to escape from this data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several other types of domain proxies have come up as substitutes for authentic clinical documents. Common instances of close proxies are medical journal publications, clinical therapy guidelines, drug labels, etc.; more distant proxies include online encyclopedic medical articles or medical contents from social media channels. After PRISM-conformant screening of 359 hits from four bibliographic systems, 75 relevant documents were finally selected for this review and 59 distinct corpora were determined. We identified 24 real clinical corpora (from 40 publications) out of which only 5 are publicly distributable. 2 translations of real corpora and 3 synthetic ones complement the set of clinical corpora. 14 corpora were categorized as close domain proxies, 16 as distant ones. There is a clear divide between the large number of non-accessible authentic clinical German-language corpora and their publicly accessible substitutes: translated or synthetic, close or more distant proxies. So, at first sight, the data bottleneck seems broken. Yet, intuitively, differences in genre-specific writing style, wording and medical domain expertise in this typological space are also obvious. This raises the question of how valid alternative corpus designs really are.

Title: Hybrid Spiking Neural Network -- Transformer Video Classification Model

Authors: Aaron Bateni
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00237
Pdf URL: https://arxiv.org/pdf/2412.00237
Copy Paste: [[2412.00237]] Hybrid Spiking Neural Network -- Transformer Video Classification Model(https://arxiv.org/abs/2412.00237)
Keywords: transformer
Abstract: In recent years, Spiking Neural Networks (SNNs) have garnered significant interest due to their temporal understanding capabilities. This work introduces, to the best of our knowledge, the first cortical-column-like hybrid architecture for the time-series data classification task that leverages SNNs and is inspired by both brain structure and previous hybrid models. We introduce several encoding methods to use with this model. Finally, we develop a procedure for training this network on the training dataset. In an effort to make these models simpler to use, we make all the implementations available to the public.

Title: Twisted Convolutional Networks (TCNs): Enhancing Feature Interactions for Non-Spatial Data Classification

Authors: Junbo Jacob Lian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00238
Pdf URL: https://arxiv.org/pdf/2412.00238
Copy Paste: [[2412.00238]] Twisted Convolutional Networks (TCNs): Enhancing Feature Interactions for Non-Spatial Data Classification(https://arxiv.org/abs/2412.00238)
Keywords: transformer
Abstract: Twisted Convolutional Networks (TCNs) are introduced as a novel neural network architecture designed to effectively process one-dimensional data with arbitrary feature order and minimal spatial relationships. Unlike traditional Convolutional Neural Networks (CNNs), which excel at handling structured two-dimensional data like images, TCNs reduce dependency on feature order by combining input features in innovative ways to create new representations. By explicitly enhancing feature interactions and employing diverse feature combinations, TCNs generate richer and more informative representations, making them especially effective for classification tasks on datasets with arbitrary feature arrangements. This paper details the TCN architecture and its feature combination strategy, providing a comprehensive comparison with traditional CNNs, DeepSets, Transformers, and Graph Neural Networks (GNNs). Extensive experiments on benchmark datasets demonstrate that TCNs achieve superior performance, particularly in classification scenarios involving one-dimensional data.

Title: Robust Testing for Deep Learning using Human Label Noise

Authors: Gordon Lim, Stefan Larson, Kevin Leach
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00244
Pdf URL: https://arxiv.org/pdf/2412.00244
Copy Paste: [[2412.00244]] Robust Testing for Deep Learning using Human Label Noise(https://arxiv.org/abs/2412.00244)
Keywords: robust
Abstract: In deep learning (DL) systems, label noise in training datasets often degrades model performance, as models may learn incorrect patterns from mislabeled data. The area of Learning with Noisy Labels (LNL) has introduced methods to effectively train DL models in the presence of noisily-labeled datasets. Traditionally, these methods are tested using synthetic label noise, where ground truth labels are randomly (and automatically) flipped. However, recent findings highlight that models perform substantially worse under human label noise than synthetic label noise, indicating a need for more realistic test scenarios that reflect noise introduced due to imperfect human labeling. This underscores the need for generating realistic noisy labels that simulate human label noise, enabling rigorous testing of deep neural networks without the need to collect new human-labeled datasets. To address this gap, we present Cluster-Based Noise (CBN), a method for generating feature-dependent noise that simulates human-like label noise. Using insights from our case study of label memorization in the CIFAR-10N dataset, we design CBN to create more realistic tests for evaluating LNL methods. Our experiments demonstrate that current LNL methods perform worse when tested using CBN, highlighting its use as a rigorous approach to testing neural networks. Next, we propose Soft Neighbor Label Sampling (SNLS), a method designed to handle CBN, demonstrating its improvement over existing techniques in tackling this more challenging type of noise.
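
The synthetic baseline the abstract contrasts with, where ground-truth labels are flipped at random, can be sketched as follows. This is a minimal illustration of symmetric (uniform) label noise only, not the paper's CBN or SNLS methods; the function name `flip_labels`, its parameters, and the toy label list are assumptions made for this example.

```python
import random

def flip_labels(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` in which a fraction `noise_rate`
    of entries is replaced by a different, uniformly chosen class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    # Pick distinct positions to corrupt, independent of the features.
    for i in rng.sample(range(len(noisy)), n_flip):
        # Any class except the current one, so every pick is a real flip.
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy

# Hypothetical toy data: 1000 labels over 10 classes.
labels = [i % 10 for i in range(1000)]
noisy = flip_labels(labels, num_classes=10, noise_rate=0.2)
changed = sum(a != b for a, b in zip(labels, noisy))
```

Because the flipped positions are chosen without looking at the inputs, this noise is feature-independent, which is the property that feature-dependent schemes such as the one described above are meant to move away from.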

Title: Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks

Authors: Simon Mielke, Anthony Stein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00256
Pdf URL: https://arxiv.org/pdf/2412.00256
Copy Paste: [[2412.00256]] Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks(https://arxiv.org/abs/2412.00256)
Keywords: robust, transformer
Abstract: Animal excretions in the form of urine puddles and feces are a significant source of emissions in livestock farming. Automated detection of soiled floor in barns can contribute to improved management processes, and the derived information can also be used to model emission dynamics. Previous research approaches to determine the puddle area require manual detection of the puddle in the barn. While humans can detect animal excretions on thermal images of a livestock barn, automated approaches using thresholds fail due to other objects of the same temperature, such as the animals themselves. In addition, various parameters such as the type of housing, animal species, age, sex, weather and unknown factors can influence the type and shape of excretions. Due to this heterogeneity, a method for automated detection of excretions must be not only accurate but also robust to varying conditions. These requirements can be met by using contemporary deep learning models from the field of artificial intelligence. This work is the first to investigate the suitability of different deep learning models for the detection of excretions in pigsties, thereby comparing established convolutional architectures with recent transformer-based approaches. The detection models Faster R-CNN, YOLOv8, DETR and DAB-DETR are compared and statistically assessed on two created training datasets representing two pig houses. We apply a method derived from nested cross-validation and report on the results in terms of eight common detection metrics. Our work demonstrates that all investigated deep learning models are generally suitable for reliably detecting excretions with an average precision of over 90%. The models also show robustness on out-of-distribution data that differs from the conditions in the training data, albeit with expected slight decreases in the overall detection performance.

Title: Facial Expression Recognition with Controlled Privacy Preservation and Feature Compensation

Authors: Feng Xu, David Ahmedt-Aristizabal, Peterson Lars, Dadong Wang, Xun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00277
Pdf URL: https://arxiv.org/pdf/2412.00277
Copy Paste: [[2412.00277]] Facial Expression Recognition with Controlled Privacy Preservation and Feature Compensation(https://arxiv.org/abs/2412.00277)
Keywords: secure, privacy
Abstract: Facial expression recognition (FER) systems raise significant privacy concerns due to the potential exposure of sensitive identity information. This paper presents a study on removing identity information while preserving FER capabilities. Drawing on the observation that low-frequency components predominantly contain identity information and high-frequency components capture expression, we propose a novel two-stream framework that applies privacy enhancement to each component separately. We introduce a controlled privacy enhancement mechanism to optimize performance and a feature compensator to enhance task-relevant features without compromising privacy. Furthermore, we propose a novel privacy-utility trade-off, providing a quantifiable measure of privacy preservation efficacy in closed-set FER tasks. Extensive experiments on the benchmark CREMA-D dataset demonstrate that our framework achieves 78.84% recognition accuracy with a privacy (facial identity) leakage ratio of only 2.01%, highlighting its potential for secure and reliable video-based FER applications.

Title: SS Linear Fusion Model: Hyperspectral Imaging Efficient Spatial and Spectral Linear Model with Bidirectional Feature Learning

Authors: Judy X Yang, Jing Wang, Zekun Long, Chenhong Sui, Jun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00283
Pdf URL: https://arxiv.org/pdf/2412.00283
Copy Paste: [[2412.00283]] SS Linear Fusion Model: Hyperspectral Imaging Efficient Spatial and Spectral Linear Model with Bidirectional Feature Learning(https://arxiv.org/abs/2412.00283)
Keywords: extraction, transformer
Abstract: Classifying hyperspectral images (HSIs) is a complex task in remote sensing due to the high-dimensional nature and volume of data involved. To address these challenges, we propose the Spectral-Spatial Linear (SS Linear) Model, a novel framework that significantly reduces data volume while enhancing classification accuracy. Our model employs a bidirectional reversed convolutional neural network (CNN) to efficiently extract spectral features, complemented by a specialized block for spatial feature analysis. This hybrid approach leverages the operational efficiency of CNNs and incorporates dynamic feature extraction inspired by attention mechanisms, optimizing performance without the high computational demands typically associated with transformer-based models. The SS Linear Model is designed to process hyperspectral data bidirectionally, achieving notable classification and efficiency improvements by fusing spectral and spatial features effectively. This approach yields superior classification accuracy compared to existing benchmarks while maintaining computational efficiency, making it suitable for resource-constrained environments. We validate the SS Linear Model on three widely recognized datasets, Houston 2013, Indian Pines, and Pavia University, demonstrating its ability to outperform current state-of-the-art models in HSI classification and efficiency. This work highlights the innovative methodology of the SS Linear Model and its practical benefits for remote sensing applications, where both data efficiency and classification accuracy are critical. For further details, please refer to our code repository on GitHub: HSILinearModel.

Title: HSLiNets: Hyperspectral Image and LiDAR Data Fusion Using Efficient Dual Linear Feature Learning Networks

Authors: Judy X Yang, Jing Wang, Chen Hong Sui, Zekun Long, Jun Zhou
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00302
Pdf URL: https://arxiv.org/pdf/2412.00302
Copy Paste: [[2412.00302]] HSLiNets: Hyperspectral Image and LiDAR Data Fusion Using Efficient Dual Linear Feature Learning Networks(https://arxiv.org/abs/2412.00302)
Keywords: transformer
Abstract: The integration of hyperspectral imaging (HSI) and LiDAR data within new linear feature spaces offers a promising solution to the challenges posed by the high-dimensionality and redundancy inherent in HSIs. This study introduces a dual linear fused space framework that capitalizes on bidirectional reversed convolutional neural network (CNN) pathways, coupled with a specialized spatial analysis block. This approach combines the computational efficiency of CNNs with the adaptability of attention mechanisms, facilitating the effective fusion of spectral and spatial information. The proposed method not only enhances data processing and classification accuracy, but also mitigates the computational burden typically associated with advanced models such as Transformers. Evaluations of the Houston 2013 dataset demonstrate that our approach surpasses existing state-of-the-art models. This advancement underscores the potential of the framework in resource-constrained environments and its significant contributions to the field of remote sensing.

Title: Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Authors: Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian Price, Scott Cohen, Jianming Zhang, Daniel Aliaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00306
Pdf URL: https://arxiv.org/pdf/2412.00306
Copy Paste: [[2412.00306]] Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment(https://arxiv.org/abs/2412.00306)
Keywords: diffusion, generative
Abstract: Personalized image generation has emerged from the recent advancements in generative models. However, these generated personalized images often suffer from localized artifacts such as incorrect logos, reducing fidelity and fine-grained identity details of the generated results. Furthermore, there is little prior work tackling this problem. To help improve these identity details in the personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that employs a diffusion-based framework to address this challenge. Our model consists of two stages: Alignment Stage and Refinement Stage, which share weights of a unified neural network model. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which are then used by the refinement stage to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, generalizing well to existing models on various tasks including but not limited to customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine details in the image synthesis models.

Title: Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach

Authors: Feiyang Liu, Dan Guo, Jingyuan Xu, Zihao He, Shengeng Tang, Kun Li, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00309
Pdf URL: https://arxiv.org/pdf/2412.00309
Copy Paste: [[2412.00309]] Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach(https://arxiv.org/abs/2412.00309)
Keywords: segmentation
Abstract: Following the gaze of other people and analyzing the target they are looking at can help us understand what they are thinking and doing, and predict the actions that may follow. Existing methods for gaze following struggle to perform well in natural scenes with diverse objects, and focus on gaze points rather than objects, making it difficult to deliver clear semantics and accurate scope of the targets. To address this shortcoming, we propose a novel gaze target prediction solution named GazeSeg that can fully utilize the spatial visual field of the person as guiding information and lead to a progressively coarse-to-fine gaze target segmentation and recognition process. Specifically, a prompt-based visual foundation model serves as the encoder, working in conjunction with three distinct decoding modules (e.g., FoV perception, heatmap generation, and segmentation) to form the framework for gaze target prediction. Then, with the head bounding box serving as an initial prompt, GazeSeg obtains the FoV map, heatmap, and segmentation map progressively, leading to a unified framework for multiple tasks (e.g., direction estimation, gaze target segmentation, and recognition). In particular, to facilitate this research, we construct and release a new dataset, comprising 72k images with pixel-level annotations and 270 categories of gaze targets, built upon the GazeFollow dataset. The quantitative evaluation shows that our approach achieves a Dice score of 0.325 in gaze target segmentation and 71.7% top-5 recognition. Meanwhile, our approach also outperforms previous state-of-the-art methods, achieving 0.953 in AUC on the gaze-following task. The dataset and code will be released.

Title: HiMoE: Heterogeneity-Informed Mixture-of-Experts for Fair Spatial-Temporal Forecasting

Authors: Shaohan Yu, Pan Deng, Yu Zhao, Junting Liu, Zi'ang Wang
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.00316
Pdf URL: https://arxiv.org/pdf/2412.00316
Copy Paste: [[2412.00316]] HiMoE: Heterogeneity-Informed Mixture-of-Experts for Fair Spatial-Temporal Forecasting(https://arxiv.org/abs/2412.00316)
Keywords: fair
Abstract: Spatial-temporal forecasting has various applications in transportation, climate, and human activity domains. Current spatial-temporal forecasting models primarily adopt a macro perspective, focusing on achieving strong overall prediction performance for the entire system. However, most of these models overlook the importance of enhancing the uniformity of prediction performance across different nodes, leading to poor prediction capabilities for certain nodes and rendering some results impractical. This task is particularly challenging due to the inherent heterogeneity of spatial-temporal data. To address this issue, in this paper, we propose a novel Heterogeneity-informed Mixture-of-Experts (HiMoE) for fair spatial-temporal forecasting. Specifically, we design a Heterogeneity-Informed Graph Convolutional Network (HiGCN), integrated into each expert model to enhance the flexibility of the experts. To adapt to the heterogeneity of spatial-temporal data, we design a Node-wise Mixture-of-Experts (NMoE). This model decouples the spatial-temporal prediction task into sub-tasks at the spatial scale, which are then assigned to different experts. To allocate these sub-tasks, we use a mean-based graph decoupling method to distinguish the graph structure for each expert. The results are then aggregated using an output gating mechanism based on a dense Mixture-of-Experts (dMoE). Additionally, fairness-aware loss and evaluation functions are proposed to train the model with uniformity and accuracy as objectives. Experiments conducted on four datasets, encompassing diverse data types and spatial scopes, validate HiMoE's ability to scale across various real-world scenarios. Furthermore, HiMoE consistently outperforms baseline models, achieving superior performance in both accuracy and uniformity.

Title: Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments

Authors: Yasuaki Sumita, Koh Takeuchi, Hisashi Kashima
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00323
Pdf URL: https://arxiv.org/pdf/2412.00323
Copy Paste: [[2412.00323]] Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments(https://arxiv.org/abs/2412.00323)
Keywords: large language model
Abstract: Large Language Models (LLMs) are trained on large corpora written by humans and demonstrate high performance on various tasks. However, as humans are susceptible to cognitive biases, which can result in irrational judgments, LLMs can also be influenced by these biases, leading to irrational decision-making. For example, changing the order of options in multiple-choice questions affects the performance of LLMs due to order bias. In our research, we first conducted an extensive survey of existing studies examining LLMs' cognitive biases and their mitigation. The existing mitigation techniques for LLMs have the disadvantage that they are limited in the types of biases to which they can be applied, or they require lengthy inputs or outputs. We then examined the effectiveness of two mitigation methods for humans, SoPro and AwaRe, when applied to LLMs, inspired by studies in crowdsourcing. To test the effectiveness of these methods, we conducted experiments on GPT-3.5 and GPT-4 to evaluate the influence of six biases on the outputs before and after applying these methods. The results demonstrate that while SoPro has little effect, AwaRe enables LLMs to mitigate the effect of these biases and make more rational responses.

Title: EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Edge Devices

Authors: Meihan Wu, Tao Chang, Cui Miao, Jie Zhou, Chun Li, Xiangyu Xu, Ming Li, Xiaodong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00334
Pdf URL: https://arxiv.org/pdf/2412.00334
Copy Paste: [[2412.00334]] EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Edge Devices(https://arxiv.org/abs/2412.00334)
Keywords: privacy, protect, federate, transformer
Abstract: Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. ViT training demands higher computational resources due to the lack of 2D inductive biases inherent in CNNs. However, efficient federated training of ViTs on resource-constrained edge devices remains unexplored in the community. In this paper, we propose EFTViT, a hierarchical federated framework that leverages masked images to enable efficient, full-parameter training on resource-constrained edge devices, offering substantial benefits for learning on heterogeneous data. In general, we patchify images and randomly mask a portion of the patches, observing that excluding them from training has minimal impact on performance while substantially reducing computation costs and enhancing data content privacy protection. Specifically, EFTViT comprises a series of lightweight local modules and a larger global module, updated independently on clients and the central server, respectively. The local modules are trained on masked image patches, while the global module is trained on intermediate patch features uploaded from the local client, balanced through a proposed median sampling strategy to erase client data distribution privacy. We analyze the computational complexity and privacy protection of EFTViT. Extensive experiments on popular benchmarks show that EFTViT achieves up to 28.17% accuracy improvement, reduces local training computational cost by up to 2.8×, and cuts local training time by up to 4.4× compared to existing methods.

Title: Fusing Physics-Driven Strategies and Cross-Modal Adversarial Learning: Toward Multi-Domain Applications

Authors: Hana Satou, Alan Mitkiy
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00341
Pdf URL: https://arxiv.org/pdf/2412.00341
Copy Paste: [[2412.00341]] Fusing Physics-Driven Strategies and Cross-Modal Adversarial Learning: Toward Multi-Domain Applications(https://arxiv.org/abs/2412.00341)
Keywords: security, robust
Abstract: The convergence of cross-modal adversarial learning and physics-driven methods represents a cutting-edge direction for tackling challenges in complex multi-modal tasks and scientific computing. This review focuses on systematically analyzing how these two approaches can be synergistically integrated to enhance performance and robustness across diverse application domains. By addressing key obstacles such as modality discrepancies, limited data availability, and insufficient model robustness, this paper highlights the role of physics-based optimization frameworks in facilitating efficient and interpretable adversarial perturbation generation. The review also explores significant advancements in cross-modal adversarial learning, including applications in tasks such as image cross-modal retrieval (e.g., infrared and RGB matching), scientific computing (e.g., solving partial differential equations), and optimization under physical consistency constraints in vision systems. By examining theoretical foundations and experimental outcomes, this study demonstrates the potential of combining these approaches to handle complex scenarios and improve the security of multi-modal systems. Finally, we outline future directions, proposing a novel framework that unifies physical principles with adversarial optimization, providing a pathway for researchers to develop robust and adaptable cross-modal learning methods with both theoretical and practical significance.

Title: Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection

Authors: Shanu Kumar, Saish Mendke, Karody Lubna Abdul Rahman, Santosh Kurasa, Parag Agrawal, Sandipan Dandapat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00353
Pdf URL: https://arxiv.org/pdf/2412.00353
Copy Paste: [[2412.00353]] Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection(https://arxiv.org/abs/2412.00353)
Keywords: robust, large language model
Abstract: Chain-of-thought (CoT) prompting has significantly enhanced the capability of large language models (LLMs) by structuring their reasoning processes. However, existing methods face critical limitations: handcrafted demonstrations require extensive human expertise, while trigger phrases are prone to inaccuracies. In this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, a novel approach that improves CoT prompting by utilizing uncertainty estimates to select effective demonstrations without needing access to model parameters. Unlike traditional methods, ZEUS offers high sensitivity in distinguishing between helpful and ineffective questions, ensuring more precise and reliable selection. Our extensive evaluation shows that ZEUS consistently outperforms existing CoT strategies across four challenging reasoning benchmarks, demonstrating its robustness and scalability.

Title: Does Self-Attention Need Separate Weights in Transformers?

Authors: Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00359
Pdf URL: https://arxiv.org/pdf/2412.00359
Copy Paste: [[2412.00359]] Does Self-Attention Need Separate Weights in Transformers?(https://arxiv.org/abs/2412.00359)
Keywords: transformer
Abstract: The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that only learns one weight matrix for (Key, Value, and Query) representations instead of three individual matrices for each of them. Our shared weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy on small tasks of GLUE over the BERT baseline and in particular a generalization power on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. In the GLUE dataset, the shared weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.
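
The reported attention-block reduction can be sanity-checked with a short back-of-the-envelope calculation. The sketch below rests on simplifying assumptions that are not taken from the paper: it counts only square d × d projection matrices for Query, Key, and Value, ignores biases and any extra parameters the shared-weight variant may add, and uses d = 768 (the BERT-base hidden size). The helper `qkv_params` is hypothetical.

```python
def qkv_params(d_model, shared):
    """Count projection weights for Q, K and V.

    Simplified model: one d_model x d_model matrix if shared,
    three such matrices otherwise; biases are ignored.
    """
    per_matrix = d_model * d_model
    return per_matrix if shared else 3 * per_matrix

d = 768  # BERT-base hidden size
separate = qkv_params(d, shared=False)
shared = qkv_params(d, shared=True)
reduction = 1 - shared / separate
print(f"separate={separate}, shared={shared}, reduction={reduction:.2%}")
```

Under these assumptions the reduction is exactly two thirds, close to the 66.53% quoted above; the abstract does not say what accounts for the small difference.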

Title: LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Authors: Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00364
Pdf URL: https://arxiv.org/pdf/2412.00364
Copy Paste: [[2412.00364]] LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2412.00364)
Keywords: extraction, large language model, segmentation
Abstract: It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.

Title: Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment

Authors: Dongfang Zhao
Subjects: cs.LG, cs.AI, math.AG
Abstract URL: https://arxiv.org/abs/2412.00373
Pdf URL: https://arxiv.org/pdf/2412.00373
Copy Paste: [[2412.00373]] Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment(https://arxiv.org/abs/2412.00373)
Keywords: robust
Abstract: Multimodal tasks, such as image-text retrieval and generation, require embedding data from diverse modalities into a shared representation space. Aligning embeddings from heterogeneous sources while preserving shared and modality-specific information is a fundamental challenge. This paper provides an initial attempt to integrate algebraic geometry into multimodal representation learning, offering a foundational perspective for further exploration. We model image and text data as polynomials over discrete rings, $ \mathbb{Z}_{256}[x] $ and $ \mathbb{Z}_{|V|}[x] $, respectively, enabling the use of algebraic tools like fiber products to analyze alignment properties. To accommodate real-world variability, we extend the classical fiber product to an approximate fiber product with a tolerance parameter $ \epsilon $, balancing precision and noise tolerance. We study its dependence on $ \epsilon $, revealing asymptotic behavior, robustness to perturbations, and sensitivity to embedding dimensionality. Additionally, we propose a decomposition of the shared embedding space into orthogonal subspaces, $ Z = Z_s \oplus Z_I \oplus Z_T $, where $ Z_s $ captures shared semantics, and $ Z_I $, $ Z_T $ encode modality-specific features. This decomposition is geometrically interpreted via manifolds and fiber bundles, offering insights into embedding structure and optimization. This framework establishes a principled foundation for analyzing multimodal alignment, uncovering connections between robustness, dimensionality allocation, and algebraic structure. It lays the groundwork for further research on embedding spaces in multimodal learning using algebraic geometry.
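
One way to read the tolerance idea is sketched below for finite sets. The classical fiber product of two maps f: X → Z and g: Y → Z keeps the pairs (x, y) with f(x) = g(y); the relaxation assumed here keeps the pairs whose images lie within ε of each other. The abstract does not state the paper's exact definition, so the Euclidean distance, the function `approx_fiber_product`, and the toy embeddings are all assumptions made for illustration.

```python
import math

def approx_fiber_product(xs, ys, f, g, eps):
    """Pairs (x, y) whose images f(x) and g(y) are within eps
    of each other in Euclidean distance (assumed relaxation)."""
    return [(x, y) for x in xs for y in ys if math.dist(f(x), g(y)) <= eps]

# Hypothetical 2-D embeddings in a shared space Z.
image_emb = {"img_cat": (1.0, 0.0), "img_dog": (0.0, 1.0)}
text_emb = {"a cat": (0.9, 0.1), "a dog": (0.1, 0.9), "a car": (-1.0, 0.0)}

pairs = approx_fiber_product(
    image_emb, text_emb, image_emb.get, text_emb.get, eps=0.2
)
print(pairs)
```

With ε = 0 the construction reduces to exact matching, and the set of retained pairs can only grow as ε increases, which is the precision versus noise-tolerance trade-off the abstract refers to.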

Title: DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation

Authors: Zhaoxing Gan, Guangnan Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00381
Pdf URL: https://arxiv.org/pdf/2412.00381
Copy Paste: [[2412.00381]] DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation(https://arxiv.org/abs/2412.00381)
Keywords: diffusion, generative
Abstract: Layout Generation aims to synthesize plausible arrangements from given elements. Currently, the predominant methods in layout generation are Generative Adversarial Networks (GANs) and diffusion models, each presenting its own set of challenges. GANs typically struggle with handling discrete data due to their requirement for differentiable generated samples and have historically circumvented the direct generation of discrete labels by treating them as fixed conditions. Conversely, diffusion-based models, despite achieving state-of-the-art performance across several metrics, require extensive sampling steps which lead to significant time costs. To address these limitations, we propose DogLayout (Denoising Diffusion GAN Layout model), which integrates a diffusion process into GANs to enable the generation of discrete label data and significantly reduce diffusion's sampling time. Experiments demonstrate that DogLayout considerably reduces sampling costs by up to 175 times and cuts overlap from 16.43 to 9.59 compared to existing diffusion models, while also surpassing GAN based and other layout methods. Code is available at this https URL.

Title: Toward Fair Graph Neural Networks Via Dual-Teacher Knowledge Distillation

Authors: Chengyu Li, Debo Cheng, Guixian Zhang, Yi Li, Shichao Zhang
Subjects: cs.LG, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00382
Pdf URL: https://arxiv.org/pdf/2412.00382
Copy Paste: [[2412.00382]] Toward Fair Graph Neural Networks Via Dual-Teacher Knowledge Distillation(https://arxiv.org/abs/2412.00382)
Keywords: fair
Abstract: Graph Neural Networks (GNNs) have demonstrated strong performance in graph representation learning across various real-world applications. However, they often produce biased predictions caused by sensitive attributes, such as religion or gender, an issue that has been largely overlooked in existing methods. Recently, numerous studies have focused on reducing biases in GNNs. However, these approaches often rely on training with partial data (e.g., using either node features or graph structure alone), which can enhance fairness but frequently compromises model utility due to the limited utilization of available graph information. To address this tradeoff, we propose an effective strategy to balance fairness and utility in knowledge distillation. Specifically, we introduce FairDTD, a novel Fair representation learning framework built on Dual-Teacher Distillation, leveraging a causal graph model to guide and optimize the design of the distillation process. FairDTD employs two fairness-oriented teacher models, a feature teacher and a structure teacher, to facilitate dual distillation, with the student model learning fairness knowledge from the teachers while also leveraging full data to mitigate utility loss. To enhance information transfer, we incorporate graph-level distillation to provide an indirect supplement of graph information during training, as well as a node-specific temperature module to improve the comprehensive transfer of fair knowledge. Experiments on diverse benchmark datasets demonstrate that FairDTD achieves optimal fairness while preserving high model utility, showcasing its effectiveness in fair representation learning for GNNs.

Title: A generalization of Burmester-Desmedt GKE based on a non-abelian finite group action

Authors: Daniel Camazón Portela, Álvaro Otero Sánchez, Juan Antonio López Ramos
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2412.00387
Pdf URL: https://arxiv.org/pdf/2412.00387
Copy Paste: [[2412.00387]] A generalization of Burmester-Desmedt GKE based on a non-abelian finite group action(https://arxiv.org/abs/2412.00387)
Keywords: secure, privacy
Abstract: The advent of large-scale quantum computers implies that our existing public-key cryptography infrastructure has become insecure. That means that the privacy of many mobile applications involving dynamic peer groups, such as multicast messaging or pay-per-view, could be compromised. In this work we propose a generalization of the well known group key exchange protocol proposed by Burmester and Desmedt to the non-abelian case by the use of finite group actions and we prove that the presented protocol is secure in Katz and Yung's model.
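
For orientation, the sketch below runs the classical (abelian) Burmester-Desmedt exchange that this paper generalizes, using modular exponentiation in a prime field. It does not implement the non-abelian group-action variant proposed in the paper. The modulus `P`, the base `G`, and the function name are toy assumptions chosen only so the example runs, and they offer no real security.

```python
# Toy parameters (assumptions for illustration only, not secure choices).
P = 2**127 - 1  # a Mersenne prime used as the modulus
G = 3           # base element


def burmester_desmedt_keys(secrets_r, p=P, g=G):
    """Simulate all parties of the classical Burmester-Desmedt protocol.

    secrets_r[i] is party i's private exponent. Returns the key computed
    by every party; all entries should be equal.
    """
    n = len(secrets_r)
    # Round 1: each party broadcasts z_i = g^{r_i}.
    z = [pow(g, r, p) for r in secrets_r]
    # Round 2: each party broadcasts X_i = (z_{i+1} / z_{i-1})^{r_i}.
    x = [
        pow(z[(i + 1) % n] * pow(z[(i - 1) % n], -1, p) % p, secrets_r[i], p)
        for i in range(n)
    ]
    keys = []
    for i in range(n):
        # K_i = z_{i-1}^{n r_i} * X_i^{n-1} * X_{i+1}^{n-2} * ... * X_{i+n-2}^{1}
        k = pow(z[(i - 1) % n], n * secrets_r[i], p)
        for j in range(n - 1):
            k = k * pow(x[(i + j) % n], n - 1 - j, p) % p
        keys.append(k)
    return keys


demo_secrets = [2, 3, 5, 7]  # fixed small exponents so the run is reproducible
demo_keys = burmester_desmedt_keys(demo_secrets)
```

Every party ends with g raised to the sum of products of neighbouring secrets around the ring, here the exponent 2·3 + 3·5 + 5·7 + 7·2 = 70. Real deployments would draw the exponents at random and use vetted group parameters.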

Title: GradiSeg: Gradient-Guided Gaussian Segmentation with Enhanced 3D Boundary Precision

Authors: Zehao Li, Wenwei Han, Yujun Cai, Hao Jiang, Baolong Bi, Shuqin Gao, Honglong Zhao, Zhaoqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00392
Pdf URL: https://arxiv.org/pdf/2412.00392
Copy Paste: [[2412.00392]] GradiSeg: Gradient-Guided Gaussian Segmentation with Enhanced 3D Boundary Precision(https://arxiv.org/abs/2412.00392)
Keywords: robust, segmentation
Abstract: While 3D Gaussian Splatting enables high-quality real-time rendering, existing Gaussian-based frameworks for 3D semantic segmentation still face significant challenges in boundary recognition accuracy. To address this, we propose a novel 3DGS-based framework named GradiSeg, incorporating Identity Encoding to construct a deeper semantic understanding of scenes. Our approach introduces two key modules: Identity Gradient Guided Densification (IGD) and Local Adaptive K-Nearest Neighbors (LA-KNN). The IGD module supervises gradients of Identity Encoding to refine Gaussian distributions along object boundaries, aligning them closely with boundary contours. Meanwhile, the LA-KNN module employs position gradients to adaptively establish locality-aware propagation of Identity Encodings, preventing irregular Gaussian spreads near boundaries. We validate the effectiveness of our method through comprehensive experiments. Results show that GradiSeg effectively addresses boundary-related issues, significantly improving segmentation accuracy without compromising scene reconstruction quality. Furthermore, our method's robust segmentation capability and decoupled Identity Encoding representation make it highly suitable for various downstream scene editing tasks, including 3D object removal, swapping and so on.

Title: On Foundation Models for Dynamical Systems from Purely Synthetic Data

Authors: Martin Ziegler, Andres Felipe Posada-Moreno, Friedrich Solowjow, Sebastian Trimpe
Subjects: cs.LG, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00395
Pdf URL: https://arxiv.org/pdf/2412.00395
Copy Paste: [[2412.00395]] On Foundation Models for Dynamical Systems from Purely Synthetic Data(https://arxiv.org/abs/2412.00395)
Keywords: robust, transformer
Abstract: Foundation models have demonstrated remarkable generalization, data efficiency, and robustness properties across various domains. In this paper, we explore the feasibility of foundation models for applications in the control domain. The success of these models is enabled by large-scale pretraining on Internet-scale datasets. These are available in fields like natural language processing and computer vision, but do not exist for dynamical systems. We address this challenge by pretraining a transformer-based foundation model exclusively on synthetic data and propose to sample dynamics functions from a reproducing kernel Hilbert space. Our pretrained model generalizes for prediction tasks across different dynamical systems, which we validate in simulation and hardware experiments, including cart-pole and Furuta pendulum setups. Additionally, the model can be fine-tuned effectively to new systems to increase performance even further. Our results demonstrate the feasibility of foundation models for dynamical systems that outperform specialist models in terms of generalization, data efficiency, and robustness.

Title: DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Authors: Yatian Pang, Bin Zhu, Bin Lin, Mingzhe Zheng, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00397
Pdf URL: https://arxiv.org/pdf/2412.00397
Copy Paste: [[2412.00397]] DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses(https://arxiv.org/abs/2412.00397)
Keywords: diffusion
Abstract: In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. Existing approaches struggle with generating coherent, high-quality content in an efficient and user-friendly manner. Concretely, baseline methods relying on only 2D pose guidance lack the cues of 3D information, leading to suboptimal results, while methods using 3D representation as guidance achieve higher quality but involve a cumbersome and time-intensive process. To address these limitations, DreamDance enriches 3D geometry cues from 2D poses by introducing an efficient diffusion model, enabling high-quality human image animation with various guidance. Our key insight is that human images naturally exhibit multiple levels of correlation, progressing from coarse skeleton poses to fine-grained geometry cues, and further from these geometry cues to explicit appearance details. Capturing such correlations could enrich the guidance signals, facilitating intra-frame coherency and inter-frame consistency. Specifically, we construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations, including human pose, depth, and normal maps. Next, we introduce a Mutually Aligned Geometry Diffusion Model to generate fine-grained depth and normal maps for enriched guidance. Finally, a Cross-domain Controller incorporates multi-level guidance to animate human images effectively with a video diffusion model. Extensive experiments demonstrate that our method achieves state-of-the-art performance in animating human images.

Title: Hard-Label Black-Box Attacks on 3D Point Clouds

Authors: Daizong Liu, Yunbo Tao, Pan Zhou, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00404
Pdf URL: https://arxiv.org/pdf/2412.00404
Copy Paste: [[2412.00404]] Hard-Label Black-Box Attacks on 3D Point Clouds(https://arxiv.org/abs/2412.00404)
Keywords: attack
Abstract: With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers simply follow the white-box or black-box setting to iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios (where no model details are provided) as they rely heavily on parameters or output logits of victim models. To this end, we propose point cloud attacks from a more practical setting, i.e., hard-label black-box attack, in which attackers can only access the prediction label of 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary, by developing a learnable spectrum-fusion strategy to adaptively fuse point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary for generating adversarial point clouds with trivial perturbations. Experiments demonstrate that our attack competitively outperforms existing white/black-box attackers in terms of attack performance and adversary quality.

Title: QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities

Authors: Sai Kiran Narayanaswami, Gopalakrishnan Srinivasan, Balaraman Ravindran
Subjects: cs.LG, cs.NE, math.NA
Abstract URL: https://arxiv.org/abs/2412.00408
Pdf URL: https://arxiv.org/pdf/2412.00408
Copy Paste: [[2412.00408]] QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities(https://arxiv.org/abs/2412.00408)
Keywords: transformer
Abstract: As machine learning gets deployed more and more widely, and model sizes continue to grow, improving computational efficiency during model inference has become a key challenge. In many commonly used model architectures, including Transformers, a significant portion of the inference computation is comprised of exponential non-linearities such as Softmax. In this work, we develop QuAKE, a collection of novel operators that leverage certain properties of IEEE-754 floating point representations to quickly approximate the exponential function without requiring specialized hardware, extra memory, or precomputation. We propose optimizations that enhance the efficiency of QuAKE in commonly used exponential non-linearities such as Softmax, GELU, and the Logistic function. Our benchmarks demonstrate substantial inference speed improvements between 10% and 35% on server CPUs, and 5% and 45% on embedded and mobile-scale CPUs for a variety of model architectures and sizes. Evaluations of model performance on standard datasets and tasks from various domains show that QuAKE operators are able to provide sizable speed benefits with little to no loss of performance on downstream tasks.
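
To make the IEEE-754 idea concrete, the sketch below shows a well-known bit-level approximation of the exponential function, commonly attributed to Schraudolph: scale the input so that it lands in the exponent field of a float32, then reinterpret the integer bits as a float. This is an illustrative stand-in and not the QuAKE operators themselves. The function name and the correction constant `_SHIFT` are assumptions.

```python
import math
import struct

_SCALE = (1 << 23) / math.log(2.0)  # 2^23 / ln 2: maps x onto float32 exponent steps
_BIAS = 127 << 23                   # float32 exponent bias, shifted into the exponent field
_SHIFT = 486411                     # assumed correction constant that centres the error


def fast_exp(x: float) -> float:
    """Approximate exp(x) by building the float32 bit pattern directly."""
    if not -87.0 < x < 88.0:
        raise ValueError("x outside the range where the bit trick yields a normal float32")
    bits = int(_SCALE * x) + _BIAS - _SHIFT
    # Reinterpret the 32-bit integer as an IEEE-754 single-precision float.
    return struct.unpack("<f", struct.pack("<i", bits))[0]


approx_e = fast_exp(1.0)
```

The mantissa bits linearly interpolate between powers of two, so the relative error stays within a few percent across the supported range. In this sketch no lookup table or precomputation is involved.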

Title: ACTISM: Threat-informed Dynamic Security Modelling for Automotive Systems

Authors: Shaofei Huang, Christopher M. Poskitt, Lwin Khin Shar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00416
Pdf URL: https://arxiv.org/pdf/2412.00416
Copy Paste: [[2412.00416]] ACTISM: Threat-informed Dynamic Security Modelling for Automotive Systems(https://arxiv.org/abs/2412.00416)
Keywords: security
Abstract: Cybersecurity threats in automotive systems pose significant risks to safety and reliability. This article introduces a methodology integrating threat-informed dynamic security modelling with a Threat Analysis and Risk Assessment workflow. Using the example of an In-Vehicle Infotainment system, we demonstrate the methodology's application in risk management to strengthen automotive resiliency.

Title: TAROT: Targeted Data Selection via Optimal Transport

Authors: Lan Feng, Fan Nie, Yuejiang Liu, Alexandre Alahi
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00420
Pdf URL: https://arxiv.org/pdf/2412.00420
Copy Paste: [[2412.00420]] TAROT: Targeted Data Selection via Optimal Transport(https://arxiv.org/abs/2412.00420)
Keywords: segmentation
Abstract: We propose TAROT, a targeted data selection framework grounded in optimal transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, these heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary factors contributing to this limitation: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions inherent in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, providing a more reliable measure of data influence. Building on this, TAROT uses whitened feature distance to quantify and minimize the optimal transport distance between the selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, highlighting its versatility across various deep learning tasks. Code is available at this https URL.

Title: FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai, Yi-Lun Wu, Hong-Han Shuai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00427
Pdf URL: https://arxiv.org/pdf/2412.00427
Copy Paste: [[2412.00427]] FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting(https://arxiv.org/abs/2412.00427)
Keywords: diffusion
Abstract: In this study, we aim to determine and solve the deficiency of Stable Diffusion Inpainting (SDI) in following the instruction of both prompt and mask. Due to the training bias from masking, the inpainting quality is hindered when the prompt instruction and image condition are not related. Therefore, we conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layer. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency of image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding up to a 60% and 58% improvement of SDI and SDXLI in the CLIP score.

Title: Dynamic Token Selection for Aerial-Ground Person Re-Identification

Authors: Yuhai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00433
Pdf URL: https://arxiv.org/pdf/2412.00433
Copy Paste: [[2412.00433]] Dynamic Token Selection for Aerial-Ground Person Re-Identification(https://arxiv.org/abs/2412.00433)
Keywords: transformer
Abstract: We propose a View-Decoupled Transformer (VDT) framework to address viewpoint discrepancies in person re-identification (ReID), particularly between aerial and ground views. VDT decouples view-specific and view-independent features by leveraging meta and view tokens, processed through self-attention and subtractive separation. Additionally, we introduce a Visual Token Selector (VTS) module that dynamically selects the most informative tokens, reducing redundancy and enhancing efficiency. Our approach significantly improves retrieval performance on the AGPReID dataset, while maintaining computational efficiency similar to baseline models.

Title: Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Authors: Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00440
Pdf URL: https://arxiv.org/pdf/2412.00440
Copy Paste: [[2412.00440]] Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training(https://arxiv.org/abs/2412.00440)
Keywords: interpretability
Abstract: In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming a foundation for various downstream tasks. However, relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm, by updating both diverse data and alignment optimization. To obtain colorful data at low cost, we use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design, and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across over ten benchmarks indicate that our holistic CLIP significantly outperforms existing myopic CLIP on tasks including image-text retrieval, open-vocabulary classification, and dense visual tasks.

Title: Two Models for Surface Segmentation using the Total Variation of the Normal Vector

Authors: Lukas Baumgärtner, Ronny Bergmann, Roland Herzog, Stephan Schmidt, Manuel Weiß
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2412.00445
Pdf URL: https://arxiv.org/pdf/2412.00445
Copy Paste: [[2412.00445]] Two Models for Surface Segmentation using the Total Variation of the Normal Vector(https://arxiv.org/abs/2412.00445)
Keywords: segmentation
Abstract: We consider the problem of surface segmentation, where the goal is to partition a surface represented by a triangular mesh. The segmentation is based on the similarity of the normal vector field to a given set of label vectors. We propose a variational approach and compare two different regularizers, both based on a total variation measure. The first regularizer penalizes the total variation of the assignment function directly, while the second regularizer penalizes the total variation in the label space. In order to solve the resulting optimization problems, we use variations of the split Bregman (ADMM) iteration adapted to the problem at hand. While computationally more expensive, the second regularizer yields better results in our experiments; in particular, it removes noise more reliably in regions of constant curvature.

Title: ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Authors: Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00447
Pdf URL: https://arxiv.org/pdf/2412.00447
Copy Paste: [[2412.00447]] ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models(https://arxiv.org/abs/2412.00447)
Keywords: large language model
Abstract: Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated this by pruning tokens using a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of pruning ratio varies across different LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise vision token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce an Adaptive Token Pruning (ATP) module, which computes the importance score and pruning threshold based on input instance adaptively. The ATP module can be seamlessly integrated between any two LLM layers with negligible computational overhead. Additionally, we develop a Spatial Augmented Pruning (SAP) strategy that prunes visual tokens with both token redundancy and spatial modeling perspectives. Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks. The project page can be accessed via this https URL.

Title: A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge

Authors: Atharva Deshpande, Kaushik Gopalan, Jeet Shah, Hrishikesh Simu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00451
Pdf URL: https://arxiv.org/pdf/2412.00451
Copy Paste: [[2412.00451]] A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge(https://arxiv.org/abs/2412.00451)
Keywords: generative
Abstract: This study explores the application of deep learning for rainfall prediction, leveraging the Spinning Enhanced Visible and Infrared Imager (SEVIRI) High rate information transmission (HRIT) data as input and the Operational Program on the Exchange of weather RAdar information (OPERA) ground-radar reflectivity data as ground truth. We use the mean of 4 InfraRed frequency channels as the input. The radiance images are forecasted up to 4 hours into the future using a dense optical flow algorithm. A conditional generative adversarial network (GAN) model is employed to transform the predicted radiance images into rainfall images which are aggregated over the 4 hour forecast period to generate cumulative rainfall values. This model scored a value of approximately 7.5 as the Continuous Ranked Probability Score (CRPS) in the Weather4Cast 2024 competition and placed 1st on the core challenge leaderboard.
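
For readers unfamiliar with the metric named above, the sketch below computes the standard ensemble estimator of the Continuous Ranked Probability Score for a single scalar observation. How the Weather4Cast challenge aggregates CRPS over pixels, time steps, and forecasts is not stated in the abstract, so the function name and the scalar setting are assumptions.

```python
def ensemble_crps(members, observation):
    """CRPS estimate for an ensemble forecast of one scalar value.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X'
    are independent draws from the forecast ensemble and y is the observation.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    # Mean absolute error of the members against the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Mean absolute difference between all ordered pairs of members.
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * spread


score = ensemble_crps([1.0, 2.0, 3.0], 2.0)
```

Lower values are better. With a single ensemble member the spread term vanishes and the score reduces to the absolute error.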

Title: Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels

Authors: Yuxin Tian, Mouxing Yang, Yuhao Zhou, Jian Wang, Qing Ye, Tongliang Liu, Gang Niu, Jiancheng Lv
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.00452
Pdf URL: https://arxiv.org/pdf/2412.00452
Copy Paste: [[2412.00452]] Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels(https://arxiv.org/abs/2412.00452)
Keywords: robust, federate
Abstract: The success of most federated learning (FL) methods heavily depends on label quality, which is often inaccessible in real-world scenarios, such as medicine, leading to the federated label-noise (F-LN) problem. In this study, we observe that the global model of FL memorizes the noisy labels slowly. Based on the observations, we propose a novel approach dubbed Global Reviser for Federated Learning with Noisy Labels (FedGR) to enhance the label-noise robustness of FL. In brief, FedGR employs three novel modules to achieve noisy label sniffing and refining, local knowledge revising, and local model regularization. Specifically, the global model is adopted to infer local data proxies for global sample selection and refine incorrect labels. To maximize the utilization of local knowledge, we leverage the global model to revise the local exponential moving average (EMA) model of each client and distill it into the clients' models. Additionally, we introduce a global-to-local representation regularization to mitigate the overfitting of noisy labels. Extensive experiments on three F-LNL benchmarks against seven baseline methods demonstrate the effectiveness of the proposed FedGR.

Title: Non-native speakers of English or ChatGPT: Who thinks better?

Authors: Mohammed Q. Shormani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00457
Pdf URL: https://arxiv.org/pdf/2412.00457
Copy Paste: [[2412.00457]] Non-native speakers of English or ChatGPT: Who thinks better?(https://arxiv.org/abs/2412.00457)
Keywords: large language model
Abstract: This study sets out to answer one major question: Who thinks better, non-native speakers of English or ChatGPT? It provides evidence from processing and interpreting center-embedding English constructions that the human brain surpasses ChatGPT, and that ChatGPT cannot be regarded as a theory of language. Fifteen non-native speakers of English were recruited as participants of the study. A center-embedding English sentence was presented to both the study participants and ChatGPT. The study findings reveal that the human brain is still far ahead of Large Language Models, specifically ChatGPT, even in the case of non-native speakers of an L2, here English. The study concludes that the human brain's ability to process and interpret natural language data is unique and that ChatGPT still lags behind this uniquely human ability.

Title: BGM: Background Mixup for X-ray Prohibited Items Detection

Authors: Weizhe Liu, Renshuai Tao, Hongguang Zhu, Yunda Sun, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00460
Pdf URL: https://arxiv.org/pdf/2412.00460
Copy Paste: [[2412.00460]] BGM: Background Mixup for X-ray Prohibited Items Detection(https://arxiv.org/abs/2412.00460)
Keywords: security
Abstract: Prohibited item detection is crucial for ensuring public safety, yet current X-ray image-based detection methods often lack comprehensive data-driven exploration. This paper introduces a novel data augmentation approach tailored for prohibited item detection, leveraging unique characteristics inherent to X-ray imagery. Our method is motivated by observations of physical properties including: 1) X-ray Transmission Imagery: Unlike reflected light images, transmitted X-ray pixels represent composite information from multiple materials along the imaging path. 2) Material-based Pseudo-coloring: Pseudo-color rendering in X-ray images correlates directly with material properties, aiding in material distinction. Building on a novel perspective from physical properties, we propose a simple yet effective X-ray image augmentation technique, Background Mixup (BGM), for prohibited item detection in security screening contexts. The essence is the rich background simulation of X-ray images to induce the model to increase its attention to the foreground. The approach introduces 1) contour information of baggage and 2) variation of material information into the original image by Mixup at patch level. Background Mixup is plug-and-play, parameter-free, highly generalizable and provides an effective solution to the limitations of classical visual augmentations in non-reflected light imagery. When implemented with different high-performance detectors, our augmentation method consistently boosts performance across diverse X-ray datasets from various devices and environments. Extensive experimental results demonstrate that our approach surpasses strong baselines while maintaining similar training resources.

Title: AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models

Authors: Yutong Zhou, Masahiro Ryo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00465
Pdf URL: https://arxiv.org/pdf/2412.00465
Copy Paste: [[2412.00465]] AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models(https://arxiv.org/abs/2412.00465)
Keywords: large language model, segmentation
Abstract: We introduce AgriBench, the first agriculture benchmark designed to evaluate MultiModal Large Language Models (MM-LLMs) for agriculture applications. To further address the agriculture knowledge-based dataset limitation problem, we propose MM-LUCAS, a multimodal agriculture dataset that includes 1,784 landscape images, segmentation masks, depth maps, and detailed annotations (geographical location, country, date, land cover and land use taxonomic details, quality scores, aesthetic scores, etc.), based on the Land Use/Cover Area Frame Survey (LUCAS) dataset, which contains comparable statistics on land use and land cover for the European Union (EU) territory. This work presents a groundbreaking perspective in advancing agriculture MM-LLMs and is still in progress, offering valuable insights for future developments and innovations in specific expert knowledge-based MM-LLMs.

Title: Enhancing Skin Cancer Diagnosis (SCD) Using Late Discrete Wavelet Transform (DWT) and New Swarm-Based Optimizers

Authors: Ramin Mousa, Saeed Chamani, Mohammad Morsali, Mohammad Kazzazi, Parsa Hatami, Soroush Sarabi
Subjects: cs.CV, cs.LG, cs.NE, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00472
Pdf URL: https://arxiv.org/pdf/2412.00472
Copy Paste: [[2412.00472]] Enhancing Skin Cancer Diagnosis (SCD) Using Late Discrete Wavelet Transform (DWT) and New Swarm-Based Optimizers(https://arxiv.org/abs/2412.00472)
Keywords: extraction
Abstract: Skin cancer (SC) stands out as one of the most life-threatening forms of cancer, with its danger amplified if not diagnosed and treated promptly. Early intervention is critical, as it allows for more effective treatment approaches. In recent years, Deep Learning (DL) has emerged as a powerful tool for early detection and skin cancer diagnosis (SCD). Although DL seems promising for the diagnosis of skin cancer, ample scope remains for improving model efficiency and accuracy. This paper proposes a novel approach to skin cancer detection, utilizing optimization techniques in conjunction with pre-trained networks and wavelet transformations. First, normalized images are passed through pre-trained networks such as DenseNet-121, Inception, Xception, and MobileNet to extract hierarchical features. After feature extraction, the feature maps are passed through a Discrete Wavelet Transform (DWT) layer to capture low- and high-frequency components. Then a self-attention module is integrated to learn global dependencies between features and focus on the most relevant parts of the feature maps. The number of neurons and the weight vectors are optimized using three new swarm-based optimization techniques: Modified Gorilla Troops Optimizer (MGTO), Improved Gray Wolf Optimization (IGWO), and the Fox optimization algorithm. Evaluation results demonstrate that optimizing weight vectors with these algorithms can enhance diagnostic accuracy, making this a highly effective approach for SCD. The proposed method demonstrates substantial improvements in accuracy, achieving top rates of 98.11% with the MobileNet + Wavelet + Fox and DenseNet + Wavelet + Fox combinations on the ISIC-2016 dataset and 97.95% with the Inception + Wavelet + MGTO combination on the ISIC-2017 dataset, improving accuracy by at least 1% compared to other methods.
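
To make the low/high-frequency split of a DWT layer concrete, here is a minimal one-level Haar transform on a 1D signal. The abstract does not say which wavelet family is used or how the 2D feature maps are handled, so the Haar choice and the function name `haar_dwt_1d` are assumptions for illustration only.

```python
import math

def haar_dwt_1d(signal):
    """One-level Haar DWT of an even-length sequence of numbers.

    Returns (low, high): the approximation (low-frequency) and detail
    (high-frequency) coefficients, each half the input length.
    """
    scale = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) * scale for a, b in pairs]    # scaled local averages
    high = [(a - b) * scale for a, b in pairs]   # scaled local differences
    return low, high
```

A 2D version would apply the same step along rows and then columns of each feature map, producing four sub-bands per channel.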

Title: Jailbreak Large Visual Language Models Through Multi-Modal Linkage

Authors: Yu Wang, Xiaofei Zhou, Yichen Wang, Geyuan Zhang, Tianxing He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00473
Pdf URL: https://arxiv.org/pdf/2412.00473
Copy Paste: [[2412.00473]] Jailbreak Large Visual Language Models Through Multi-Modal Linkage(https://arxiv.org/abs/2412.00473)
Keywords: attack, steal
Abstract: With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing methods struggle against state-of-the-art VLMs like GPT-4o, due to the over-exposure of harmful content and lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. To align the model's output with malicious intent covertly, MML employs a technique called "evil alignment", framing the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset. Our code is available via the link on the abstract page.

Title: Automatic Differentiation-based Full Waveform Inversion with Flexible Workflows

Authors: Feng Liu, Haipeng Li, Guangyuan Zou, Junlun Li
Subjects: cs.LG, eess.SP, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2412.00486
Pdf URL: https://arxiv.org/pdf/2412.00486
Copy Paste: [[2412.00486]] Automatic Differentiation-based Full Waveform Inversion with Flexible Workflows(https://arxiv.org/abs/2412.00486)
Keywords: robust
Abstract: Full waveform inversion (FWI) is able to construct high-resolution subsurface models by iteratively minimizing discrepancies between observed and simulated seismic data. However, its implementation can be rather involved for complex wave equations, objective functions, or regularization. Recently, automatic differentiation (AD) has proven to be effective in simplifying solutions of various inverse problems, including FWI. In this study, we present an open-source AD-based FWI framework (ADFWI), which is designed to simplify the design, development, and evaluation of novel approaches in FWI with flexibility. The AD-based framework not only includes forward modeling and associated gradient computations for wave equations in various types of media from isotropic acoustic to vertically or horizontally transverse isotropic elastic, but also incorporates a suite of objective functions, regularization techniques, and optimization algorithms. By leveraging state-of-the-art AD, objective functions such as soft dynamic time warping and Wasserstein distance, which are difficult to apply in traditional FWI, are also easily integrated into ADFWI. In addition, ADFWI is integrated with deep learning for implicit model reparameterization via neural networks, which not only introduces learned regularization but also allows rapid estimation of uncertainty through dropout. To manage high memory demands in large-scale inversion associated with AD, the proposed framework adopts strategies such as mini-batch and checkpointing. Through comprehensive evaluations, we demonstrate the novelty, practicality and robustness of ADFWI, which can be used to address challenges in FWI and as a workbench for prompt experiments and the development of new inversion strategies.

Title: Density-aware Global-Local Attention Network for Point Cloud Segmentation

Authors: Chade Li, Pengju Zhang, Yihong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00489
Pdf URL: https://arxiv.org/pdf/2412.00489
Copy Paste: [[2412.00489]] Density-aware Global-Local Attention Network for Point Cloud Segmentation(https://arxiv.org/abs/2412.00489)
Keywords: segmentation
Abstract: 3D point cloud segmentation has a wide range of applications in areas such as autonomous driving, augmented reality, virtual reality and digital twins. The point cloud data collected in real scenes often contain small objects and categories with small sample sizes, which are difficult to handle by existing networks. In this regard, we propose a point cloud segmentation network that fuses local attention based on density perception with global attention. The core idea is to increase the effective receptive field of each point while reducing the loss of information about small objects in dense areas. Specifically, we divide different sized windows for local areas with different densities to compute attention within the window. Furthermore, we consider each local area as an independent token for the global attention of the entire input. A category-response loss is also proposed to balance the processing of different categories and sizes of objects. In particular, we set up an additional fully connected layer in the middle of the network for prediction of the presence of object categories, and construct a binary cross-entropy loss to respond to the presence of categories in the scene. In experiments, our method achieves competitive results in semantic segmentation and part segmentation tasks on several publicly available datasets. Experiments on point cloud data obtained from complex real-world scenes filled with tiny objects also validate the strong segmentation capability of our method for small objects as well as small sample categories.
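
The category-presence term described above can be sketched as a multi-label binary cross-entropy. The abstract only states that a fully connected layer predicts which categories are present and that a binary cross-entropy loss is built on it; the helper names `presence_targets` and `category_presence_bce`, and the use of raw logits with mean reduction, are assumptions.

```python
import math

def presence_targets(point_labels, num_classes):
    """Return 1.0 for every class that occurs among the scene's point labels."""
    seen = set(point_labels)
    return [1.0 if c in seen else 0.0 for c in range(num_classes)]

def category_presence_bce(logits, targets):
    """Mean binary cross-entropy over classes, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

In training, this term would be added to the per-point segmentation loss with some weight, which the abstract does not specify.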

Title: Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Authors: Duo Zheng, Shijia Huang, Liwei Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00493
Pdf URL: https://arxiv.org/pdf/2412.00493
Copy Paste: [[2412.00493]] Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding(https://arxiv.org/abs/2412.00493)
Keywords: large language model
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. Additionally, we have implemented a maximum coverage sampling technique to optimize the balance between computational costs and performance efficiency. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.

Title: Distributed Differentially Private Data Analytics via Secure Sketching

Authors: Jakob Burkhardt, Hannah Keller, Claudio Orlandi, Chris Schwiegelshohn
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00497
Pdf URL: https://arxiv.org/pdf/2412.00497
Copy Paste: [[2412.00497]] Distributed Differentially Private Data Analytics via Secure Sketching(https://arxiv.org/abs/2412.00497)
Keywords: secure, privacy
Abstract: We explore the use of distributed differentially private computations across multiple servers, balancing the tradeoff between the error introduced by the differentially private mechanism and the computational efficiency of the resulting distributed algorithm. We introduce the linear-transformation model, where clients have access to a trusted platform capable of applying a public matrix to their inputs. Such computations can be securely distributed across multiple servers using simple and efficient secure multiparty computation techniques. The linear-transformation model serves as an intermediate model between the highly expressive central model and the minimal local model. In the central model, clients have access to a trusted platform capable of applying any function to their inputs. However, this expressiveness comes at a cost, as it is often expensive to distribute such computations, leading to the central model typically being implemented by a single trusted server. In contrast, the local model assumes no trusted platform, which forces clients to add significant noise to their data. The linear-transformation model avoids the single point of failure for privacy present in the central model, while also mitigating the high noise required in the local model. We demonstrate that linear transformations are very useful for differential privacy, allowing for the computation of linear sketches of input data. These sketches largely preserve utility for tasks such as private low-rank approximation and private ridge regression, while introducing only minimal error, critically independent of the number of clients. Previously, such accuracy had only been achieved in the more expressive central model.
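
Why a public linear map is easy to distribute can be shown with additive secret sharing: each server applies the matrix to its own shares, and only the combined result reveals the sketch of the summed inputs. This is a minimal sketch under stated assumptions, not the paper's protocol. The modulus `P`, the function names, and the integer encoding are hypothetical, and the differentially private noise is deliberately left out.

```python
import random

P = 2**61 - 1  # prime modulus for the shares (an assumption of this sketch)

def share(vec, n_servers, rng):
    """Split an integer vector into n additive shares modulo P."""
    shares = [[rng.randrange(P) for _ in vec] for _ in range(n_servers - 1)]
    last = [(v - sum(s[j] for s in shares)) % P for j, v in enumerate(vec)]
    return shares + [last]

def apply_matrix(matrix, vec):
    """Apply the public matrix to a vector, working modulo P."""
    return [sum(a * v for a, v in zip(row, vec)) % P for row in matrix]

def distributed_sketch(matrix, client_inputs, n_servers=3, seed=0):
    """Compute matrix @ sum(client_inputs) without any server seeing an input."""
    rng = random.Random(seed)
    dim = len(client_inputs[0])
    server_sums = [[0] * dim for _ in range(n_servers)]
    for x in client_inputs:
        for s, sh in enumerate(share(x, n_servers, rng)):
            server_sums[s] = [(a + b) % P for a, b in zip(server_sums[s], sh)]
    # Each server transforms its own aggregate share locally (linearity).
    partial = [apply_matrix(matrix, agg) for agg in server_sums]
    # Only the recombined output is revealed.
    return [sum(col) % P for col in zip(*partial)]
```

A real deployment would use a cryptographically secure random source instead of `random.Random` and would add calibrated noise before the result is released.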

Title: Homeostazis and Sparsity in Transformer

Authors: Leonid Kotyuzanskiy, Artem Klimov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00503
Pdf URL: https://arxiv.org/pdf/2412.00503
Copy Paste: [[2412.00503]] Homeostazis and Sparsity in Transformer(https://arxiv.org/abs/2412.00503)
Keywords: transformer
Abstract: The transformer architecture has become an integral part of the field of modern neural networks, playing a crucial role in a variety of tasks, such as text generation, machine translation, image and audio processing, among others. There is also an alternative approach to building intelligent systems, proposed by Jeff Hawkins and inspired by the processes occurring in the neocortex. In our article we want to combine some of these ideas and propose the use of homeostasis mechanisms, such as RFB-kWTA and "Smart" Inhibition, in the attention mechanism of the transformer and at the output of the transformer block, as well as conduct an experiment involving the introduction of sparse distributed representations of the transformer at various points. RFB-kWTA utilizes statistics of layer activations across time to adjust the entire layer, enhancing the values of rare activations while reducing those of frequent ones. "Smart" Inhibition also uses activation statistics to sample sparsity masks, with rarely active units being more likely to be activated. Our proposed mechanisms significantly outperform the classical transformer (0.2768 BLEU) and a model that only makes use of dropout in the attention mechanism and at the output of the transformer block (0.3007 BLEU), achieving a score of 0.3062 BLEU on the Multi30K dataset.
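
The general idea of boosting rare activations before a k-winners-take-all (kWTA) selection can be sketched as follows. This is a generic boosted kWTA in the spirit of Hawkins-style models, not the paper's RFB-kWTA formula, which the abstract does not give. The exponential boost, `beta`, `momentum`, and the function name are assumptions, and activations are assumed to be non-negative.

```python
import math

def boosted_kwta(activations, duty_cycle, k, beta=1.0, momentum=0.9):
    """Keep the k units with the largest boosted activation, zero the rest.

    `duty_cycle[i]` is a running estimate of how often unit i has been a
    winner. Units that win rarely are boosted, frequent winners are damped.
    Returns (sparse_output, updated_duty_cycle).
    """
    n = len(activations)
    target = k / n  # duty cycle each unit would have if wins were uniform
    boosted = [a * math.exp(-beta * (d - target))
               for a, d in zip(activations, duty_cycle)]
    order = sorted(range(n), key=lambda i: boosted[i], reverse=True)
    winners = set(order[:k])
    output = [a if i in winners else 0.0 for i, a in enumerate(activations)]
    new_duty = [momentum * d + (1.0 - momentum) * (1.0 if i in winners else 0.0)
                for i, d in enumerate(duty_cycle)]
    return output, new_duty
```

With `beta=0` the function reduces to plain kWTA, which makes the effect of the activation statistics easy to isolate.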

Title: Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion

Authors: Jona Ballé, Luca Versari, Emilien Dupont, Hyunjik Kim, Matthias Bauer
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00505
Pdf URL: https://arxiv.org/pdf/2412.00505
Copy Paste: [[2412.00505]] Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion(https://arxiv.org/abs/2412.00505)
Keywords: generative
Abstract: Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.

Title: Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Authors: Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2412.00508
Pdf URL: https://arxiv.org/pdf/2412.00508
Copy Paste: [[2412.00508]] Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence(https://arxiv.org/abs/2412.00508)
Keywords: generative
Abstract: Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes but their effectiveness on industry relevant case studies still needs to be investigated.

Title: Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects

Authors: Amir Barda, Matheus Gadelha, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, Thibault Groueix
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00518
Pdf URL: https://arxiv.org/pdf/2412.00518
Copy Paste: [[2412.00518]] Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects(https://arxiv.org/abs/2412.00518)
Keywords: diffusion, generative
Abstract: We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in approximately 3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor of training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.

Title: Human Action CLIPS: Detecting AI-generated Human Motion

Authors: Matyas Bohacek, Hany Farid
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00526
Pdf URL: https://arxiv.org/pdf/2412.00526
Copy Paste: [[2412.00526]] Human Action CLIPS: Detecting AI-generated Human Motion(https://arxiv.org/abs/2412.00526)
Keywords: robust
Abstract: Full-blown AI-generated video generation continues its journey through the uncanny valley to produce content that is perceptually indistinguishable from reality. Intermixed with many exciting and creative applications are malicious applications that harm individuals, organizations, and democracies. We describe an effective and robust technique for distinguishing real from AI-generated human motion. This technique leverages a multi-modal semantic embedding, making it robust to the types of laundering that typically confound more low- to mid-level approaches. This method is evaluated against a custom-built dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage.

Title: Exact Certification of (Graph) Neural Networks Against Label Poisoning

Authors: Mahalakshmi Sabanayagam, Lukas Gosch, Stephan Günnemann, Debarghya Ghoshdastidar
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.00537
Pdf URL: https://arxiv.org/pdf/2412.00537
Copy Paste: [[2412.00537]] Exact Certification of (Graph) Neural Networks Against Label Poisoning(https://arxiv.org/abs/2412.00537)
Keywords: attack, robust
Abstract: Machine learning models are highly vulnerable to label flipping, i.e., the adversarial modification (poisoning) of training labels to compromise performance. Thus, deriving robustness certificates is important to guarantee that test predictions remain unaffected and to understand worst-case robustness behavior. However, for Graph Neural Networks (GNNs), the problem of certifying label flipping has so far been unsolved. We change this by introducing an exact certification method, deriving both sample-wise and collective certificates. Our method leverages the Neural Tangent Kernel (NTK) to capture the training dynamics of wide networks enabling us to reformulate the bilevel optimization problem representing label flipping into a Mixed-Integer Linear Program (MILP). We apply our method to certify a broad range of GNN architectures in node classification tasks. Thereby, concerning the worst-case robustness to label flipping: (i) we establish hierarchies of GNNs on different benchmark graphs; (ii) quantify the effect of architectural choices such as activations, depth and skip-connections; and surprisingly, (iii) uncover a novel phenomenon of the robustness plateauing for intermediate perturbation budgets across all investigated datasets and architectures. While we focus on GNNs, our certificates are applicable to sufficiently wide NNs in general through their NTK. Thus, our work presents the first exact certificate to a poisoning attack ever derived for neural networks, which could be of independent interest.

Title: TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences

Authors: Bastián González-Bustamante
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00539
Pdf URL: https://arxiv.org/pdf/2412.00539
Copy Paste: [[2412.00539]] TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences(https://arxiv.org/abs/2412.00539)
Keywords: fair, transformer
Abstract: The TextClass Benchmark project is an ongoing, continuous benchmarking process that aims to provide a comprehensive, fair, and dynamic evaluation of LLMs and transformers for text classification tasks. This evaluation spans various domains and languages in social sciences disciplines engaged in NLP and text-as-data approaches. The leaderboards present performance metrics and relative ranking using a tailored Elo rating system. With each leaderboard cycle, novel models are added, fixed test sets can be replaced with unseen, equivalent data to test generalisation power, ratings are updated, and a Meta-Elo leaderboard combines and weights domain-specific leaderboards. This article presents the rationale and motivation behind the project, explains the Elo rating system in detail, and estimates Meta-Elo across different classification tasks in social science disciplines. We also present a snapshot of the first cycle of classification tasks on incivility data in Chinese, English, German and Russian. This ongoing benchmarking process includes not only additional languages such as Arabic, Hindi, and Spanish but also classification of policy agenda topics and misinformation, among others.
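
For readers unfamiliar with Elo ratings, the standard pairwise update is sketched below. The abstract says the project uses a tailored Elo system and a weighted Meta-Elo without giving their formulas, so this shows only the classic rule. The K-factor of 32 and the 400-point scale are the conventional chess defaults, assumed here for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one comparison.

    `score_a` is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses. In a model
    benchmark, a "win" could mean scoring higher on a metric such as F1.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

For example, two models that both start at 1500 move to 1516 and 1484 after a single decisive comparison.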

Title: Evaluating the Consistency of LLM Evaluators

Authors: Noah Lee, Jiwoo Hong, James Thorne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00543
Pdf URL: https://arxiv.org/pdf/2412.00543
Copy Paste: [[2412.00543]] Evaluating the Consistency of LLM Evaluators(https://arxiv.org/abs/2412.00543)
Keywords: large language model
Abstract: Large language models (LLMs) have shown potential as general evaluators along with the evident benefits of speed and cost. While their correlation against human annotators has been widely studied, consistency as evaluators is still understudied, raising concerns about the reliability of LLM evaluators. In this paper, we conduct extensive studies on the two aspects of consistency in LLM evaluations, Self-Consistency (SC) and Inter-scale Consistency (IC), on different scoring scales and criterion granularity with open-source and proprietary models. Our comprehensive analysis demonstrates that strong proprietary models are not necessarily consistent evaluators, highlighting the importance of considering consistency in assessing the capability of LLM evaluators.

Title: RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification

Authors: Daniel Kyselica, Marek Šuppa, Jiří Šilha, Roman Ďurikovič
Subjects: cs.CV, astro-ph.IM, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00544
Pdf URL: https://arxiv.org/pdf/2412.00544
Copy Paste: [[2412.00544]] RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification(https://arxiv.org/abs/2412.00544)
Keywords: robust, transformer
Abstract: Space debris presents a critical challenge for the sustainability of future space missions, emphasizing the need for robust and standardized identification methods. However, a comprehensive benchmark for rocket body classification remains absent. This paper addresses this gap by introducing the RoBo6 dataset for rocket body classification based on light curves. The dataset, derived from the Mini Mega Tortora database, includes light curves for six rocket body classes: CZ-3B, Atlas 5 Centaur, Falcon 9, H-2A, Ariane 5, and Delta 4. With 5,676 training and 1,404 test samples, it addresses data inconsistencies using resampling, normalization, and filtering techniques. Several machine learning models were evaluated, including CNN and transformer-based approaches, with Astroconformer reporting the best performance. The dataset establishes a common benchmark for future comparisons and advancements in rocket body classification tasks.

Title: Rank It, Then Ask It: Input Reranking for Maximizing the Performance of LLMs on Symmetric Tasks

Authors: Mohsen Dehghankar, Abolfazl Asudeh
Subjects: cs.LG, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2412.00546
Pdf URL: https://arxiv.org/pdf/2412.00546
Copy Paste: [[2412.00546]] Rank It, Then Ask It: Input Reranking for Maximizing the Performance of LLMs on Symmetric Tasks(https://arxiv.org/abs/2412.00546)
Keywords: large language model
Abstract: Large language models (LLMs) have quickly emerged as practical and versatile tools that provide new solutions for a wide range of domains. In this paper, we consider the application of LLMs on symmetric tasks where a query is asked on an (unordered) bag of elements. Examples of such tasks include answering aggregate queries on a database table. In general, when the bag contains a large number of elements, LLMs tend to overlook some elements, leading to challenges in generating accurate responses to the query. LLMs receive their inputs as ordered sequences. However, in this problem, we leverage the fact that the symmetric input is not ordered, and reordering should not affect the LLM's response. Observing that LLMs are less likely to miss elements at certain positions of the input, we introduce the problem of LLM input reranking: to find a ranking of the input that maximizes the LLM's accuracy for the given query without making explicit assumptions about the query. Finding the optimal ranking requires identifying (i) the relevance of each input element for answering the query and (ii) the importance of each rank position for the LLM's attention. We develop algorithms for estimating these values efficiently utilizing a helper LLM. We conduct comprehensive experiments on different synthetic and real datasets to validate our proposal and to evaluate the effectiveness of our proposed algorithms. Our experiments confirm that our reranking approach improves the accuracy of the LLMs on symmetric tasks by up to 99% proximity to the optimum upper bound.
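
Once the two quantities above are estimated, one natural way to build the ranking is to place the most relevant elements at the positions the model attends to most. The sketch below assumes the objective is the sum of relevance times position importance, which this sorted matching maximizes by the rearrangement inequality. The abstract does not state the objective, so that objective, the name `rerank`, and the sample scores are assumptions.

```python
def rerank(elements, relevance, exposure):
    """Order `elements` so that higher relevance lands on higher-exposure slots.

    relevance[i] is the estimated relevance of elements[i] to the query;
    exposure[p] is the estimated importance of input position p.
    """
    by_relevance = sorted(range(len(elements)),
                          key=lambda i: relevance[i], reverse=True)
    by_exposure = sorted(range(len(exposure)),
                         key=lambda p: exposure[p], reverse=True)
    ordered = [None] * len(elements)
    for element_idx, position in zip(by_relevance, by_exposure):
        ordered[position] = elements[element_idx]
    return ordered
```

With a U-shaped exposure profile (start and end of the prompt favored), the most relevant rows end up at the two ends of the input.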

Title: Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning

Authors: Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Yuying Chen, Lihui Jiang, Bingbing Liu, Yingcong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00547
Pdf URL: https://arxiv.org/pdf/2412.00547
Copy Paste: [[2412.00547]] Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning(https://arxiv.org/abs/2412.00547)
Keywords: segmentation
Abstract: Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion results lacking logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose Motion Dreamer, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation, such as a segmentation map or depth map, based on the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to generate a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach allows for more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of ego-cars, our model can produce different effects on other vehicles. Our work opens new avenues in creating models that can reason about physical interactions in a more coherent and realistic manner.

Title: SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains

Authors: Jebish Purbey, Siddhant Gupta, Nikhil Manali, Siddartha Pullakhandam, Drishti Sharma, Ashay Srivastava, Ram Mohan Rao Kadiyala
Subjects: cs.CL, cs.CE, cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.00549
Pdf URL: https://arxiv.org/pdf/2412.00549
Copy Paste: [[2412.00549]] SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains(https://arxiv.org/abs/2412.00549)
Keywords: robust, large language model
Abstract: This paper presents the system description of our entry for the COLING 2025 FMD challenge, focusing on misinformation detection in financial domains. We experimented with a combination of large language models, including Qwen, Mistral, and Gemma-2, and leveraged pre-processing and sequential learning not only to identify fraudulent financial content but also to generate coherent and concise explanations that clarify the rationale behind the classifications. Our approach achieved competitive results with an F1-score of 0.8283 for classification, and ROUGE-1 of 0.7253 for explanations. This work highlights the transformative potential of LLMs in financial applications, offering insights into their capabilities for combating misinformation and enhancing transparency while identifying areas for future improvement in robustness and domain adaptation.

Title: Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective

Authors: Yue Zhou, Barbara Di Eugenio, Lu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00554
Pdf URL: https://arxiv.org/pdf/2412.00554
Copy Paste: [[2412.00554]] Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective(https://arxiv.org/abs/2412.00554)
Keywords: fair, large language model
Abstract: This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLMs' ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.

Title: Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

Authors: Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00556
Pdf URL: https://arxiv.org/pdf/2412.00556
Copy Paste: [[2412.00556]] Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction(https://arxiv.org/abs/2412.00556)
Keywords: large language model
Abstract: Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process text tokens. However, the number of vision tokens increases quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance under a given budget. We start with our main finding that the ranking of vision tokens sorted by attention scores is similar in each layer except the first. Based on this, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer, from shallow to deep. Interestingly, G-Search is able to reach the optimal reduction strategy under our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate popular MLLMs, e.g., LLaVA and InternVL2 models, by more than $2 \times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
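Illustrative sketch (not the authors' implementation): the abstract describes G-Search only at a high level, so the following is one possible reading of a shallow-to-deep greedy search under the stated assumption that the number of kept tokens never increases with depth. The function name `greedy_token_search`, the callback `is_acceptable`, the linear scan, and the toy acceptance rule are all hypothetical choices made for this example.

```python
from typing import Callable, List


def greedy_token_search(
    num_layers: int,
    num_tokens: int,
    is_acceptable: Callable[[List[int]], bool],
    step: int = 1,
) -> List[int]:
    """Greedily shrink the per-layer token budget, from shallow to deep.

    is_acceptable is a hypothetical callback that reports whether a
    per-layer budget keeps model quality at an acceptable level.
    """
    keep = [num_tokens] * num_layers
    for layer in range(num_layers):
        # Assumption from the abstract: budgets do not increase with depth,
        # so the previous layer's budget is an upper bound for this layer.
        upper = keep[layer - 1] if layer > 0 else num_tokens
        best = upper
        k = upper - step
        while k >= 1:
            # Tokens dropped at this layer are unavailable to deeper layers.
            candidate = keep[:layer] + [k] * (num_layers - layer)
            if not is_acceptable(candidate):
                break
            best = k
            k -= step
        keep[layer:] = [best] * (num_layers - layer)
    return keep


# Toy acceptance rule (an assumption): each layer needs a minimum token count.
required = [8, 6, 6, 3]
budget = greedy_token_search(
    num_layers=4,
    num_tokens=10,
    is_acceptable=lambda keep: all(k >= r for k, r in zip(keep, required)),
)
print(budget)  # [8, 6, 6, 3]
```

In practice the acceptance check would be an evaluation of the MLLM on held-out data, which is far more expensive than the toy rule used here.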

Title: Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

Authors: Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00557
Pdf URL: https://arxiv.org/pdf/2412.00557
Copy Paste: [[2412.00557]] Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion(https://arxiv.org/abs/2412.00557)
Keywords: diffusion
Abstract: Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.

Title: Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment

Authors: Łukasz Grzybowski, Jakub Pokrywka, Michał Ciesiółka, Jeremi I. Kaczmarek, Marek Kubis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00559
Pdf URL: https://arxiv.org/pdf/2412.00559
Copy Paste: [[2412.00559]] Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment(https://arxiv.org/abs/2412.00559)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies predominantly focus on English-language contexts. This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora, where the English portion was professionally translated by the examination center for foreign candidates. By creating a structured benchmark from these existing exam questions, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students. Our analysis reveals that while models like GPT-4o achieve near-human performance, significant challenges persist in cross-lingual translation and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.

Title: Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection

Authors: Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00560
Pdf URL: https://arxiv.org/pdf/2412.00560
Copy Paste: [[2412.00560]] Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection(https://arxiv.org/abs/2412.00560)
Keywords: robust
Abstract: Overfitting has long been stigmatized as detrimental to model performance, especially in the context of anomaly detection. Our work challenges this conventional view by introducing a paradigm shift, recasting overfitting as a controllable and strategic mechanism for enhancing model discrimination capabilities. In this paper, we present Controllable Overfitting-based Anomaly Detection (COAD), a novel framework designed to leverage overfitting for optimized anomaly detection. We propose the Aberrance Retention Quotient (ARQ), a novel metric that systematically quantifies the extent of overfitting, enabling the identification of an optimal "golden overfitting interval." Within this interval, overfitting is leveraged to significantly amplify the model's sensitivity to anomalous patterns, while preserving generalization to normal samples. Additionally, we present the Relative Anomaly Distribution Index (RADI), an innovative metric designed to complement AUROC pixel by providing a more versatile and theoretically robust framework for assessing model performance. RADI leverages ARQ to track and evaluate how overfitting impacts anomaly detection, offering an integrated approach to understanding the relationship between overfitting dynamics and model efficacy. Our theoretical work also rigorously validates the use of Gaussian noise in pseudo anomaly synthesis, providing the foundation for its broader applicability across diverse domains. Empirical evaluations demonstrate that our controllable overfitting method not only achieves State of the Art (SOTA) performance in both one-class and multi-class anomaly detection tasks but also redefines overfitting from a modeling challenge into a powerful tool for optimizing anomaly detection.

Title: Continuous Concepts Removal in Text-to-image Diffusion Models

Authors: Tingxu Han, Weisong Sun, Yanrong Hu, Chunrong Fang, Yonglong Zhang, Shiqing Ma, Tao Zheng, Zhenyu Chen, Zhenting Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00580
Pdf URL: https://arxiv.org/pdf/2412.00580
Copy Paste: [[2412.00580]] Continuous Concepts Removal in Text-to-image Diffusion Models(https://arxiv.org/abs/2412.00580)
Keywords: diffusion
Abstract: Text-to-image diffusion models have shown an impressive ability to generate high-quality images from input textual descriptions. However, concerns have been raised about the potential for these models to create content that infringes on copyrights or depicts disturbing subject matter. Removing specific concepts from these models is a promising potential solution to this problem. However, existing methods for concept removal do not work well in practical but challenging scenarios where concepts need to be continuously removed. Specifically, these methods lead to poor alignment between the text prompts and the generated image after the continuous removal process. To address this issue, we propose a novel approach called CCRT that includes a designed knowledge distillation paradigm. It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts generated through our genetic algorithm, which employs a designed fuzzing strategy. We conduct extensive experiments involving the removal of various concepts. The results evaluated through both algorithmic metrics and human studies demonstrate that our CCRT can effectively remove the targeted concepts in a continuous manner while maintaining the high generation quality (e.g., text-image alignment) of the model.

Title: Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects

Authors: Fred Heiding, Simon Lermen, Andrew Kao, Bruce Schneier, Arun Vishwanath
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00586
Pdf URL: https://arxiv.org/pdf/2412.00586
Copy Paste: [[2412.00586]] Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects(https://arxiv.org/abs/2412.00586)
Keywords: attack, large language model
Abstract: In this paper, we evaluate the capability of large language models to conduct personalized phishing attacks and compare their performance with human experts and AI models from last year. We include four email groups with a combined total of 101 participants: a control group of arbitrary phishing emails, which received a click-through rate (recipient pressed a link in the email) of 12%; emails generated by human experts (54% click-through); fully AI-automated emails (54% click-through); and AI emails utilizing a human-in-the-loop (56% click-through). Thus, the AI-automated attacks performed on par with human experts and 350% better than the control group. The results are a significant improvement from similar studies conducted last year, highlighting the increased deceptive capabilities of AI models. Our AI-automated emails were sent using a custom-built tool that automates the entire spear phishing process, including information gathering and creating personalized vulnerability profiles for each target. The AI-gathered information was accurate and useful in 88% of cases and only produced inaccurate profiles for 4% of the participants. We also use language models to detect the intention of emails. Claude 3.5 Sonnet scored well above 90% with low false-positive rates and detected several seemingly benign emails that passed human detection. Lastly, we analyze the economics of phishing, highlighting how AI enables attackers to target more individuals at lower cost and increase profitability by up to 50 times for larger audiences.
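Worked arithmetic for the "350% better" figure: the short sketch below recomputes the relative improvement from the click-through rates quoted in the abstract (12% control, 54% AI-automated). The helper name `relative_improvement` is chosen for this example only.

```python
def relative_improvement(rate: float, baseline: float) -> float:
    """Percentage by which `rate` exceeds `baseline`."""
    return (rate - baseline) / baseline * 100.0


control_ctr = 12.0       # control group click-through rate, in percent
ai_automated_ctr = 54.0  # fully AI-automated emails, in percent

print(relative_improvement(ai_automated_ctr, control_ctr))  # 350.0
print(ai_automated_ctr / control_ctr)                       # 4.5
```

So "350% better" corresponds to a click-through rate 4.5 times that of the control group.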

Title: Generative LiDAR Editing with Controllable Novel Object Layouts

Authors: Shing-Hei Ho, Bao Thach, Minghan Zhu
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00592
Pdf URL: https://arxiv.org/pdf/2412.00592
Copy Paste: [[2412.00592]] Generative LiDAR Editing with Controllable Novel Object Layouts(https://arxiv.org/abs/2412.00592)
Keywords: generative
Abstract: We propose a framework to edit real-world LiDAR scans with novel object layouts while preserving a realistic background environment. Compared to synthetic data generation frameworks where LiDAR point clouds are generated from scratch, our framework focuses on new scenario generation in a given background environment, and our method also provides labels for the generated data. This approach ensures the generated data remains relevant to the specific environment, aiding both the development and the evaluation of algorithms in real-world scenarios. Compared with novel view synthesis, our framework allows the creation of counterfactual scenarios with significant changes in the object layout and does not rely on multi-frame optimization. In our framework, object removal and insertion are supported by generative background inpainting and object point cloud completion, and the entire pipeline is built upon spherical voxelization, which realizes the correct LiDAR projective geometry by construction. Experiments show that our framework generates realistic LiDAR scans with object layout changes and benefits the development of LiDAR-based self-driving systems.

Title: PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Authors: Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00596
Pdf URL: https://arxiv.org/pdf/2412.00596
Copy Paste: [[2412.00596]] PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation(https://arxiv.org/abs/2412.00596)
Keywords: diffusion, transformer
Abstract: Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, and cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement compared to T2V prompt enhancers. The source code is available at: this https URL.

Title: TraCS: Trajectory Collection in Continuous Space under Local Differential Privacy

Authors: Ye Zheng, Yidan Hu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00620
Pdf URL: https://arxiv.org/pdf/2412.00620
Copy Paste: [[2412.00620]] TraCS: Trajectory Collection in Continuous Space under Local Differential Privacy(https://arxiv.org/abs/2412.00620)
Keywords: privacy
Abstract: Trajectory collection is fundamental for location-based services but often involves sensitive information, such as a user's daily routine, raising privacy concerns. Local differential privacy (LDP) provides provable privacy guarantees for users, even when the data collector is untrusted. Existing trajectory collection methods ensure LDP only for discrete location spaces, where the number of locations affects their privacy guarantees and trajectory utility. Moreover, the location space is often naturally continuous, such as in flying and sailing trajectories, making these methods unsuitable. This paper proposes two trajectory collection methods that ensure LDP for continuous spaces: TraCS-D, which perturbs the direction and distance of locations, and TraCS-C, which perturbs the Cartesian coordinates of locations. Both methods are theoretically and experimentally analyzed for trajectory utility. TraCS can also be applied to discrete spaces by rounding perturbed locations to the nearest discrete points. It is independent of the number of locations and has only $\Theta(1)$ time complexity in each perturbation generation. Evaluation results on discrete location spaces validate this advantage and show that TraCS outperforms state-of-the-art methods with improved trajectory utility, especially for large privacy parameters.
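Illustrative sketch of the rounding step mentioned above: once a location has been perturbed in continuous space, it can be mapped to the nearest point of a discrete location set. The perturbation mechanism itself is not shown, because the abstract does not specify it. The function name `round_to_nearest`, the brute-force scan, and the toy grid are assumptions made for this example, not the authors' method.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]


def round_to_nearest(perturbed: Point, locations: Iterable[Point]) -> Point:
    """Return the discrete location closest (Euclidean) to a perturbed point."""
    return min(locations, key=lambda loc: math.dist(loc, perturbed))


# Toy discrete location space: corners of a unit square (an assumption).
grid = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print(round_to_nearest((0.8, 0.3), grid))  # (1.0, 0.0)
```

The linear scan is used only for readability; on a regular grid the nearest point can be found by rounding each coordinate directly.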

Title: Exposing LLM Vulnerabilities: Adversarial Scam Detection and Performance

Authors: Chen-Wei Chang, Shailik Sarkar, Shutonu Mitra, Qi Zhang, Hossein Salemi, Hemant Purohit, Fengxiu Zhang, Michin Hong, Jin-Hee Cho, Chang-Tien Lu
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.00621
Pdf URL: https://arxiv.org/pdf/2412.00621
Copy Paste: [[2412.00621]] Exposing LLM Vulnerabilities: Adversarial Scam Detection and Performance(https://arxiv.org/abs/2412.00621)
Keywords: robust, large language model
Abstract: Can we trust Large Language Models (LLMs) to accurately detect scams? This paper investigates the vulnerabilities of LLMs when facing adversarial scam messages in the task of scam detection. We addressed this issue by creating a comprehensive dataset with fine-grained labels of scam messages, including both original and adversarial scam messages. The dataset extended the traditional binary classes for the scam detection task into more nuanced scam types. Our analysis showed how adversarial examples took advantage of the vulnerabilities of an LLM, leading to a high misclassification rate. We evaluated the performance of LLMs on these adversarial scam messages and proposed strategies to improve their robustness.

Title: Visual Modality Prompt for Adapting Vision-Language Object Detectors

Authors: Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00622
Pdf URL: https://arxiv.org/pdf/2412.00622
Copy Paste: [[2412.00622]] Visual Modality Prompt for Adapting Vision-Language Object Detectors(https://arxiv.org/abs/2412.00622)
Keywords: robust
Abstract: The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches tend to compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of inference-friendly task residuals, facilitating more robust adaptation. Empirically, we benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) data, achieving performance comparable to full fine-tuning while preserving the model's zero-shot capability. Our code is available at: this https URL

Title: A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

Authors: Chensheng Peng, Ido Sobol, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu, Or Litany
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00623
Pdf URL: https://arxiv.org/pdf/2412.00623
Copy Paste: [[2412.00623]] A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision(https://arxiv.org/abs/2412.00623)
Keywords: diffusion, generative
Abstract: We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available. Experimental results on object-level and scene-level datasets demonstrate the effectiveness of our framework.

Title: VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Authors: Yogesh Kulkarni, Pooyan Fazli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00624
Pdf URL: https://arxiv.org/pdf/2412.00624
Copy Paste: [[2412.00624]] VideoSAVi: Self-Aligned Video Language Models without Human Supervision(https://arxiv.org/abs/2412.00624)
Keywords: large language model
Abstract: Recent advances in vision-language models (VLMs) have significantly enhanced video understanding tasks. Instruction tuning (i.e., fine-tuning models on datasets of instructions paired with desired outputs) has been key to improving model performance. However, creating diverse instruction-tuning datasets is challenging due to high annotation costs and the complexity of capturing temporal information in videos. Existing approaches often rely on large language models to generate instruction-output pairs, which can limit diversity and lead to responses that lack grounding in the video content. To address this, we propose VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that enables VLMs to generate their own training data without extensive manual annotation. The process involves three stages: (1) generating diverse video-specific questions, (2) producing multiple candidate answers, and (3) evaluating these responses for alignment with the video content. This self-generated data is then used for direct preference optimization (DPO), allowing the model to refine its own high-quality outputs and improve alignment with video content. Our experiments demonstrate that even smaller models (0.5B and 7B parameters) can effectively use this self-training approach, outperforming previous methods and achieving results comparable to those trained on proprietary preference data. VideoSAVi shows significant improvements across multiple benchmarks: up to 28% on multi-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning benchmarks. These results demonstrate the effectiveness of our self-training approach in enhancing video understanding while reducing dependence on proprietary models.
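Illustrative sketch of the preference objective: the abstract states that the self-generated data is used for direct preference optimization (DPO) but does not give the exact loss or hyperparameters. The following shows the commonly used per-pair DPO loss for one (preferred, rejected) answer pair. The function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions for this example, not values from the paper.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is log(2) when the policy matches the reference on both answers.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# The loss falls when the policy favors the preferred answer over the reference.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-3.0, -1.0, -2.0, -2.0))  # True
```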

Title: ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning

Authors: Yang Wu, Huayi Zhang, Yizheng Jiao, Lin Ma, Xiaozhong Liu, Jinhong Yu, Dongyu Zhang, Dezhi Yu, Wei Xu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00631
Pdf URL: https://arxiv.org/pdf/2412.00631
Copy Paste: [[2412.00631]] ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning(https://arxiv.org/abs/2412.00631)
Keywords: robust, large language model
Abstract: Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human-controllable and effective outputs in various domains. In this work, we focus on the data selection problem for task-specific instruction tuning of LLMs. Prevailing methods primarily rely on crafted similarity metrics to select training data that aligns with the test data distribution. The goal is to minimize instruction tuning loss on the test data, ultimately improving performance on the target task. However, it has been widely observed that instruction tuning loss (i.e., cross-entropy loss for next token prediction) in LLMs often fails to exhibit a monotonic relationship with actual task performance. This misalignment undermines the effectiveness of current data selection methods for task-specific instruction tuning. To address this issue, we introduce ROSE, a novel Reward-Oriented inStruction data sElection method which leverages pairwise preference loss as a reward signal to optimize data selection for task-specific instruction tuning. Specifically, ROSE adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related training data points. Experimental results show that by selecting just 5% of the training data using ROSE, our approach can achieve competitive results compared to fine-tuning with the full training dataset, and it surpasses other state-of-the-art data selection methods for task-specific instruction tuning. Our qualitative analysis further confirms the robust generalizability of our method across multiple benchmark datasets and diverse model architectures.

Title: Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis

Authors: Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00638
Pdf URL: https://arxiv.org/pdf/2412.00638
Copy Paste: [[2412.00638]] Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis(https://arxiv.org/abs/2412.00638)
Keywords: diffusion
Abstract: Designing stylized cinemagraphs is challenging due to the difficulty in customizing complex and expressive flow motions. To achieve intuitive and detailed control of the generated cinemagraphs, freehand sketches can convey personalized design requirements better than text inputs alone. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. A latent diffusion model is adopted to generate target stylized landscape images along with realistic versions. Then, a pre-trained object detection model is utilized to segment and obtain masks for the flow regions. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches serve as the conditions to control the generated vector fields in the masked fluid regions with the prompt. To synthesize the cinemagraph frames, the pixels within fluid regions are subsequently warped to the target locations for each timestep using a frame generator. The results verify that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against state-of-the-art generation approaches.

Title: DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation

Authors: Jingyang Xiang, Saiqian Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00648
Pdf URL: https://arxiv.org/pdf/2412.00648
Copy Paste: [[2412.00648]] DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation(https://arxiv.org/abs/2412.00648)
Keywords: large language model
Abstract: Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms can achieve significantly higher accuracy than randomized orthogonal transforms. Notably, the reason behind this phenomenon remains unknown. In this paper, we find that these transformations show substantial improvement in eliminating outliers for common tokens and achieve similar quantization error. The primary reason for the accuracy difference lies in the fact that randomized Hadamard transforms can slightly reduce the quantization error for tokens with massive activations while randomized orthogonal transforms increase the quantization error. Due to the extreme rarity of these tokens and their critical impact on model accuracy, we consider this a long-tail optimization problem, and therefore construct a simple yet effective method: a weighted loss function. Additionally, we propose an optimization strategy for the rotation matrix that involves alternating optimization of quantization parameters while employing orthogonal Procrustes transforms to refine the rotation matrix. This makes the distribution of the rotated activation values more conducive to quantization, especially for tokens with massive activations. Our method, dubbed DFRot, enhances rotated LLMs by making them both Outlier-Free and Massive Activation-Free (dual free). Extensive experiments demonstrate the effectiveness and efficiency of DFRot. By tuning the rotation matrix using just a single sample, DFRot achieves a perplexity improvement of 0.25 and 0.21 on W4A4KV4 and W4A4KV16, respectively, for LLaMA3-8B, a model known for its quantization challenges.
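Illustrative sketch of why rotation helps with outliers: the snippet below builds a normalized Hadamard matrix with the Sylvester construction, applies random sign flips (one common definition of a randomized Hadamard transform, assumed here because the abstract does not define it), and shows that a single massive activation is spread evenly across all coordinates while the vector norm is preserved. The function names and the toy input are hypothetical; DFRot's refinement of the rotation matrix is not shown.

```python
import math
import random
from typing import List


def hadamard(n: int) -> List[List[int]]:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def randomized_hadamard_rotate(x: List[float], seed: int = 0) -> List[float]:
    """Compute (H / sqrt(n)) @ diag(signs) @ x with random +/-1 signs."""
    n = len(x)
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    flipped = [s * v for s, v in zip(signs, x)]
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * v for h, v in zip(row, flipped)) for row in hadamard(n)]


# One massive activation among zeros (a toy input, not data from the paper).
x = [100.0, 0.0, 0.0, 0.0]
y = randomized_hadamard_rotate(x)
print([abs(v) for v in y])               # every entry has magnitude 50.0
print(math.sqrt(sum(v * v for v in y)))  # the norm is preserved: 100.0
```

After rotation the largest magnitude drops from 100 to 50, so a uniform quantizer needs a smaller range to cover the same vector.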

Title: Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics

Authors: Minghao Han, Dingkang Yang, Jiabei Cheng, Xukun Zhang, Linhao Qu, Zizhi Chen, Lihua Zhang
Subjects: cs.CV, q-bio.GN
Abstract URL: https://arxiv.org/abs/2412.00651
Pdf URL: https://arxiv.org/pdf/2412.00651
Copy Paste: [[2412.00651]] Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics(https://arxiv.org/abs/2412.00651)
Keywords: robust
Abstract: Recent advancements in multimodal pre-training models have significantly advanced computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at this https URL.

Title: Multi-Agent Collaboration in Incident Response with Large Language Models

Authors: Zefang Liu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.00652
Pdf URL: https://arxiv.org/pdf/2412.00652
Copy Paste: [[2412.00652]] Multi-Agent Collaboration in Incident Response with Large Language Models(https://arxiv.org/abs/2412.00652)
Keywords: security, attack, large language model
Abstract: Incident response (IR) is a critical aspect of cybersecurity, requiring rapid decision-making and coordinated efforts to address cyberattacks effectively. Leveraging large language models (LLMs) as intelligent agents offers a novel approach to enhancing collaboration and efficiency in IR scenarios. This paper explores the application of LLM-based multi-agent collaboration using the Backdoors & Breaches framework, a tabletop game designed for cybersecurity training. We simulate real-world IR dynamics through various team structures, including centralized, decentralized, and hybrid configurations. By analyzing agent interactions and performance across these setups, we provide insights into optimizing multi-agent collaboration for incident response. Our findings highlight the potential of LLMs to enhance decision-making, improve adaptability, and streamline IR processes, paving the way for more effective and coordinated responses to cyber threats.

Title: Improving Decoupled Posterior Sampling for Inverse Problems using Data Consistency Constraint

Authors: Zhi Qi, Shihong Yuan, Yuyin Yuan, Linling Kuang, Yoshiyuki Kabashima, Xiangming Meng
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00664
Pdf URL: https://arxiv.org/pdf/2412.00664
Copy Paste: [[2412.00664]] Improving Decoupled Posterior Sampling for Inverse Problems using Data Consistency Constraint(https://arxiv.org/abs/2412.00664)
Keywords: diffusion
Abstract: Diffusion models have shown strong performance in solving inverse problems through posterior sampling, but they suffer from errors during earlier steps. To mitigate this issue, several Decoupled Posterior Sampling methods have been recently proposed. However, the reverse process in these methods ignores measurement information, leading to errors that impede effective optimization in subsequent steps. To solve this problem, we propose Guided Decoupled Posterior Sampling (GDPS) by integrating a data consistency constraint in the reverse process. The constraint yields a smoother transition within the optimization process, facilitating more effective convergence toward the target distribution. Furthermore, we extend our method to latent diffusion models and Tweedie's formula, demonstrating its scalability. We evaluate GDPS on the FFHQ and ImageNet datasets across various linear and nonlinear tasks under both standard and challenging conditions. Experimental results demonstrate that GDPS achieves state-of-the-art performance, improving accuracy over existing methods.

Title: Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection

Authors: Yingjian Chen, Lei Zhang, Yakun Niu, Lei Tan, Pei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00665
Pdf URL: https://arxiv.org/pdf/2412.00665
Copy Paste: [[2412.00665]] Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection(https://arxiv.org/abs/2412.00665)
Keywords: extraction, diffusion
Abstract: Diffusion Models enable realistic image generation, raising the risk of misinformation and eroding public trust. Currently, detecting images generated by unseen diffusion models remains challenging due to the limited generalization capabilities of existing methods. To address this issue, we rethink the effectiveness of pre-trained models trained on large-scale, real-world images. Our findings indicate that: 1) Pre-trained models can cluster the features of real images effectively. 2) Models with pre-trained weights can approximate an optimal generalization solution at a specific training step, but it is extremely unstable. Based on these facts, we propose a simple yet effective training method called Learning on Less (LoL). LoL utilizes a random masking mechanism to constrain the model's learning of the unique patterns specific to a certain type of diffusion model, allowing it to focus on less image content. This leverages the inherent strengths of pre-trained weights while enabling a more stable approach to optimal generalization, which results in the extraction of a universal feature that differentiates various diffusion-generated images from real images. Extensive experiments on the GenImage benchmark demonstrate the remarkable generalization capability of our proposed LoL. With just 1% training data, LoL significantly outperforms the current state-of-the-art, achieving a 13.6% improvement in average ACC across images generated by eight different models.
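
The random masking idea can be sketched as follows. The abstract does not give the patch size, masking ratio, or fill value used by LoL, so those parameters and the function name are assumptions for illustration only.

```python
import random


def mask_random_patches(image, patch, ratio, rng):
    """Return a copy of a 2-D image (list of rows) with a random subset of patches zeroed."""
    height, width = len(image), len(image[0])
    patches = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    chosen = rng.sample(patches, round(ratio * len(patches)))
    masked = [list(row) for row in image]  # leave the input untouched
    for r0, c0 in chosen:
        for r in range(r0, min(r0 + patch, height)):
            for c in range(c0, min(c0 + patch, width)):
                masked[r][c] = 0.0
    return masked


rng = random.Random(0)
image = [[1.0] * 8 for _ in range(8)]  # hypothetical 8x8 single-channel image
masked = mask_random_patches(image, patch=2, ratio=0.75, rng=rng)
```

With a high masking ratio the model sees only a small part of each image, which is one way to read the "focus on less image content" phrasing in the abstract.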

Title: FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

Authors: Yunpeng Bai, Qixing Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00671
Pdf URL: https://arxiv.org/pdf/2412.00671
Copy Paste: [[2412.00671]] FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation(https://arxiv.org/abs/2412.00671)
Keywords: robust, diffusion, generative
Abstract: Monocular Depth Estimation (MDE) is essential for applications like 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust MDE remains challenging due to noisy real-world data and distribution gaps in synthetic datasets. Existing methods often struggle with low efficiency, reduced accuracy, and lack of detail. To address this, we propose an efficient approach for leveraging diffusion priors and introduce FiffDepth, a framework that transforms diffusion-based image generators into a feedforward architecture for detailed depth estimation. By preserving key generative features and integrating the strong generalization capabilities of models like dinov2, FiffDepth achieves enhanced accuracy, stability, and fine-grained detail, offering a significant improvement in MDE performance across diverse real-world scenarios.

Title: ChainGuard: A Blockchain-based Authentication and Access Control Scheme for Distributed Networks

Authors: Faisal Haque Bappy, Joon S. Park, Kamrul Hasan, Tariqul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00677
Pdf URL: https://arxiv.org/pdf/2412.00677
Copy Paste: [[2412.00677]] ChainGuard: A Blockchain-based Authentication and Access Control Scheme for Distributed Networks(https://arxiv.org/abs/2412.00677)
Keywords: security, protect
Abstract: As blockchain technology gains traction for enhancing data security and operational efficiency, traditional centralized authentication systems remain a significant bottleneck. This paper addresses the challenge of integrating decentralized authentication and access control within distributed networks. We propose a novel solution named ChainGuard, a fully decentralized authentication and access control mechanism based on smart contracts. ChainGuard eliminates the need for a central server by leveraging blockchain technology to manage user roles and permissions dynamically. Our scheme supports user interactions across multiple organizations simultaneously, enhancing security, efficiency, and transparency. By addressing key challenges such as scalability, security, and transparency, ChainGuard not only bridges the gap between traditional centralized systems and blockchain's decentralized ethos but also enhances data protection and operational efficiency.

Title: 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

Authors: Jingwei Zhang, Anh Tien Nguyen, Xi Han, Vincent Quoc-Huy Trinh, Hong Qin, Dimitris Samaras, Mahdi S. Hosseini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00678
Pdf URL: https://arxiv.org/pdf/2412.00678
Copy Paste: [[2412.00678]] 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification(https://arxiv.org/abs/2412.00678)
Keywords: transformer, segmentation
Abstract: Efficiently modeling large 2D contexts is essential for various fields including Giga-Pixel Whole Slide Imaging (WSI) and remote sensing. Transformer-based models offer high parallelism but face challenges due to their quadratic complexity for handling long sequences. Recently, Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism, enabling effective and efficient modeling of wide context in 1D sequences. However, extending Mamba to vision tasks, which inherently involve 2D structures, results in spatial discrepancies due to the limitations of 1D sequence processing. On the other hand, current 2D SSMs inherently model 2D structures but they suffer from prohibitively slow computation due to the lack of efficient parallel algorithms. In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware-aware operator, adopting both spatial continuity and computational efficiency. We validate the versatility of our approach on both WSIs and natural images. Extensive experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves up to 2.48% in AUC, 3.11% in F1 score, 2.47% in accuracy and 5.52% in C-index. Additionally, integrating our method with VMamba for natural imaging yields 0.5 to 0.7 improvements in mIoU on the ADE20k semantic segmentation dataset, and 0.2% accuracy improvement on ImageNet-1K classification dataset. Our code is available at this https URL.

Title: SEAM: A Secure Automated and Maintainable Smart Contract Upgrade Framework

Authors: Tahrim Hossain, Faisal Haque Bappy, Tarannum Shaila Zaman, Tariqul Islam
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2412.00680
Pdf URL: https://arxiv.org/pdf/2412.00680
Copy Paste: [[2412.00680]] SEAM: A Secure Automated and Maintainable Smart Contract Upgrade Framework(https://arxiv.org/abs/2412.00680)
Keywords: secure, security, robust
Abstract: This work addresses the critical challenges of upgrading smart contracts, which are vital for trust in automated transactions but difficult to modify once deployed. To address this issue, we propose SEAM, a novel framework that automates the conversion of standard Solidity contracts into upgradable versions using the diamond pattern. SEAM simplifies the upgrade process and addresses two key vulnerabilities: function selector clashes and storage slot collisions. Additionally, the framework provides tools for efficiently deploying, modifying, and managing smart contract lifecycles. By enhancing contract security and reducing the learning curve for developers, SEAM lays a robust foundation for more flexible and maintainable blockchain applications.

Title: MIMIC: Multimodal Islamophobic Meme Identification and Classification

Authors: S M Jishanul Islam, Sahid Hossain Mustakim, Sadia Ahmmed, Md. Faiyaz Abdullah Sayeedi, Swapnil Khandoker, Syed Tasdid Azam Dhrubo, Nahid Hossain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00681
Pdf URL: https://arxiv.org/pdf/2412.00681
Copy Paste: [[2412.00681]] MIMIC: Multimodal Islamophobic Meme Identification and Classification(https://arxiv.org/abs/2412.00681)
Keywords: transformer
Abstract: Anti-Muslim hate speech has emerged within memes, characterized by context-dependent and rhetorical messages using text and images that seemingly mimic humor but convey Islamophobic sentiments. This work presents a novel dataset and proposes a classifier based on the Vision-and-Language Transformer (ViLT) specifically tailored to identify anti-Muslim hate within memes by integrating both visual and textual representations. Our model leverages joint modal embeddings between meme images and incorporated text to capture nuanced Islamophobic narratives that are unique to meme culture, providing both high detection accuracy and interoperability.

Title: FlashSLAM: Accelerated RGB-D SLAM for Real-Time 3D Scene Reconstruction with Gaussian Splatting

Authors: Phu Pham, Damon Conover, Aniket Bera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00682
Pdf URL: https://arxiv.org/pdf/2412.00682
Copy Paste: [[2412.00682]] FlashSLAM: Accelerated RGB-D SLAM for Real-Time 3D Scene Reconstruction with Gaussian Splatting(https://arxiv.org/abs/2412.00682)
Keywords: robust
Abstract: We present FlashSLAM, a novel SLAM approach that leverages 3D Gaussian Splatting for efficient and robust 3D scene reconstruction. Existing 3DGS-based SLAM methods often fall short in sparse view settings and during large camera movements due to their reliance on gradient descent-based optimization, which is both slow and inaccurate. FlashSLAM addresses these limitations by combining 3DGS with a fast vision-based camera tracking technique, utilizing a pretrained feature matching model and point cloud registration for precise pose estimation in under 80 ms - a 90% reduction in tracking time compared to SplaTAM - without costly iterative rendering. In sparse settings, our method achieves up to a 92% improvement in average tracking accuracy over previous methods. Additionally, it accounts for noise in depth sensors, enhancing robustness when using unspecialized devices such as smartphones. Extensive experiments show that FlashSLAM performs reliably across both sparse and dense settings, in synthetic and real-world environments. Evaluations on benchmark datasets highlight its superior accuracy and efficiency, establishing FlashSLAM as a versatile and high-performance solution for SLAM, advancing the state-of-the-art in 3D reconstruction across diverse applications.

Title: DMFourLLIE: Dual-Stage and Multi-Branch Fourier Network for Low-Light Image Enhancement

Authors: Tongshun Zhang, Pingping Liu, Ming Zhao, Haotian Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00683
Pdf URL: https://arxiv.org/pdf/2412.00683
Copy Paste: [[2412.00683]] DMFourLLIE: Dual-Stage and Multi-Branch Fourier Network for Low-Light Image Enhancement(https://arxiv.org/abs/2412.00683)
Keywords: robust
Abstract: In the Fourier frequency domain, luminance information is primarily encoded in the amplitude component, while spatial structure information is significantly contained within the phase component. Existing low-light image enhancement techniques using Fourier transform have mainly focused on amplifying the amplitude component and simply replicating the phase component, an approach that often leads to color distortions and noise issues. In this paper, we propose a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement (DMFourLLIE) framework to address these limitations by emphasizing the phase component's role in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component and employs a luminance-attention mechanism in the luminance-chrominance color space to precisely control amplitude enhancement. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization process ensures that complex image information is retained, overcoming the limitations of previous methods that neglected the interplay between amplitude and phase. Extensive experiments across multiple datasets demonstrate that DMFourLLIE outperforms current state-of-the-art methods in low-light image enhancement. Our code is available at this https URL.
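
The amplitude/phase split described above can be illustrated on a 1-D signal. This is a minimal sketch, not the DMFourLLIE network: it uses a naive DFT and a single uniform gain, both chosen for this example, whereas the paper learns how to enhance each component.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def enhance_amplitude(signal, gain):
    """Scale the amplitude of every frequency by `gain` while copying the phase unchanged."""
    spectrum = dft(signal)
    boosted = [cmath.rect(gain * abs(c), cmath.phase(c)) for c in spectrum]
    return [v.real for v in idft(boosted)]


dark_row = [0.10, 0.20, 0.05, 0.30]  # hypothetical low-light pixel intensities
bright_row = enhance_amplitude(dark_row, gain=2.0)
```

A uniform amplitude gain with the phase copied through simply rescales brightness and leaves structure untouched, which is the baseline behavior the abstract says prior methods relied on.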

Title: Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

Authors: Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00684
Pdf URL: https://arxiv.org/pdf/2412.00684
Copy Paste: [[2412.00684]] Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding(https://arxiv.org/abs/2412.00684)
Keywords: robust, generative
Abstract: Visual grounding aims to localize the image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to effectively learn visual grounding under data-scarce settings in this paper. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box, then Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to identify the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Experimental results show that POBF achieves superior performance across four datasets, delivering an average improvement of 5.83% and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, data ratios, and model architectures.

Title: LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

Authors: Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00686
Pdf URL: https://arxiv.org/pdf/2412.00686
Copy Paste: [[2412.00686]] LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models(https://arxiv.org/abs/2412.00686)
Keywords: robust
Abstract: Counting is a fundamental skill for various visual tasks in real-life applications, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) struggle with counting tasks, especially when the number of objects exceeds those commonly encountered during training. We enhance LVLMs' counting abilities using a divide-and-conquer approach, breaking counting problems into sub-counting tasks. Unlike prior methods, which do not generalize well to counting datasets on which they have not been trained, our method performs well on new datasets without any additional training or fine-tuning. We demonstrate that our approach enhances counting capabilities across various datasets and benchmarks.

Title: Towards Privacy-Preserving Medical Imaging: Federated Learning with Differential Privacy and Secure Aggregation Using a Modified ResNet Architecture

Authors: Mohamad Haj Fares, Ahmed Mohamed Saad Emam Saad
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.00687
Pdf URL: https://arxiv.org/pdf/2412.00687
Copy Paste: [[2412.00687]] Towards Privacy-Preserving Medical Imaging: Federated Learning with Differential Privacy and Secure Aggregation Using a Modified ResNet Architecture(https://arxiv.org/abs/2412.00687)
Keywords: secure, privacy, federate
Abstract: With increasing concerns over privacy in healthcare, especially for sensitive medical data, this research introduces a federated learning framework that combines local differential privacy and secure aggregation using Secure Multi-Party Computation for medical image classification. Further, we propose DPResNet, a modified ResNet architecture optimized for differential privacy. Leveraging the BloodMNIST benchmark dataset, we simulate a realistic data-sharing environment across different hospitals, addressing the distinct privacy challenges posed by federated healthcare data. Experimental results indicate that our privacy-preserving federated model achieves accuracy levels close to non-private models, surpassing traditional approaches while maintaining strict data confidentiality. By enhancing the privacy, efficiency, and reliability of healthcare data management, our approach offers substantial benefits to patients, healthcare providers, and the broader healthcare ecosystem.
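
A toy version of combining local noise with secure aggregation is sketched below. It is an assumption-laden illustration, not the paper's protocol: the noise scale is not calibrated to any privacy budget, a single seeded generator stands in for pairwise shared secrets, and real secure aggregation typically works over modular integers rather than floats.

```python
import random


def add_local_noise(update, sigma, rng):
    """Perturb a client's update with Gaussian noise before it leaves the client."""
    return [u + rng.gauss(0.0, sigma) for u in update]


def pairwise_masks(num_clients, dim, rng):
    """Build one mask per client so that all masks sum to zero coordinate-wise."""
    masks = [[0.0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = [rng.uniform(-100.0, 100.0) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] += shared[d]  # client i adds the shared mask
                masks[j][d] -= shared[d]  # client j subtracts the same mask
    return masks


rng = random.Random(42)
updates = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-1.0, 1.0, 1.0]]  # hypothetical client updates
noisy = [add_local_noise(u, sigma=0.1, rng=rng) for u in updates]
masks = pairwise_masks(len(noisy), dim=3, rng=rng)
masked = [[v + m for v, m in zip(u, mask)] for u, mask in zip(noisy, masks)]

# The server only sees `masked`; the masks cancel in the sum.
aggregate = [sum(col) for col in zip(*masked)]
expected = [sum(col) for col in zip(*noisy)]
```

Each masked update looks unrelated to the client's real update, yet the server recovers the same total it would have obtained from the noisy updates directly.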

Title: Collaborative Proof-of-Work: A Secure Dynamic Approach to Fair and Efficient Blockchain Mining

Authors: Rizwanul Haque, SM Tareq Aziz, Tahrim Hossain, Faisal Haque Bappy, Muhammad Nur Yanhaona, Tariqul Islam
Subjects: cs.CR, cs.DC, cs.ET
Abstract URL: https://arxiv.org/abs/2412.00690
Pdf URL: https://arxiv.org/pdf/2412.00690
Copy Paste: [[2412.00690]] Collaborative Proof-of-Work: A Secure Dynamic Approach to Fair and Efficient Blockchain Mining(https://arxiv.org/abs/2412.00690)
Keywords: secure, fair
Abstract: Proof-of-Work (PoW) systems face critical challenges, including excessive energy consumption and the centralization of mining power among entities with expensive hardware. Static mining pools exacerbate these issues by reducing competition and undermining the decentralized nature of blockchain networks, leading to economic inequality and inefficiencies in resource allocation. Their reliance on centralized pool managers further introduces vulnerabilities by creating a system that fails to ensure secure and fair reward distribution. This paper introduces a novel Collaborative Proof-of-Work (CPoW) mining approach designed to enhance efficiency and fairness in the Ethereum network. We propose a dynamic mining pool formation protocol that enables miners to collaborate based on their computational capabilities, ensuring fair and secure reward distribution by incorporating mechanisms to accurately verify and allocate rewards. By addressing the centralization and energy inefficiencies of traditional mining, this research contributes to a more sustainable blockchain ecosystem.

Title: Intermediate Outputs Are More Sensitive Than You Think

Authors: Tao Huang, Qingyu Huang, Jiayang Meng
Subjects: cs.CV, cs.CR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00696
Pdf URL: https://arxiv.org/pdf/2412.00696
Copy Paste: [[2412.00696]] Intermediate Outputs Are More Sensitive Than You Think(https://arxiv.org/abs/2412.00696)
Keywords: privacy, protect, attack
Abstract: The increasing reliance on deep computer vision models that process sensitive data has raised significant privacy concerns, particularly regarding the exposure of intermediate results in hidden layers. While traditional privacy risk assessment techniques focus on protecting overall model outputs, they often overlook vulnerabilities within these intermediate representations. Current privacy risk assessment techniques typically rely on specific attack simulations to assess risk, which can be computationally expensive and incomplete. This paper introduces a novel approach to measuring privacy risks in deep computer vision models based on the Degrees of Freedom (DoF) and sensitivity of intermediate outputs, without requiring adversarial attack simulations. We propose a framework that leverages DoF to evaluate the amount of information retained in each layer and combines this with the rank of the Jacobian matrix to assess sensitivity to input variations. This dual analysis enables systematic measurement of privacy risks at various model layers. Our experimental validation on real-world datasets demonstrates the effectiveness of this approach in providing deeper insights into privacy risks associated with intermediate representations.
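
The Jacobian-rank part of the analysis can be sketched numerically. The abstract does not specify how the rank is computed or combined with DoF, so the finite-difference Jacobian, the elimination-based rank, the tolerance, and the toy layer below are all assumptions for illustration.

```python
def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference Jacobian J[i][j] = d f_i / d x_j of f at point x."""
    columns = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)
        columns.append([(a - b) / (2 * eps) for a, b in zip(f_up, f_down)])
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


def matrix_rank(matrix, tol=1e-6):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = max(range(rank, len(rows)), key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) <= tol:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def toy_layer(x):
    """Hypothetical layer whose first two outputs carry the same information."""
    return [x[0] + x[1], 2.0 * (x[0] + x[1]), x[2]]


jacobian = numerical_jacobian(toy_layer, [0.3, -0.7, 1.2])
layer_rank = matrix_rank(jacobian)
```

A rank lower than the input dimension means some input directions leave the layer output unchanged, so the output reveals less about the input along those directions.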

Title: The Forking Way: When TEEs Meet Consensus

Authors: Annika Wilde, Tim Niklas Gruel, Claudio Soriente, Ghassan Karame
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00706
Pdf URL: https://arxiv.org/pdf/2412.00706
Copy Paste: [[2412.00706]] The Forking Way: When TEEs Meet Consensus(https://arxiv.org/abs/2412.00706)
Keywords: secure, attack
Abstract: An increasing number of distributed platforms combine Trusted Execution Environments (TEEs) with blockchains. Indeed, many hail the combination of TEEs and blockchains as a good "marriage": TEEs bring confidential computing to the blockchain while the consensus layer could help defend TEEs from forking attacks. In this paper, we systematize how current blockchain solutions integrate TEEs and to what extent they are secure against forking attacks. To do so, we thoroughly analyze 29 proposals for TEE-based blockchains, ranging from academic proposals to production-ready platforms. We uncover a lack of consensus in the community on how to combine TEEs and blockchains. In particular, we identify four broad means to interconnect TEEs with consensus, analyze their limitations, and discuss possible remedies. Our analysis also reveals previously undocumented forking attacks on three production-ready TEE-based blockchains: Ten, Phala, and the Secret Network. We leverage our analysis to propose effective countermeasures against those vulnerabilities; we responsibly disclosed our findings to the developers of each affected platform.

Title: Protect Your Secrets: Understanding and Measuring Data Exposure in VSCode Extensions

Authors: Yue Liu, Chakkrit Tantithamthavorn, Li Li
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2412.00707
Pdf URL: https://arxiv.org/pdf/2412.00707
Copy Paste: [[2412.00707]] Protect Your Secrets: Understanding and Measuring Data Exposure in VSCode Extensions(https://arxiv.org/abs/2412.00707)
Keywords: security, privacy, protect, steal
Abstract: Recent years have witnessed the emerging trend of extensions in modern Integrated Development Environments (IDEs) like Visual Studio Code (VSCode) that significantly enhance developer productivity. In particular, popular AI coding assistants like GitHub Copilot and Tabnine provide conveniences like automated code completion and debugging. While these extensions offer numerous benefits, they may introduce privacy and security concerns to software developers. However, there is no existing work that systematically analyzes the security and privacy concerns, including the risks of data exposure in VSCode extensions. In this paper, we investigate the security issues of cross-extension interactions in VSCode and shed light on the vulnerabilities caused by data exposure among different extensions. Our study uncovers high-impact security flaws that could allow adversaries to stealthily acquire or manipulate credential-related data (e.g., passwords, API keys, access tokens) from other extensions if not properly handled by extension vendors. To measure their prevalence, we design a novel automated risk detection framework that leverages program analysis and natural language processing techniques to automatically identify potential risks in VSCode extensions. By applying our tool to 27,261 real-world VSCode extensions, we discover that 8.5% of them (i.e., 2,325 extensions) are exposed to credential-related data leakage through various vectors, such as commands, user input, and configurations. Our study sheds light on the security challenges and flaws of the extension-in-IDE paradigm and provides suggestions and recommendations for improving the security of VSCode extensions and mitigating the risks of data exposure.

Title: Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Authors: Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00719
Pdf URL: https://arxiv.org/pdf/2412.00719
Copy Paste: [[2412.00719]] Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation(https://arxiv.org/abs/2412.00719)
Keywords: transformer
Abstract: Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments on various benchmarks validate the effectiveness of our approach and demonstrate superior generation results from both qualitative and quantitative perspectives when compared to state-of-the-art competitors.

Title: Bridging Fairness Gaps: A (Conditional) Distance Covariance Perspective in Fairness Learning

Authors: Ruifan Huang, Haixia Liu
Subjects: cs.LG, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00720
Pdf URL: https://arxiv.org/pdf/2412.00720
Copy Paste: [[2412.00720]] Bridging Fairness Gaps: A (Conditional) Distance Covariance Perspective in Fairness Learning(https://arxiv.org/abs/2412.00720)
Keywords: fair
Abstract: We bridge fairness gaps from a statistical perspective by selectively utilizing either conditional distance covariance or distance covariance statistics as measures to assess the independence between predictions and sensitive attributes. We enhance fairness by incorporating sample (conditional) distance covariance as a manageable penalty term into the machine learning process. Additionally, we present the matrix form of empirical (conditional) distance covariance for parallel calculations to enhance computational efficiency. Theoretically, we provide a proof for the convergence between empirical and population (conditional) distance covariance, establishing necessary guarantees for batch computations. Through experiments conducted on a range of real-world datasets, we have demonstrated that our method effectively bridges the fairness gap in machine learning.
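
The matrix form of the sample distance covariance can be sketched for scalar predictions and a scalar sensitive attribute. This follows the standard double-centering definition of squared sample distance covariance; the penalty weight, the function names, and the toy data are assumptions, and the conditional variant from the paper is not covered.

```python
def centered_distances(values):
    """Double-centered pairwise distance matrix of a list of scalars."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in d]  # equals the column means, since d is symmetric
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_covariance_sq(x, y):
    """Squared sample distance covariance: mean of the elementwise product A * B."""
    n = len(x)
    a, b = centered_distances(x), centered_distances(y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def penalized_loss(base_loss, predictions, sensitive, weight):
    """Task loss plus a fairness penalty; `weight` is a hypothetical hyperparameter."""
    return base_loss + weight * distance_covariance_sq(predictions, sensitive)


predictions = [0.10, 0.40, 0.35, 0.80]  # hypothetical model outputs
sensitive = [0.0, 0.0, 1.0, 1.0]        # hypothetical binary sensitive attribute
penalty = distance_covariance_sq(predictions, sensitive)
```

Because the statistic is a sum over elementwise products of two matrices, it maps directly onto batched matrix operations, which is the parallelism the abstract refers to.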

Title: Decision Transformer vs. Decision Mamba: Analysing the Complexity of Sequential Decision Making in Atari Games

Authors: Ke Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00725
Pdf URL: https://arxiv.org/pdf/2412.00725
Copy Paste: [[2412.00725]] Decision Transformer vs. Decision Mamba: Analysing the Complexity of Sequential Decision Making in Atari Games(https://arxiv.org/abs/2412.00725)
Keywords: transformer
Abstract: This work analyses the disparity in performance between Decision Transformer (DT) and Decision Mamba (DM) in sequence modelling reinforcement learning tasks for different Atari games. The study first observed that DM generally outperformed DT in the games Breakout and Qbert, while DT performed better in more complicated games, such as Hero and Kung Fu Master. To understand these differences, we expanded the number of games to 12 and performed a comprehensive analysis of game characteristics, including action space complexity, visual complexity, average trajectory length, and average steps to the first non-zero reward. In order to further analyse the key factors that impact the disparity in performance between DT and DM, we employ various approaches, including quantifying visual complexity, random forest regression, correlation analysis, and action space simplification strategies. The results indicate that the performance gap between DT and DM is affected by the complex interaction of multiple factors, with the complexity of the action space and visual complexity (particularly evaluated by compression ratio) being the primary determining factors. DM performs well in environments with simple action and visual elements, while DT shows an advantage in games with higher action and visual complexity. Our findings contribute to a deeper understanding of how the game characteristics affect the performance difference in sequential modelling reinforcement learning, potentially guiding the development of future model design and applications for diverse and complex environments.
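
One way to turn compression ratio into a visual-complexity score is sketched below. The abstract does not say which compressor or ratio definition the study uses, so zlib, the compressed-over-raw convention, and the synthetic frames are assumptions for illustration.

```python
import random
import zlib


def compression_ratio(frame_bytes):
    """Compressed size divided by raw size; higher values suggest a visually busier frame."""
    if not frame_bytes:
        raise ValueError("frame must not be empty")
    return len(zlib.compress(frame_bytes, 9)) / len(frame_bytes)


rng = random.Random(0)
flat_frame = bytes([128]) * 4096                               # uniform grey frame
noisy_frame = bytes(rng.randrange(256) for _ in range(4096))   # pixel noise

flat_score = compression_ratio(flat_frame)
noisy_score = compression_ratio(noisy_frame)
```

Averaging such a score over frames sampled from each game would give one per-game number that can be fed into a correlation or regression analysis.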

Title: Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Authors: Naman Deep Singh, Francesco Croce, Matthias Hein
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2412.00727
Pdf URL: https://arxiv.org/pdf/2412.00727
Copy Paste: [[2412.00727]] Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP(https://arxiv.org/abs/2412.00727)
Keywords: attack
Abstract: Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, makes them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves a high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even with only synthetic text-image pairs, i.e., without access to real training data. The code and models are available at this https URL.

Title: Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention

Authors: Ajith Balakrishnan, Sreeja S, Linu Shine
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00731
Pdf URL: https://arxiv.org/pdf/2412.00731
Copy Paste: [[2412.00731]] Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention(https://arxiv.org/abs/2412.00731)
Keywords: transformer
Abstract: Generating 3D models from multi-view 2D RGB images has gained significant attention, extending the capabilities of technologies like Virtual Reality, Robotic Vision, and human-machine interaction. In this paper, we introduce a hybrid strategy combining CNNs and transformers, featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained using a novel Joint Train Separate Optimization (JTSO) algorithm. Encoded features from unordered inputs are transformed into an enhanced feature map by the self-attention layer, decoded into an initial 3D volume, and further refined. Our network generates 3D voxels from single or multiple 2D images from arbitrary viewpoints. Performance evaluations using the ShapeNet datasets show that our approach, combined with JTSO, outperforms state-of-the-art techniques in single and multi-view 3D reconstruction, achieving the highest mean intersection over union (IOU) scores, surpassing other models by 4.2% in single-view reconstruction.
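
For readers unfamiliar with the intersection-over-union (IoU) metric cited above, here is a minimal sketch for voxel grids. It is not the authors' evaluation code. The 0.5 occupancy threshold, the flattened-list representation, and the convention of returning 1.0 when both grids are empty are assumptions.

```python
from typing import Sequence


def voxel_iou(pred: Sequence[float], target: Sequence[int], threshold: float = 0.5) -> float:
    """IoU between a thresholded predicted occupancy grid and a binary ground truth.

    Both grids are passed flattened; voxel i in `pred` corresponds to voxel i in `target`.
    """
    if len(pred) != len(target):
        raise ValueError("grids must have the same number of voxels")
    pred_occ = [p > threshold for p in pred]
    true_occ = [bool(t) for t in target]
    intersection = sum(p and t for p, t in zip(pred_occ, true_occ))
    union = sum(p or t for p, t in zip(pred_occ, true_occ))
    # Assumed convention: two empty grids count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x2x2 grid, flattened to 8 voxels.
pred = [0.9, 0.8, 0.2, 0.6, 0.1, 0.4, 0.7, 0.3]
target = [1, 1, 0, 0, 0, 1, 1, 0]
score = voxel_iou(pred, target)  # 3 shared voxels out of 5 occupied in either grid
```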

Title: Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Authors: Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00733
Pdf URL: https://arxiv.org/pdf/2412.00733
Copy Paste: [[2412.00733]] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks(https://arxiv.org/abs/2412.00733)
Keywords: diffusion, transformer, generative
Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: this https URL.

Title: ChatSplat: 3D Conversational Gaussian Splatting

Authors: Hanlin Chen, Fangyin Wei, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00734
Pdf URL: https://arxiv.org/pdf/2412.00734
Copy Paste: [[2412.00734]] ChatSplat: 3D Conversational Gaussian Splatting(https://arxiv.org/abs/2412.00734)
Keywords: large language model, segmentation
Abstract: Humans naturally interact with their 3D surroundings using language, and modeling 3D language fields for scene understanding and interaction has gained growing interest. This paper introduces ChatSplat, a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space. Unlike existing methods that primarily use CLIP-derived language features focused solely on segmentation, ChatSplat facilitates interaction on three levels: objects, views, and the entire 3D scene. For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model (LLM) for conversation. At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene. For object-level interaction, ChatSplat uses a patch-wise language embedding, unlike LangSplat's pixel-wise language embedding that implicitly includes mask and embedding. Here, we explicitly decouple the language embedding into separate mask and feature map representations, allowing more flexible object-level interaction. To address the challenge of learning 3D Gaussians posed by the complex and diverse distribution of language embeddings used in the LLM, we introduce a learnable normalization technique to standardize these embeddings, facilitating effective learning. Extensive experimental results demonstrate that ChatSplat supports multi-level interactions -- object, view, and scene -- within 3D space, enhancing both understanding and engagement.

Title: Precise Facial Landmark Detection by Dynamic Semantic Aggregation Transformer

Authors: Jun Wan, He Liu, Yujia Wu, Zhihui Lai, Wenwen Min, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00740
Pdf URL: https://arxiv.org/pdf/2412.00740
Copy Paste: [[2412.00740]] Precise Facial Landmark Detection by Dynamic Semantic Aggregation Transformer(https://arxiv.org/abs/2412.00740)
Keywords: transformer
Abstract: At present, deep neural network methods have played a dominant role in the face alignment field. However, they generally use predefined network structures to predict landmarks, which tends to learn general features and leads to mediocre performance, e.g., they perform well on neutral samples but struggle with faces exhibiting large poses or occlusions. Moreover, they cannot effectively deal with semantic gaps and ambiguities among features at different scales, which may hinder them from learning efficient features. To address the above issues, in this paper, we propose a Dynamic Semantic-Aggregation Transformer (DSAT) for more discriminative and representative feature (i.e., specialized feature) learning. Specifically, a Dynamic Semantic-Aware (DSA) model is first proposed to partition samples into subsets and activate the specific pathways for them by estimating the semantic correlations of feature channels, making it possible to learn specialized features from each subset. Then, a novel Dynamic Semantic Specialization (DSS) model is designed to mine the homogeneous information from features at different scales for eliminating the semantic gap and ambiguities and enhancing the representation ability. Finally, by integrating the DSA model and DSS model into our proposed DSAT in both dynamic architecture and dynamic parameter manners, more specialized features can be learned for achieving more precise face alignment. It is interesting to show that harder samples can be handled by activating more feature channels. Extensive experiments on popular face alignment datasets demonstrate that our proposed DSAT outperforms state-of-the-art models. The code is available at this https URL.

Title: CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images

Authors: Jian Liu, Zhen Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00754
Pdf URL: https://arxiv.org/pdf/2412.00754
Copy Paste: [[2412.00754]] CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images(https://arxiv.org/abs/2412.00754)
Keywords: generative
Abstract: The neural radiance field (NeRF) advocates learning the continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this into a generative model, the generative neural radiance field (GRAF) is capable of producing images from random noise z without 3D supervision. In practice, the shape and appearance are modeled by z_s and z_a, respectively, to manipulate them separately during inference. However, it is challenging to represent multiple scenes using a solitary MLP and precisely control the generation of 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model (i.e., CtrlNeRF) that uses a single MLP network to represent multiple scenes with shared weights. Consequently, we manipulated the shape and appearance codes to realize the controllable generation of high-fidelity images with 3D consistency. Moreover, the model enables the synthesis of novel views that do not exist in the training sets via camera pose alteration and feature interpolation. Extensive experiments were conducted to demonstrate its superiority in 3D-aware image generation compared to its counterparts.

Title: DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling

Authors: Xin Xie, Dong Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00759
Pdf URL: https://arxiv.org/pdf/2412.00759
Copy Paste: [[2412.00759]] DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling(https://arxiv.org/abs/2412.00759)
Keywords: robust, diffusion
Abstract: Text-to-image diffusion model alignment is critical for improving the alignment between the generated images and human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose a plug-and-play training-free alignment method, DyMO, for aligning the generated images and human preferences during inference. Apart from text-aware human preference scores, we introduce a semantic alignment objective for enhancing the semantic alignment in the early stages of diffusion, relying on the fact that the attention maps are effective reflections of the semantics in noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.

Title: Learning to Forget using Hypernetworks

Authors: Jose Miguel Lara Rangel, Stefan Schoepf, Jack Foster, David Krueger, Usman Anwar
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.00761
Pdf URL: https://arxiv.org/pdf/2412.00761
Copy Paste: [[2412.00761]] Learning to Forget using Hypernetworks(https://arxiv.org/abs/2412.00761)
Keywords: privacy, attack, diffusion
Abstract: Machine unlearning is gaining increasing attention as a way to remove adversarial data poisoning attacks from already trained models and to comply with privacy and AI regulations. The objective is to unlearn the effect of undesired data from a trained model while maintaining performance on the remaining data. This paper introduces HyperForget, a novel machine unlearning framework that leverages hypernetworks - neural networks that generate parameters for other networks - to dynamically sample models that lack knowledge of targeted data while preserving essential capabilities. Leveraging diffusion models, we implement two Diffusion HyperForget Networks and use them to sample unlearned models in Proof-of-Concept experiments. The unlearned models obtained zero accuracy on the forget set, while preserving good accuracy on the retain sets, highlighting the potential of HyperForget for dynamic targeted data removal and a promising direction for developing adaptive machine unlearning algorithms.

Title: PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis

Authors: Hao Dong, Wei Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00763
Pdf URL: https://arxiv.org/pdf/2412.00763
Copy Paste: [[2412.00763]] PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis(https://arxiv.org/abs/2412.00763)
Keywords: generative
Abstract: Recently, generative pre-training based models have demonstrated remarkable results on the Aspect-based Sentiment Analysis (ABSA) task. However, previous works overemphasize crafting various templates to paraphrase training targets for enhanced decoding, ignoring the internal optimizations on generative models. Despite notable results achieved by these target-oriented optimization methods, they struggle with complicated long texts since the implicit long-distance relation, e.g., the aspect-opinion relation, is difficult to extract under the position embedding mechanism in generative models. Thus, in this paper, we first clarify the causes of the problem and introduce two sequence optimization strategies: the rule-based static optimization and the score-based dynamic optimization. The rule-based approach relies on a handcrafted priority of dependency relations to reorder the context, while the score-based algorithm dynamically regulates the contextual sequence by calculating word position scores using a neural network. Based on the dynamic optimization structure, we further propose a unified Prompt-based Generative Sequence Optimization network (named PGSO), which jointly optimizes the training target as well as the generative model. Specifically, PGSO contains two components, namely, prompt construction and sequence regulator. The former constructs a task-specific prompt based on unsupervised training objects to fully utilize the pre-trained model. The latter jointly leverages semantic, syntactic and original-sequence information to dynamically regulate the contextual sequence. Our experiments conducted on four ABSA tasks across multiple benchmarks indicate that PGSO outperforms state-of-the-art methods, with an average improvement of 3.52% in F1 score.

Title: SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

Authors: Aihua Pei, Zehua Yang, Shunan Zhu, Ruoxi Cheng, Ju Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00765
Pdf URL: https://arxiv.org/pdf/2412.00765
Copy Paste: [[2412.00765]] SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts(https://arxiv.org/abs/2412.00765)
Keywords: robust, large language model
Abstract: Traditional methods for evaluating the robustness of large language models (LLMs) often rely on standardized benchmarks, which can escalate costs and limit evaluations across varied domains. This paper introduces a novel framework designed to autonomously evaluate the robustness of LLMs by incorporating refined adversarial prompts and domain-constrained knowledge guidelines in the form of knowledge graphs. Our method systematically generates descriptive sentences from domain-constrained knowledge graph triplets to formulate adversarial prompts, enhancing the relevance and challenge of the evaluation. These prompts, generated by the LLM itself and tailored to evaluate its own robustness, undergo a rigorous filtering and refinement process, ensuring that only those with high textual fluency and semantic fidelity are used. This self-evaluation mechanism allows the LLM to evaluate its robustness without the need for external benchmarks. We assess the effectiveness of our framework through extensive testing on both proprietary models like ChatGPT and open-source models such as Llama-3.1, Phi-3, and Mistral. Results confirm that our approach not only reduces dependency on conventional data but also provides a targeted and efficient means of evaluating LLM robustness in constrained domains.

Title: Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting

Authors: Linhai Zhuo, Zheng Wang, Yuqian Fu, Tianwen Qian
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00767
Pdf URL: https://arxiv.org/pdf/2412.00767
Copy Paste: [[2412.00767]] Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting(https://arxiv.org/abs/2412.00767)
Keywords: robust
Abstract: The source-free cross-domain few-shot learning (CD-FSL) task aims to transfer pretrained models to target domains utilizing minimal samples, eliminating the need for source domain data. Addressing this issue requires models to have robust generalization abilities and strong feature representation, aligning with the characteristics of large-scale pretrained models. However, large-scale models tend to lose representational ability in cross-domain scenarios due to limited sample diversity. Given the abundant diversity provided by the semantic modality, this paper leverages the textual modality to enhance training sample diversity with the CLIP model, while also improving model transfer efficiency. Specifically, we propose the SeGD-VPT framework, which is divided into two phases. The first phase aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varying input and enhancing sample diversity. Furthermore, we use diversity descriptions of classes to guide semantically meaningful learning of diversity prompts, proposing random combinations and selections of texts to increase textual diversity. Additionally, deep prompt tuning is introduced to enhance the model's transfer capability. After training in the first phase, support samples with different diversity prompts are input into the CLIP backbone to generate enhanced features. After generation, the second phase trains classifiers using the generated features. Extensive experimental results across several benchmarks verify that our method is comparable to SOTA source-utilized models and attains the best performance under the source-free CD-FSL setting.

Title: A Wave is Worth 100 Words: Investigating Cross-Domain Transferability in Time Series

Authors: Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00772
Pdf URL: https://arxiv.org/pdf/2412.00772
Copy Paste: [[2412.00772]] A Wave is Worth 100 Words: Investigating Cross-Domain Transferability in Time Series(https://arxiv.org/abs/2412.00772)
Keywords: robust
Abstract: Time series analysis is a fundamental data mining task on which supervised training methods based on empirical risk minimization have proven effective for specific tasks and datasets. However, the acquisition of well-annotated data is costly and a large amount of unlabeled series data is under-utilized. Due to distributional shifts across various domains and different patterns of interest across multiple tasks, the problem of cross-domain multi-task migration of time series remains a significant challenge. To address these problems, this paper proposes a novel cross-domain pretraining method based on Wave Quantization (termed WQ4TS), which can be combined with any advanced time series model and applied to multiple downstream tasks. Specifically, we transfer the time series data from different domains into a common spectral latent space, and enable the model to learn the temporal pattern knowledge of different domains directly from the common space and utilize it for the inference of downstream tasks, thereby mitigating the challenge of heterogeneous cross-domain migration. The establishment of the spectral latent space brings at least three benefits: cross-domain migration capability, which adapts to zero- and few-shot scenarios without relying on prior knowledge of the dataset; a generally compatible cross-domain migration framework that does not change the existing model structure; and robust modeling capability, which achieves SOTA results in multiple downstream tasks. To demonstrate the effectiveness of the proposed approach, we conduct extensive experiments on three important tasks: forecasting, imputation, and classification. Three common real-world data scenarios are simulated: full-data, few-shot, and zero-shot. The proposed WQ4TS achieves the best performance on 87.5% of all tasks, and the average improvement of the metrics on all the tasks is up to 34.7%.

Title: DIVD: Deblurring with Improved Video Diffusion Model

Authors: Haoyang Long, Yan Wang, Wendong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00773
Pdf URL: https://arxiv.org/pdf/2412.00773
Copy Paste: [[2412.00773]] DIVD: Deblurring with Improved Video Diffusion Model(https://arxiv.org/abs/2412.00773)
Keywords: diffusion
Abstract: Video deblurring presents a considerable challenge owing to the complexity of blur, which frequently results from a combination of camera shake and object motion. In the field of video deblurring, many previous works have primarily concentrated on distortion-based metrics, such as PSNR. However, this approach often results in a weak correlation with human perception and yields reconstructions that lack realism. Diffusion models and video diffusion models have respectively excelled in the fields of image and video generation, particularly achieving remarkable results in terms of image authenticity and realistic perception. However, due to the computational complexity and challenges inherent in adapting diffusion models, there is still uncertainty regarding the potential of video diffusion models in video deblurring tasks. To explore the viability of video diffusion models in the task of video deblurring, we introduce a diffusion model specifically for this purpose. In this field, leveraging highly correlated information between adjacent frames and addressing the challenge of temporal misalignment are crucial research directions. To tackle these challenges, many improvements based on the video diffusion model are introduced in this work. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of detail in the images while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time the diffusion model has been applied in video deblurring to overcome the limitations mentioned above.

Title: Post-Vaccination COVID-19 Data Analysis: Privacy and Ethics

Authors: Sankha Das, Amit Dua
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00774
Pdf URL: https://arxiv.org/pdf/2412.00774
Copy Paste: [[2412.00774]] Post-Vaccination COVID-19 Data Analysis: Privacy and Ethics(https://arxiv.org/abs/2412.00774)
Keywords: privacy
Abstract: The COVID-19 pandemic has severely affected the world in terms of health, economy and peace. Fortunately, countries are trying to overcome the situation by actively carrying out vaccinations. However, like any other massive operation involving humans such as human resource management, elections, surveys, etc., the vaccination process raises several questions about citizen privacy and misuse of personal data. In most countries, few attempts have been made to verify the vaccination statistics as reported by the health centers. These issues collectively require the solutions of anonymity of citizens' personal information, immutability of vaccination data and easy yet restricted access by adversarial bodies such as the government for the verification and analysis of the data. This paper introduces a blockchain-based application to simulate and monitor the vaccination process. The structure of the data model used in the proposed system is based on the IEEE Standard for Data Format for Blockchain Systems (IEEE 2418.2-2020). The proposed system enables authorized stakeholders to share and access relevant information for the vaccination process chain while preserving citizen privacy and accountability of the system. It is implemented on the Ethereum blockchain and uses a Python API for the simulation and validation of each step of the vaccination process.

Title: Learning Mamba as a Continual Learner

Authors: Chongyang Zhao, Dong Gong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00776
Pdf URL: https://arxiv.org/pdf/2412.00776
Copy Paste: [[2412.00776]] Learning Mamba as a Continual Learner(https://arxiv.org/abs/2412.00776)
Keywords: transformer
Abstract: Continual learning (CL) aims to efficiently learn and accumulate knowledge from a data stream with different distributions. By formulating CL as a sequence prediction task, meta-continual learning (MCL) enables meta-learning an efficient continual learner based on recent advanced sequence models, e.g., Transformers. Although attention-free models (e.g., Linear Transformers) can ideally match CL's essential objective and efficiency requirements, they usually do not perform well in MCL. Considering that the attention-free Mamba achieves excellent performance matching that of Transformers on general sequence modeling tasks, in this paper, we aim to answer a question -- Can attention-free Mamba perform well on MCL? By formulating Mamba with a selective state space model (SSM) for MCL tasks, we propose to meta-learn Mamba as a continual learner, referred to as MambaCL. By incorporating a selectivity regularization, we can effectively train MambaCL. Through comprehensive experiments across various CL tasks, we also explore how Mamba and other models perform in different MCL scenarios. Our experiments and analyses highlight the promising performance and generalization capabilities of Mamba in MCL.

Title: Local vs. Global: Local Land-Use and Land-Cover Models Deliver Higher Quality Maps

Authors: Girmaw Abebe Tadesse, Caleb Robinson, Charles Mwangi, Esther Maina, Joshua Nyakundi, Luana Marotti, Gilles Quentin Hacheme, Hamed Alemohammad, Rahul Dodhia, Juan M. Lavista Ferres
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00777
Pdf URL: https://arxiv.org/pdf/2412.00777
Copy Paste: [[2412.00777]] Local vs. Global: Local Land-Use and Land-Cover Models Deliver Higher Quality Maps(https://arxiv.org/abs/2412.00777)
Keywords: security
Abstract: Approximately 20% of Africa's population suffered from undernourishment, and 868 million people experienced moderate to severe food insecurity in 2022. Land-use and land-cover maps provide crucial insights for addressing food insecurity, e.g., by mapping croplands. The development of global land-cover maps has been facilitated by the increasing availability of earth observation data and advancements in geospatial machine learning. However, these global maps exhibit lower accuracy and inconsistencies in Africa, partly due to the lack of representative training data. To address this issue, we propose a data-centric framework with a teacher-student model setup, which uses diverse data sources of satellite images and label examples to produce local land-cover maps. Our method trains a high-resolution teacher model on images with a resolution of 0.331 m/pixel and a low-resolution student model on publicly available images with a resolution of 10 m/pixel. The student model also utilizes the teacher model's output as its weak label examples through knowledge distillation. We evaluated our framework using Murang'a County, Kenya, as a use case and achieved significant improvements, i.e., 0.14 in the F1 score and 0.21 in Intersection-over-Union, compared to the best global map. Our evaluation also revealed inconsistencies in existing global maps, with a maximum agreement rate of 0.30 among themselves. Insights obtained from our cross-collaborative work can provide valuable guidance to local and national policymakers in making informed decisions to improve resource utilization and food security.

Title: Memories of Forgotten Concepts

Authors: Matan Rusanovsky, Shimon Malnick, Amir Jevnisek, Ohad Fried, Shai Avidan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00782
Pdf URL: https://arxiv.org/pdf/2412.00782
Copy Paste: [[2412.00782]] Memories of Forgotten Concepts(https://arxiv.org/abs/2412.00782)
Keywords: diffusion
Abstract: Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.

Title: EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

Authors: Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00784
Pdf URL: https://arxiv.org/pdf/2412.00784
Copy Paste: [[2412.00784]] EDTformer: An Efficient Decoder Transformer for Visual Place Recognition(https://arxiv.org/abs/2412.00784)
Keywords: robust, transformer
Abstract: Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches typically focus on the aggregation of deep features extracted from a backbone through using current prominent architectures (e.g., CNNs, MLPs, pooling layer and transformer encoder), giving little attention to the transformer decoder. However, we argue that its strong capability in capturing contextual dependencies and generating accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly generate robust and discriminative global representations for VPR. Specifically, we do this by formulating deep features as the keys and values, as well as a set of independent learnable parameters as the queries. EDTformer can fully utilize the contextual information within deep features, then gradually decode and aggregate the effective features into the learnable queries to form the final global representations. Moreover, to provide powerful deep features for EDTformer and further facilitate the robustness, we use the foundation model DINOv2 as the backbone and propose a Low-Rank Parallel Adaptation (LoPA) method to enhance it, which can refine the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods which add a re-ranking with considerable cost. Code will be available at this https URL.
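
The query/key/value arrangement described above can be illustrated with a stripped-down, single-head cross-attention in plain Python. This is only a sketch of the general idea, not EDTformer itself: it omits the learned projections, the stacked decoder blocks, the final linear layers, and training. The function name `cross_attention`, the random initialisation of the queries, and the toy sizes are assumptions.

```python
import math
import random


def cross_attention(queries, features):
    """Each query attends over all feature vectors (used as both keys and values)."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        # Scaled dot-product scores between this query and every feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim) for f in features]
        # Numerically stable softmax over the features.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the features (values) for this query.
        outputs.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return outputs


random.seed(0)
dim, num_patches, num_queries = 4, 6, 2
# Hypothetical backbone patch features and learnable queries (randomly initialised here).
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

decoded = cross_attention(queries, features)
# Concatenate the decoded queries into one global descriptor.
descriptor = [value for row in decoded for value in row]
```

Because the number of queries is fixed, the descriptor length here stays `num_queries * dim` regardless of how many patch features the backbone produces.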

Title: Online Poisoning Attack Against Reinforcement Learning under Black-box Environments

Authors: Jianhui Li, Bokang Zhang, Junfeng Wu
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.00797
Pdf URL: https://arxiv.org/pdf/2412.00797
Copy Paste: [[2412.00797]] Online Poisoning Attack Against Reinforcement Learning under Black-box Environments(https://arxiv.org/abs/2412.00797)
Keywords: attack
Abstract: This paper proposes an online environment poisoning algorithm tailored for reinforcement learning agents operating in a black-box setting, where an adversary deliberately manipulates training data to lead the agent toward a mischievous policy. In contrast to prior studies that primarily investigate white-box settings, we focus on a scenario characterized by environment dynamics that are unknown to the attacker and a flexible reinforcement learning algorithm employed by the targeted agent. We first propose an attack scheme that is capable of poisoning the reward functions and state transitions. The poisoning task is formalized as a constrained optimization problem, following the framework of \cite{ma2019policy}. Given that the transition probabilities are unknown to the attacker in a black-box environment, we apply a stochastic gradient descent algorithm, where the exact gradients are approximated using sample-based estimates. A penalty-based method along with a bilevel reformulation is then employed to transform the problem into an unconstrained counterpart and to circumvent the double-sampling issue. The algorithm's effectiveness is validated through a maze environment.

Title: A Comprehensive Guide to Explainable AI: From Classical Models to LLMs

Authors: Weiche Hsieh, Ziqian Bi, Chuanqi Jiang, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Xinyuan Song, Tianyang Wang, Junjie Yang, Ming Li, Bowen Jing, Jintao Ren, Junhao Song, Han Xu, Hong-Ming Tseng, Yichao Zhang, Lawrence K.Q. Yan, Qian Niu, Silin Chen, Yunze Wang, Chia Xin Liang, Ming Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00800
Pdf URL: https://arxiv.org/pdf/2412.00800
Copy Paste: [[2412.00800]] A Comprehensive Guide to Explainable AI: From Classical Models to LLMs(https://arxiv.org/abs/2412.00800)
Keywords: federate, fair, interpretability, large language model
Abstract: Explainable Artificial Intelligence (XAI) addresses the growing need for transparency and interpretability in AI systems, enabling trust and accountability in decision-making processes. This book offers a comprehensive guide to XAI, bridging foundational concepts with advanced methodologies. It explores interpretability in traditional models such as Decision Trees, Linear Regression, and Support Vector Machines, alongside the challenges of explaining deep learning architectures like CNNs, RNNs, and Large Language Models (LLMs), including BERT, GPT, and T5. The book presents practical techniques such as SHAP, LIME, Grad-CAM, counterfactual explanations, and causal inference, supported by Python code examples for real-world applications. Case studies illustrate XAI's role in healthcare, finance, and policymaking, demonstrating its impact on fairness and decision support. The book also covers evaluation metrics for explanation quality, an overview of cutting-edge XAI tools and frameworks, and emerging research directions, such as interpretability in federated learning and ethical AI considerations. Designed for a broad audience, this resource equips readers with the theoretical insights and practical skills needed to master XAI. Hands-on examples and additional resources are available at the companion GitHub repository: this https URL.

Title: Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach

Authors: Jingyi Zhao, Yuxuan Ou, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI, q-bio.BM, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.00807
Pdf URL: https://arxiv.org/pdf/2412.00807
Copy Paste: [[2412.00807]] Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach(https://arxiv.org/abs/2412.00807)
Keywords: generative
Abstract: Ionizable lipids are essential in developing lipid nanoparticles (LNPs) for effective messenger RNA (mRNA) delivery. While traditional methods for designing new ionizable lipids are typically time-consuming, deep generative models have emerged as a powerful solution, significantly accelerating the molecular discovery process. However, a practical challenge arises as the molecular structures generated can often be difficult or infeasible to synthesize. This project explores Monte Carlo tree search (MCTS)-based generative models for synthesizable ionizable lipids. Leveraging a synthetically accessible lipid building block dataset and two specialized predictors to guide the search through chemical space, we introduce a policy network guided MCTS generative model capable of producing new ionizable lipids with available synthesis pathways.

Title: Categorical Keypoint Positional Embedding for Robust Animal Re-Identification

Authors: Yuhao Lin, Lingqiao Liu, Javen Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00818
Pdf URL: https://arxiv.org/pdf/2412.00818
Copy Paste: [[2412.00818]] Categorical Keypoint Positional Embedding for Robust Animal Re-Identification(https://arxiv.org/abs/2412.00818)
Keywords: robust, diffusion, transformer
Abstract: Animal re-identification (ReID) has become an indispensable tool in ecological research, playing a critical role in tracking population dynamics, analyzing behavioral patterns, and assessing ecological impacts, all of which are vital for informed conservation strategies. Unlike human ReID, animal ReID faces significant challenges due to the high variability in animal poses, diverse environmental conditions, and the inability to directly apply pre-trained models to animal data, making the identification process across species more complex. This work introduces an innovative keypoint propagation mechanism, which utilizes a single annotated image and a pre-trained diffusion model to propagate keypoints across an entire dataset, significantly reducing the cost of manual annotation. Additionally, we enhance the Vision Transformer (ViT) by implementing Keypoint Positional Encoding (KPE) and Categorical Keypoint Positional Embedding (CKPE), enabling the ViT to learn more robust and semantically-aware representations. This provides more comprehensive and detailed keypoint representations, leading to more accurate and efficient re-identification. Our extensive experimental evaluations demonstrate that this approach significantly outperforms existing state-of-the-art methods across four wildlife datasets. The code will be publicly released.

Title: EventGPT: Event Stream Understanding with Multimodal Large Language Models

Authors: Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, Ming Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00832
Pdf URL: https://arxiv.org/pdf/2412.00832
Copy Paste: [[2412.00832]] EventGPT: Event Stream Understanding with Multimodal Large Language Models(https://arxiv.org/abs/2412.00832)
Keywords: large language model
Abstract: Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. To mitigate the huge domain gaps, we develop a three-stage optimization paradigm to gradually equip a pre-trained LLM with the capability of understanding event-based scenes. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. Firstly, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, referring to LLaVA, as the gap between natural image and language modalities is relatively smaller. Secondly, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.

Title: AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Authors: Yan Li, Yifei Xing, Xiangyuan Lan, Xin Li, Haifeng Chen, Dongmei Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00833
Pdf URL: https://arxiv.org/pdf/2412.00833
Copy Paste: [[2412.00833]] AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment(https://arxiv.org/abs/2412.00833)
Keywords: transformer
Abstract: Cross-modal alignment is crucial for multimodal representation fusion due to the inherent heterogeneity between modalities. While Transformer-based methods have shown promising results in modeling inter-modal relationships, their quadratic computational complexity limits their applicability to long-sequence or large-scale data. Although recent Mamba-based approaches achieve linear complexity, their sequential scanning mechanism poses fundamental challenges in comprehensively modeling cross-modal relationships. To address this limitation, we propose AlignMamba, an efficient and effective method for multimodal fusion. Specifically, grounded in Optimal Transport, we introduce a local cross-modal alignment module that explicitly learns token-level correspondences between different modalities. Moreover, we propose a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce the consistency between different modal distributions. Finally, the unimodal representations after local and global alignment are passed to the Mamba backbone for further cross-modal interaction and multimodal fusion. Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
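
To make the global alignment term above more concrete, here is a minimal sketch of a squared Maximum Mean Discrepancy (MMD) estimate between two sets of feature vectors. It uses the standard biased estimator with a Gaussian kernel. The kernel choice, the bandwidth, and the toy "audio" and "text" features are assumptions for illustration; the abstract does not say how the authors configure their loss.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical token features from two modalities.
audio = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
text_near = [[0.1, 0.0], [0.0, 0.1], [-0.2, 0.1]]   # similar distribution
text_far = [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]     # shifted distribution

aligned_gap = mmd_squared(audio, text_near)
misaligned_gap = mmd_squared(audio, text_far)
```

Used as a training loss, a term of this kind pushes the two modal distributions toward each other without requiring token-level correspondences.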

Title: Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models

Authors: Christian Möller, Niklas Funk, Jan Peters
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00835
Pdf URL: https://arxiv.org/pdf/2412.00835
Copy Paste: [[2412.00835]] Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models(https://arxiv.org/abs/2412.00835)
Keywords: diffusion, generative
Abstract: Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model solely operates on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates upon an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results demonstrate the competitive performance of our approach on the Linemod dataset and showcase the effectiveness of our design choices. Code is available at this https URL .

Title: AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer

Authors: Jin Lyu, Tianyi Zhu, Yi Gu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang, Liang An
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00837
Pdf URL: https://arxiv.org/pdf/2412.00837
Copy Paste: [[2412.00837]] AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer(https://arxiv.org/abs/2412.00837)
Keywords: diffusion, transformer
Abstract: Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and the limited availability of multi-species datasets leave this problem underexplored. To this end, this paper presents AniMer, which estimates animal pose and shape using a family-aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family-aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on the out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications. The project page of AniMer is this https URL.

Title: Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion

Authors: Bohai Gu, Hao Luo, Song Guo, Peiran Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00857
Pdf URL: https://arxiv.org/pdf/2412.00857
Copy Paste: [[2412.00857]] Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion(https://arxiv.org/abs/2412.00857)
Keywords: diffusion
Abstract: Recently, diffusion-based methods have achieved great improvements in the video inpainting task. However, these methods still face many challenges, such as maintaining temporal consistency and keeping processing time low. This paper proposes an advanced video inpainting framework using optical Flow-guided Efficient Diffusion, called FloED. Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch. Additionally, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. By further introducing a flow attention cache mechanism, FloED efficiently reduces the computational cost brought by incorporating optical flow. Comprehensive experiments in both background restoration and object removal tasks demonstrate that FloED outperforms state-of-the-art methods from the perspective of both performance and efficiency.

Title: Deep evolving semi-supervised anomaly detection

Authors: Jack Belham, Aryan Bhosale, Samrat Mukherjee, Biplab Banerjee, Fabio Cuzzolin
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00860
Pdf URL: https://arxiv.org/pdf/2412.00860
Copy Paste: [[2412.00860]] Deep evolving semi-supervised anomaly detection(https://arxiv.org/abs/2412.00860)
Keywords: generative
Abstract: This paper formalises the task of continual semi-supervised anomaly detection (CSAD) and highlights the importance of a problem formulation that assumes conditions as close to the real world as possible. After an overview of the relevant definitions of continual semi-supervised learning, its components, its anomaly detection extension, and the training protocols, the paper introduces a baseline model, a variational autoencoder (VAE) that works with semi-supervised data, along with a continual learning method of deep generative replay with outlier rejection. The results show that such a use of extreme value theory (EVT) applied to anomaly detection can provide promising results even in comparison to an upper baseline of joint training. The results explore the effects of how much labelled and unlabelled data is present, of which class, and where it is located in the data stream. Outlier rejection shows promising initial results where it often surpasses a baseline method of Elastic Weight Consolidation (EWC). A baseline for CSAD is put forward along with the specific dataset setups used for reproducibility and testability for other practitioners. Future research directions include other CSAD settings and further research into efficient continual hyperparameter tuning.

Title: Thermal Vision: Pioneering Non-Invasive Temperature Tracking in Congested Spaces

Authors: Arijit Samal, Haroon R Lone
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00863
Pdf URL: https://arxiv.org/pdf/2412.00863
Copy Paste: [[2412.00863]] Thermal Vision: Pioneering Non-Invasive Temperature Tracking in Congested Spaces(https://arxiv.org/abs/2412.00863)
Keywords: robust
Abstract: Non-invasive temperature monitoring of individuals plays a crucial role in identifying and isolating symptomatic individuals. Temperature monitoring becomes particularly vital in settings characterized by close human proximity, often referred to as dense settings. However, existing research on non-invasive temperature estimation using thermal cameras has predominantly focused on sparse settings. Unfortunately, the risk of disease transmission is significantly higher in dense settings like movie theaters or classrooms. Consequently, there is an urgent need to develop robust temperature estimation methods tailored explicitly for dense settings. Our study proposes a non-invasive temperature estimation system that combines a thermal camera with an edge device. Our system employs YOLO models for face detection and utilizes a regression framework for temperature estimation. We evaluated the system on a diverse dataset collected in dense and sparse settings. Our proposed face detection model achieves an impressive mAP score of over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the regression framework demonstrates remarkable performance with a mean square error of 0.18 °C and an impressive $R^2$ score of 0.96. Our experimental results highlight the developed system's effectiveness, positioning it as a promising solution for continuous temperature monitoring in real-world applications. With this paper, we release our dataset and programming code publicly.

Title: Quantifying perturbation impacts for large language models

Authors: Paulius Rauba, Qiyao Wei, Mihaela van der Schaar
Subjects: cs.LG, cs.CL, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00868
Pdf URL: https://arxiv.org/pdf/2412.00868
Copy Paste: [[2412.00868]] Quantifying perturbation impacts for large language models(https://arxiv.org/abs/2412.00868)
Keywords: interpretability, large language model
Abstract: We consider the problem of quantifying how an input perturbation impacts the outputs of large language models (LLMs), a fundamental task for model reliability and post-hoc interpretability. A key obstacle in this domain is disentangling the meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. To overcome this, we introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA constructs empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. Comparisons of Monte Carlo estimates in the reduced dimensionality space enables tractable frequentist inference without relying on restrictive distributional assumptions. The framework is model-agnostic, supports the evaluation of arbitrary input perturbations on any black-box LLM, yields interpretable p-values, supports multiple perturbation testing via controlled error rates, and provides scalar effect sizes for any chosen similarity or distance metric. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts, showing its versatility for perturbation analysis.
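
The hypothesis-testing framing above can be illustrated with a small permutation test on similarity scores. This is a generic sketch, not the DBPA procedure itself: the similarity values are made-up numbers, and the function name `permutation_p_value` and the choice of a one-sided difference of means as test statistic are assumptions. The "null" scores stand for similarities between repeated responses to the original prompt, and the "alternative" scores for similarities between original and perturbed-prompt responses.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(null_sims, alt_sims, num_permutations=2000, seed=0):
    """One-sided permutation test for 'the perturbation lowers similarity'.

    Returns (effect_size, p_value), where effect_size is the drop in mean
    similarity and p_value is the Monte Carlo permutation estimate.
    """
    rng = random.Random(seed)
    observed = mean(null_sims) - mean(alt_sims)
    pooled = list(null_sims) + list(alt_sims)
    split = len(null_sims)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        statistic = mean(pooled[:split]) - mean(pooled[split:])
        if statistic >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (num_permutations + 1)


# Hypothetical similarity scores in [0, 1].
null_sims = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]   # original vs. original
alt_sims = [0.62, 0.70, 0.58, 0.66, 0.71, 0.64]    # original vs. perturbed

effect_size, p_value = permutation_p_value(null_sims, alt_sims)
_, p_value_no_change = permutation_p_value(null_sims, null_sims)
```

A small p-value suggests that the drop in similarity exceeds what the model's own sampling noise would produce, which is the distinction the abstract highlights.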

Title: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting

Authors: Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, Amit Sheth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00869
Pdf URL: https://arxiv.org/pdf/2412.00869
Copy Paste: [[2412.00869]] Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting(https://arxiv.org/abs/2412.00869)
Keywords: large language model
Abstract: Making analogies is fundamental to cognition. Proportional analogies, which consist of four terms, are often used to assess linguistic and cognitive abilities. For instance, completing analogies like "Oxygen is to Gas as ___ is to ___" requires identifying the semantic relationship (e.g., "type of") between the first pair of terms ("Oxygen" and "Gas") and finding a second pair that shares the same relationship (e.g., "Aluminum" and "Metal"). In this work, we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for proportional analogy completion and evaluate the performance of contemporary Large Language Models (LLMs) in various knowledge-enhanced prompt settings. Specifically, we augment prompts with three types of knowledge: exemplar, structured, and targeted. Our results show that despite extensive training data, solving proportional analogies remains challenging for current LLMs, with the best model achieving an accuracy of 55%. Notably, we find that providing targeted knowledge can better assist models in completing proportional analogies compared to providing exemplars or collections of structured knowledge.

Title: Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00876
Pdf URL: https://arxiv.org/pdf/2412.00876
Copy Paste: [[2412.00876]] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification(https://arxiv.org/abs/2412.00876)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficacy of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill, decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by ~75% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by ~50% under decoding without KV cache, while saving ~50% GPU memory overhead when decoding with KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation or even performance gains compared to the full-context inference baselines. Code is available at this https URL.

Title: Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration

Authors: Haoze Sun, Wenbo Li, Jiayue Liu, Kaiwen Zhou, Yongqiang Chen, Yong Guo, Yanwei Li, Renjing Pei, Long Peng, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00878
Pdf URL: https://arxiv.org/pdf/2412.00878
Copy Paste: [[2412.00878]] Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration(https://arxiv.org/abs/2412.00878)
Keywords: diffusion, generative
Abstract: Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter "generative capability deactivation" when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.

Title: SyncVIS: Synchronized Video Instance Segmentation

Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00882
Pdf URL: https://arxiv.org/pdf/2412.00882
Copy Paste: [[2412.00882]] SyncVIS: Synchronized Video Instance Segmentation(https://arxiv.org/abs/2412.00882)
Keywords: transformer, segmentation
Abstract: Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite remarkable progress, existing works follow asynchronous designs, which model video sequences either via video-level queries only or via query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former aims to promote the mutual learning of frame- and video-level embeddings, and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019, 2021, and 2022 benchmarks and the OVIS benchmark, where SyncVIS achieves state-of-the-art results, demonstrating the effectiveness and generality of the proposed approach. The code is available at this https URL.

Title: Leveraging Intermediate Neural Collapse with Simplex ETFs for Efficient Deep Neural Networks

Authors: Emily Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00884
Pdf URL: https://arxiv.org/pdf/2412.00884
Copy Paste: [[2412.00884]] Leveraging Intermediate Neural Collapse with Simplex ETFs for Efficient Deep Neural Networks(https://arxiv.org/abs/2412.00884)
Keywords: robust, interpretability, transformer
Abstract: Neural collapse is a phenomenon observed during the terminal phase of neural network training, characterized by the convergence of network activations, class means, and linear classifier weights to a simplex equiangular tight frame (ETF), a configuration of vectors that maximizes mutual distance within a subspace. This phenomenon has been linked to improved interpretability, robustness, and generalization in neural networks. However, its potential to guide neural network training and regularization remains underexplored. Previous research has demonstrated that constraining the final layer of a neural network to a simplex ETF can reduce the number of trainable parameters without sacrificing model accuracy. Furthermore, deep fully connected networks exhibit neural collapse not only in the final layer but across all layers beyond a specific effective depth. Using these insights, we propose two novel training approaches: Adaptive-ETF, a generalized framework that enforces simplex ETF constraints on all layers beyond the effective depth, and ETF-Transformer, which applies simplex ETF constraints to the feedforward layers within transformer blocks. We show that these approaches achieve training and testing performance comparable to those of their baseline counterparts while significantly reducing the number of learnable parameters.
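
For readers unfamiliar with the geometry mentioned above, the sketch below builds a simplex equiangular tight frame for K classes using the standard construction sqrt(K/(K-1)) * (I - (1/K) * 1 1^T) and checks its defining property: unit-norm vectors whose pairwise cosine is -1/(K-1). It illustrates only the target configuration, not the Adaptive-ETF or ETF-Transformer training procedures, and it omits the orthogonal map that would embed the frame in a wider feature space. The function names are illustrative.

```python
import math


def simplex_etf(num_classes):
    """Rows are the K vertices of a simplex ETF, expressed in R^K."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


NUM_CLASSES = 4
etf = simplex_etf(NUM_CLASSES)

norms = [math.sqrt(dot(row, row)) for row in etf]
pairwise_cosines = [
    cosine(etf[i], etf[j])
    for i in range(NUM_CLASSES)
    for j in range(i + 1, NUM_CLASSES)
]
```

Because the frame is fixed by the number of classes alone, a layer constrained to it has no free weights to learn, which is consistent with the parameter savings the abstract reports.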

Title: Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

Authors: Kun Qian, Tianyu Sun, Wenhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00890
Pdf URL: https://arxiv.org/pdf/2412.00890
Copy Paste: [[2412.00890]] Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection(https://arxiv.org/abs/2412.00890)
Keywords: robust, interpretability
Abstract: Industrial anomaly detection (IAD) plays a crucial role in the maintenance and quality control of manufacturing processes. In this paper, we propose a novel approach, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which leverages large vision-language models (LVLMs) to improve both anomaly detection and localization in industrial settings. CLAD aligns visual and textual features into a shared embedding space using contrastive learning, ensuring that normal instances are grouped together while anomalies are pushed apart. Through extensive experiments on two benchmark industrial datasets, MVTec-AD and VisA, we demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Additionally, we provide ablation studies and human evaluation to validate the importance of key components in our method. Our approach not only achieves superior performance but also enhances interpretability by accurately localizing anomalies, making it a promising solution for real-world industrial applications.

Title: Symbolic Quantitative Information Flow for Probabilistic Programs

Authors: Philipp Schröer, Francesca Randone, Raúl Pardo, Andrzej Wąsowski
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00907
Pdf URL: https://arxiv.org/pdf/2412.00907
Copy Paste: [[2412.00907]] Symbolic Quantitative Information Flow for Probabilistic Programs(https://arxiv.org/abs/2412.00907)
Keywords: privacy
Abstract: It is of utmost importance to ensure that modern data intensive systems do not leak sensitive information. In this paper, the authors, who met thanks to Joost-Pieter Katoen, discuss symbolic methods to compute information-theoretic measures of leakage: entropy, conditional entropy, Kullback-Leibler divergence, and mutual information. We build on two semantic frameworks for symbolic execution of probabilistic programs. For discrete programs, we use weakest pre-expectation calculus to compute exact symbolic expressions for the leakage measures. Using Second Order Gaussian Approximation (SOGA), we handle programs that combine discrete and continuous distributions. However, in the SOGA setting, we approximate the exact semantics using Gaussian mixtures and compute bounds for the measures. We demonstrate the use of our methods in two widely used mechanisms to ensure differential privacy: randomized response and the Gaussian mechanism.
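
Illustration (not from the paper): for binary randomized response, mutual information, which is one of the leakage measures listed in the abstract, has a simple closed form. The sketch below evaluates it numerically and is not the symbolic method the authors describe. It assumes a secret bit X with P(X=1) = prior and a mechanism that reports X truthfully with probability keep_prob and flips it otherwise. Both parameter names are hypothetical.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def randomized_response_leakage(prior, keep_prob):
    """Mutual information I(X;Y) in bits between secret X and report Y."""
    # P(Y = 1): truthful report of a 1, or flipped report of a 0.
    p_report_one = prior * keep_prob + (1 - prior) * (1 - keep_prob)
    # I(X;Y) = H(Y) - H(Y|X); H(Y|X) is the entropy of the flip noise.
    return binary_entropy(p_report_one) - binary_entropy(keep_prob)

leak = randomized_response_leakage(prior=0.5, keep_prob=0.75)
```

With keep_prob = 0.5 the report is independent of the secret and the leakage is zero. With keep_prob = 1.0 the full bit leaks.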

Title: SOUL: A Semi-supervised Open-world continUal Learning method for Network Intrusion Detection

Authors: Suresh Kumar Amalapuram, Shreya Kumar, Bheemarjuna Reddy Tamma, Sumohana Channappayya
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.00911
Pdf URL: https://arxiv.org/pdf/2412.00911
Copy Paste: [[2412.00911]] SOUL: A Semi-supervised Open-world continUal Learning method for Network Intrusion Detection(https://arxiv.org/abs/2412.00911)
Keywords: security, attack
Abstract: Fully supervised continual learning methods have shown improved attack traffic detection in a closed-world learning setting. However, obtaining fully annotated data is an arduous task in the security domain. Further, our research finds that after training a classifier on two days of network traffic, the performance decay of attack class detection over time (computed using the area under the time on precision-recall AUC of the attack class) drops from 0.985 to 0.506 on testing with three days of new test samples. In this work, we focus on label scarcity and open-world learning (OWL) settings to improve the attack class detection of the continual learning-based network intrusion detection (NID). We formulate OWL for NID as a semi-supervised continual learning-based method, dubbed SOUL, to achieve the classifier performance on par with fully supervised models while using limited annotated data. The proposed method is motivated by our empirical observation that using gradient projection memory (constructed using buffer memory samples) can significantly improve the detection performance of the attack (minority) class when trained using partially labeled data. Further, using the classifier's confidence in conjunction with buffer memory, SOUL generates high-confidence labels whenever it encounters OWL tasks closer to seen tasks, thus acting as a label generator. Interestingly, SOUL efficiently utilizes samples in the buffer memory for sample replay to avoid catastrophic forgetting, construct the projection memory, and assist in generating labels for unseen tasks. The proposed method is evaluated on four standard network intrusion detection datasets, and the performance results are closer to the fully supervised baselines using at most 20% labeled data while reducing the data annotation effort in the range of 11 to 45% for unseen data.

Title: A Deep Generative Model for the Design of Synthesizable Ionizable Lipids

Authors: Yuxuan Ou, Jingyi Zhao, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00928
Pdf URL: https://arxiv.org/pdf/2412.00928
Copy Paste: [[2412.00928]] A Deep Generative Model for the Design of Synthesizable Ionizable Lipids(https://arxiv.org/abs/2412.00928)
Keywords: protect, generative
Abstract: Lipid nanoparticles (LNPs) are vital in modern biomedicine, enabling the effective delivery of mRNA for vaccines and therapies by protecting it from rapid degradation. Among the components of LNPs, ionizable lipids play a key role in RNA protection and facilitate its delivery into the cytoplasm. However, designing ionizable lipids is complex. Deep generative models can accelerate this process and explore a larger candidate space compared to traditional methods. Due to the structural differences between lipids and small molecules, existing generative models used for small molecule generation are unsuitable for lipid generation. To address this, we developed a deep generative model specifically tailored for the discovery of ionizable lipids. Our model generates novel ionizable lipid structures and provides synthesis paths using synthetically accessible building blocks, addressing synthesizability. This advancement holds promise for streamlining the development of lipid-based delivery systems, potentially accelerating the deployment of new therapeutic agents, including mRNA vaccines and gene therapies.

Title: Calibration through the Lens of Interpretability

Authors: Alireza Torabian, Ruth Urner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00943
Pdf URL: https://arxiv.org/pdf/2412.00943
Copy Paste: [[2412.00943]] Calibration through the Lens of Interpretability(https://arxiv.org/abs/2412.00943)
Keywords: interpretability
Abstract: Calibration is a frequently invoked concept when useful label probability estimates are required on top of classification accuracy. A calibrated model is a function whose values correctly reflect underlying label probabilities. Calibration in itself however does not imply classification accuracy, nor human interpretable estimates, nor is it straightforward to verify calibration from finite data. There is a plethora of evaluation metrics (and loss functions) that each assess a specific aspect of a calibration model. In this work, we initiate an axiomatic study of the notion of calibration. We catalogue desirable properties of calibrated models as well as corresponding evaluation metrics and analyze their feasibility and correspondences. We complement this analysis with an empirical evaluation, comparing common calibration methods to employing a simple, interpretable decision tree.
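
Illustration (not from the paper): expected calibration error (ECE) is one widely used binned calibration metric. The abstract does not say which metrics the paper analyzes, so ECE is chosen here only as a familiar example. The sketch assumes equal-width confidence bins and top-label correctness flags.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# 80% confidence with 4 of 5 correct: well calibrated on this sample.
ece_good = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# 90% confidence with 1 of 4 correct: overconfident.
ece_bad = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
```

A low ECE says nothing about accuracy, consistent with the abstract's point that calibration alone does not imply classification accuracy.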

Title: Bilinear Convolution Decomposition for Causal RL Interpretability

Authors: Narmeen Oozeer, Sinem Erisken, Alice Rigg
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00944
Pdf URL: https://arxiv.org/pdf/2412.00944
Copy Paste: [[2412.00944]] Bilinear Convolution Decomposition for Causal RL Interpretability(https://arxiv.org/abs/2412.00944)
Keywords: interpretability
Abstract: Efforts to interpret reinforcement learning (RL) models often rely on high-level techniques such as attribution or probing, which provide only correlational insights and coarse causal control. This work proposes replacing nonlinearities in convolutional neural networks (ConvNets) with bilinear variants, to produce a class of models for which these limitations can be addressed. We show bilinear model variants perform comparably in model-free reinforcement learning settings, and give a side by side comparison on ProcGen environments. Bilinear layers' analytic structure enables weight-based decomposition. Previous work has shown bilinearity enables quantifying functional importance through eigendecomposition, to identify interpretable low rank structure. We show how to adapt the decomposition to convolution layers by applying singular value decomposition to vectors of interest, to separate the channel and spatial dimensions. Finally, we propose a methodology for causally validating concept-based probes, and illustrate its utility by studying a maze-solving agent's ability to track a cheese object.

Title: Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages

Authors: Edward Bayes, Israel Abebe Azime, Jesujoba O. Alabi, Jonas Kgomo, Tyna Eloundou, Elizabeth Proehl, Kai Chen, Imaan Khadir, Naome A. Etori, Shamsuddeen Hassan Muhammad, Choice Mpanza, Igneciah Pocia Thete, Dietrich Klakow, David Ifeoluwa Adelani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.00948
Pdf URL: https://arxiv.org/pdf/2412.00948
Copy Paste: [[2412.00948]] Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages(https://arxiv.org/abs/2412.00948)
Keywords: large language model
Abstract: Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura -- a new benchmark that focuses on two tasks in six typologically-diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models such as GPT-4o and o1-preview, and Claude models, and open-source models like Meta's LLaMA and Google's Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.

Title: STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Authors: Nicholas Lenzen, Amogh Raut, Andrew Melnik
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00949
Pdf URL: https://arxiv.org/pdf/2412.00949
Copy Paste: [[2412.00949]] STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft(https://arxiv.org/abs/2412.00949)
Keywords: generative
Abstract: Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.

Title: ESCAPE: Equivariant Shape Completion via Anchor Point Encoding

Authors: Burak Bekci, Nassir Navab, Federico Tombari, Mahdi Saleh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00952
Pdf URL: https://arxiv.org/pdf/2412.00952
Copy Paste: [[2412.00952]] ESCAPE: Equivariant Shape Completion via Anchor Point Encoding(https://arxiv.org/abs/2412.00952)
Keywords: robust, transformer
Abstract: Shape completion, a crucial task in 3D computer vision, involves predicting and filling the missing regions of scanned or partially observed objects. Current methods expect known pose or canonical coordinates and do not perform well under varying rotations, limiting their real-world applicability. We introduce ESCAPE (Equivariant Shape Completion via Anchor Point Encoding), a novel framework designed to achieve rotation-equivariant shape completion. Our approach employs a distinctive encoding strategy by selecting anchor points from a shape and representing all points as a distance to all anchor points. This enables the model to capture a consistent, rotation-equivariant understanding of the object's geometry. ESCAPE leverages a transformer architecture to encode and decode the distance transformations, ensuring that generated shape completions remain accurate and equivariant under rotational transformations. Subsequently, we perform optimization to calculate the predicted shapes from the encodings. Experimental evaluations demonstrate that ESCAPE achieves robust, high-quality reconstructions across arbitrary rotations and translations, showcasing its effectiveness in real-world applications without additional pose estimation modules.
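
Illustration (not from the paper): the abstract describes representing every point by its distances to a set of anchor points. The sketch below shows why such an encoding is unaffected by a rigid rotation of the shape, since rotations preserve Euclidean distances. The anchors are picked by fixed indices purely for illustration. How ESCAPE actually selects anchors is not stated in the abstract.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def anchor_encoding(points, anchor_indices):
    """Encode each point as its distances to the chosen anchor points."""
    anchors = [points[i] for i in anchor_indices]
    return [[math.dist(p, a) for a in anchors] for p in points]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0),
          (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
anchor_indices = [0, 2, 3]  # hypothetical anchor choice

rotated = [rotate_z(p, 0.7) for p in points]
original_code = anchor_encoding(points, anchor_indices)
rotated_code = anchor_encoding(rotated, anchor_indices)
# The two encodings agree up to floating-point error.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(original_code, rotated_code)
               for a, b in zip(row_a, row_b))
```

Recovering coordinates from such distances requires a separate step, which matches the abstract's mention of an optimization that computes the predicted shapes from the encodings.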

Title: WAFFLE: Multimodal Floorplan Understanding in the Wild

Authors: Keren Ganon, Morris Alper, Rachel Mikulinsky, Hadar Averbuch-Elor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00955
Pdf URL: https://arxiv.org/pdf/2412.00955
Copy Paste: [[2412.00955]] WAFFLE: Multimodal Floorplan Understanding in the Wild(https://arxiv.org/abs/2412.00955)
Keywords: generative, large language model
Abstract: Buildings are a central feature of human culture and are increasingly being analyzed with computational methods. However, recent works on computational building understanding have largely focused on natural imagery of buildings, neglecting the fundamental element defining a building's structure -- its floorplan. Conversely, existing works on floorplan understanding are extremely limited in scope, often focusing on floorplans of a single semantic category and region (e.g. floorplans of apartments from a single country). In this work, we introduce WAFFLE, a novel multimodal floorplan understanding dataset of nearly 20K floorplan images and metadata curated from Internet data spanning diverse building types, locations, and data formats. By using a large language model and multimodal foundation models, we curate and extract semantic information from these images and their accompanying noisy metadata. We show that WAFFLE enables progress on new building understanding tasks, both discriminative and generative, which were not feasible using prior datasets. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.

Title: Token Cropr: Faster ViTs for Quite a Few Tasks

Authors: Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00965
Pdf URL: https://arxiv.org/pdf/2412.00965
Copy Paste: [[2412.00965]] Token Cropr: Faster ViTs for Quite a Few Tasks(https://arxiv.org/abs/2412.00965)
Keywords: transformer, segmentation
Abstract: The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.

Title: Optimal Algorithms for Augmented Testing of Discrete Distributions

Authors: Maryam Aliakbarpour, Piotr Indyk, Ronitt Rubinfeld, Sandeep Silwal
Subjects: cs.LG, cs.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00974
Pdf URL: https://arxiv.org/pdf/2412.00974
Copy Paste: [[2412.00974]] Optimal Algorithms for Augmented Testing of Discrete Distributions(https://arxiv.org/abs/2412.00974)
Keywords: robust
Abstract: We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution $p$, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a predicted data distribution, possibly derived from historical data or predictive machine learning models, is available. We demonstrate that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor's quality, measured by its total variation distance from $p$. A key advantage of our algorithms is their adaptability to the precision of the prediction. Specifically, our algorithms can self-adjust their sample complexity based on the accuracy of the available prediction, operating without any prior knowledge of the estimation's accuracy (i.e. they are consistent). Additionally, we never use more samples than the standard approaches require, even if the predictions provide no meaningful information (i.e. they are also robust). We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach.
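
Illustration (not from the paper): the abstract measures predictor quality by total variation distance from $p$. For discrete distributions this is half the L1 distance between the probability vectors. The sketch below computes it for distributions given as dictionaries. The example distributions are made up.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical true distribution and a slightly wrong prediction of it.
true_dist = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
predicted = {"a": 0.40, "b": 0.20, "c": 0.20, "d": 0.20}
tv = total_variation(true_dist, predicted)
```

The distance is 0 for a perfect prediction and 1 for distributions with disjoint supports. The abstract states that the sample savings depend on where the predictor falls in that range.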

Title: Hierarchical Prompt Decision Transformer: Improving Few-Shot Policy Generalization with Global and Adaptive

Authors: Zhe Wang, Haozhu Wang, Yanjun Qi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00979
Pdf URL: https://arxiv.org/pdf/2412.00979
Copy Paste: [[2412.00979]] Hierarchical Prompt Decision Transformer: Improving Few-Shot Policy Generalization with Global and Adaptive(https://arxiv.org/abs/2412.00979)
Keywords: transformer
Abstract: Decision transformers recast reinforcement learning as a conditional sequence generation problem, offering a simple but effective alternative to traditional value or policy-based methods. A recent key development in this area is the integration of prompting in decision transformers to facilitate few-shot policy generalization. However, current methods mainly use static prompt segments to guide rollouts, limiting their ability to provide context-specific guidance. Addressing this, we introduce a hierarchical prompting approach enabled by retrieval augmentation. Our method learns two layers of soft tokens as guiding prompts: (1) global tokens encapsulating task-level information about trajectories, and (2) adaptive tokens that deliver focused, timestep-specific instructions. The adaptive tokens are dynamically retrieved from a curated set of demonstration segments, ensuring context-aware guidance. Experiments across seven benchmark tasks in the MuJoCo and MetaWorld environments demonstrate the proposed approach consistently outperforms all baseline methods, suggesting that hierarchical prompting for decision transformers is an effective strategy to enable few-shot policy generalization.

Title: Incentivizing Truthful Collaboration in Heterogeneous Federated Learning

Authors: Dimitar Chakarov, Nikita Tsoy, Kristian Minchev, Nikola Konstantinov
Subjects: cs.LG, cs.GT, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00980
Pdf URL: https://arxiv.org/pdf/2412.00980
Copy Paste: [[2412.00980]] Incentivizing Truthful Collaboration in Heterogeneous Federated Learning(https://arxiv.org/abs/2412.00980)
Keywords: federate
Abstract: It is well-known that Federated Learning (FL) is vulnerable to manipulated updates from clients. In this work we study the impact of data heterogeneity on clients' incentives to manipulate their updates. We formulate a game in which clients may upscale their gradient updates in order to ``steer'' the server model to their advantage. We develop a payment rule that disincentivizes sending large gradient updates, and steers the clients towards truthfully reporting their gradients. We also derive explicit bounds on the clients' payments and the convergence rate of the global model, which allows us to study the trade-off between heterogeneity, payments and convergence.

Title: TGTOD: A Global Temporal Graph Transformer for Outlier Detection at Scale

Authors: Kay Liu, Jiahao Ding, MohamadAli Torkamani, Philip S. Yu
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2412.00984
Pdf URL: https://arxiv.org/pdf/2412.00984
Copy Paste: [[2412.00984]] TGTOD: A Global Temporal Graph Transformer for Outlier Detection at Scale(https://arxiv.org/abs/2412.00984)
Keywords: extraction, transformer
Abstract: While Transformers have revolutionized machine learning on various data, existing Transformers for temporal graphs face limitations in (1) restricted receptive fields, (2) overhead of subgraph extraction, and (3) suboptimal generalization capability beyond link prediction. In this paper, we rethink temporal graph Transformers and propose TGTOD, a novel end-to-end Temporal Graph Transformer for Outlier Detection. TGTOD employs global attention to model both structural and temporal dependencies within temporal graphs. To tackle scalability, our approach divides large temporal graphs into spatiotemporal patches, which are then processed by a hierarchical Transformer architecture comprising Patch Transformer, Cluster Transformer, and Temporal Transformer. We evaluate TGTOD on three public datasets under two settings, comparing with a wide range of baselines. Our experimental results demonstrate the effectiveness of TGTOD, achieving AP improvement of 61% on Elliptic. Furthermore, our efficiency evaluation shows that TGTOD reduces training time by 44x compared to existing Transformers for temporal graphs. To foster reproducibility, we make our implementation publicly available.

Title: Seldom: An Anonymity Network with Selective Deanonymization

Authors: Eric Wagner, Roman Matzutt, Martin Henze
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2412.00990
Pdf URL: https://arxiv.org/pdf/2412.00990
Copy Paste: [[2412.00990]] Seldom: An Anonymity Network with Selective Deanonymization(https://arxiv.org/abs/2412.00990)
Keywords: privacy, protect
Abstract: While anonymity networks such as Tor provide invaluable privacy guarantees to society, they also enable all kinds of criminal activities. Consequently, many blameless citizens shy away from protecting their privacy using such technology for the fear of being associated with criminals. To grasp the potential for alternative privacy protection for those users, we design Seldom, an anonymity network with integrated selective deanonymization that disincentivizes criminal activity. Seldom enables law enforcement agencies to selectively access otherwise anonymized identities of misbehaving users, while providing technical guarantees preventing these access rights from being misused. Seldom further ensures translucency, as each access request is approved by a trustworthy consortium of impartial entities and eventually disclosed to the public (without interfering with ongoing investigations). To demonstrate Seldom's feasibility and applicability, we base our implementation on Tor, the most widely used anonymity network. Our evaluation indicates minimal latency, processing, and bandwidth overheads compared to Tor, while Seldom's main costs stem from storing flow records and encrypted identities. With at most 636 TB of storage required in total to retain the encrypted identifiers of a Tor-sized network for two years, Seldom provides a practical and deployable technical solution to the inherent problem of criminal activities in anonymity networks. As such, Seldom sheds new light on the potentials and limitations when integrating selective deanonymization into anonymity networks.

Title: DSSRNN: Decomposition-Enhanced State-Space Recurrent Neural Network for Time-Series Analysis

Authors: Ahmad Mohammadshirazi, Ali Nosratifiroozsalari, Rajiv Ramnath
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00994
Pdf URL: https://arxiv.org/pdf/2412.00994
Copy Paste: [[2412.00994]] DSSRNN: Decomposition-Enhanced State-Space Recurrent Neural Network for Time-Series Analysis(https://arxiv.org/abs/2412.00994)
Keywords: transformer
Abstract: Time series forecasting is a crucial yet challenging task in machine learning, requiring domain-specific knowledge due to its wide-ranging applications. While recent Transformer models have improved forecasting capabilities, they come with high computational costs. Linear-based models have shown better accuracy than Transformers but still fall short of ideal performance. To address these challenges, we introduce the Decomposition State-Space Recurrent Neural Network (DSSRNN), a novel framework designed for both long-term and short-term time series forecasting. DSSRNN uniquely combines decomposition analysis to capture seasonal and trend components with state-space models and physics-based equations. We evaluate DSSRNN's performance on indoor air quality datasets, focusing on CO2 concentration prediction across various forecasting horizons. Results demonstrate that DSSRNN consistently outperforms state-of-the-art models, including transformer-based architectures, in terms of both Mean Squared Error (MSE) and Mean Absolute Error (MAE). For example, at the shortest horizon (T=96) in Office 1, DSSRNN achieved an MSE of 0.378 and an MAE of 0.401, significantly lower than competing models. Additionally, DSSRNN exhibits superior computational efficiency compared to more complex models. While not as lightweight as the DLinear model, DSSRNN achieves a balance between performance and efficiency, with only 0.11G MACs and 437MiB memory usage, and an inference time of 0.58ms for long-term forecasting. This work not only showcases DSSRNN's success but also establishes a new benchmark for physics-informed machine learning in environmental forecasting and potentially other domains.
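
Illustration (not from the paper): the abstract says DSSRNN uses decomposition analysis to separate trend and seasonal components but does not give the procedure. A common choice in linear forecasting models is a moving-average trend with the remainder treated as the seasonal part. The sketch below shows that generic technique under the assumptions of an odd window and edge padding by repetition. It is not the DSSRNN implementation.

```python
def decompose(series, window=5):
    """Split a series into a moving-average trend and a remainder.

    Assumes `window` is odd; edges are padded by repeating end values.
    """
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + window]) / window for i in range(len(series))]
    remainder = [x - t for x, t in zip(series, trend)]
    return trend, remainder

# Hypothetical readings: a slow rise plus a repeating bump.
readings = [400.0, 410.0, 405.0, 420.0, 415.0, 430.0, 425.0, 440.0]
trend, remainder = decompose(readings, window=3)
```

By construction trend[i] + remainder[i] equals the original value, so the two parts can be modeled separately and summed back for the forecast.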

Title: Competition Dynamics Shape Algorithmic Phases of In-Context Learning

Authors: Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, Hidenori Tanaka
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01003
Pdf URL: https://arxiv.org/pdf/2412.01003
Copy Paste: [[2412.01003]] Competition Dynamics Shape Algorithmic Phases of In-Context Learning(https://arxiv.org/abs/2412.01003)
Keywords: large language model
Abstract: In-Context Learning (ICL) has significantly expanded the general-purpose nature of large language models, allowing them to adapt to novel tasks using merely the inputted context. This has motivated a series of papers that analyze tractable synthetic domains and postulate precise mechanisms that may underlie ICL. However, the use of relatively distinct setups that often lack a sequence modeling nature to them makes it unclear how general the reported insights from such studies are. Motivated by this, we propose a synthetic sequence modeling task that involves learning to simulate a finite mixture of Markov chains. As we show, models trained on this task reproduce most well-known results on ICL, hence offering a unified setting for studying the concept. Building on this setup, we demonstrate we can explain a model's behavior by decomposing it into four broad algorithms that combine a fuzzy retrieval vs. inference approach with either unigram or bigram statistics of the context. These algorithms engage in a competition dynamics to dominate model behavior, with the precise experimental conditions dictating which algorithm ends up superseding others: e.g., we find merely varying context size or amount of training yields (at times sharp) transitions between which algorithm dictates the model behavior, revealing a mechanism that explains the transient nature of ICL. In this sense, we argue ICL is best thought of as a mixture of different algorithms, each with its own peculiarities, instead of a monolithic capability. This also implies that making general claims about ICL that hold universally across all settings may be infeasible.
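
Illustration (not from the paper): the synthetic task in the abstract involves sequences drawn from a finite mixture of Markov chains. The sketch below shows one way to generate such data: pick a chain according to the mixture weights, then roll it out. The uniform initial state, the two example transition matrices, and all names are assumptions made for illustration and are not taken from the authors' setup.

```python
import random

def sample_sequence(chains, weights, length, rng):
    """Draw one sequence from a mixture of Markov chains.

    chains: list of row-stochastic transition matrices (lists of lists).
    weights: mixture weights, one per chain.
    """
    matrix = rng.choices(chains, weights=weights, k=1)[0]
    n_states = len(matrix)
    state = rng.randrange(n_states)  # assumed uniform initial state
    sequence = [state]
    for _ in range(length - 1):
        state = rng.choices(range(n_states), weights=matrix[state], k=1)[0]
        sequence.append(state)
    return sequence

# Two hypothetical 2-state chains: one "sticky", one "alternating".
sticky = [[0.9, 0.1], [0.1, 0.9]]
alternating = [[0.1, 0.9], [0.9, 0.1]]
rng = random.Random(0)
example = sample_sequence([sticky, alternating], [0.5, 0.5], 12, rng)
```

A model given such a sequence in context can either identify which known chain produced it or estimate transition statistics directly from the context, which relates to the retrieval versus inference distinction drawn in the abstract.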

Title: e-Fold Cross-Validation for Recommender-System Evaluation

Authors: Moritz Baumgart, Lukas Wegmeth, Tobias Vente, Joeran Beel
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2412.01011
Pdf URL: https://arxiv.org/pdf/2412.01011
Copy Paste: [[2412.01011]] e-Fold Cross-Validation for Recommender-System Evaluation(https://arxiv.org/abs/2412.01011)
Keywords: robust
Abstract: To combat the rising energy consumption of recommender systems, we implement a novel alternative to k-fold cross validation. This alternative, named e-fold cross validation, aims to minimize the number of folds to achieve a reduction in power usage while keeping the reliability and robustness of the test results high. We tested our method on 5 recommender system algorithms across 6 datasets and compared it with 10-fold cross validation. On average, e-fold cross validation needed only 41.5% of the energy that 10-fold cross validation would need, while its results differed by only 1.81%. We conclude that e-fold cross validation is a promising approach that has the potential to be an energy-efficient but still reliable alternative to k-fold cross validation.

Title: Detecting Memorization in Large Language Models

Authors: Eduardo Slonski
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01014
Pdf URL: https://arxiv.org/pdf/2412.01014
Copy Paste: [[2412.01014]] Detecting Memorization in Large Language Models(https://arxiv.org/abs/2412.01014)
Keywords: privacy, interpretability, large language model
Abstract: Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. Traditional methods for detecting memorization rely on output probabilities or loss functions, often lacking precision due to confounding factors like common language patterns. In this paper, we introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. By identifying specific activation patterns that differentiate between memorized and not memorized tokens, we train classification probes that achieve near-perfect accuracy. The approach can also be applied to other mechanisms, such as repetition, as demonstrated in this study, highlighting its versatility. Intervening on these activations allows us to suppress memorization without degrading overall performance, enhancing evaluation integrity by ensuring metrics reflect genuine generalization. Additionally, our method supports large-scale labeling of tokens and sequences, crucial for next-generation AI models, improving training efficiency and results. Our findings contribute to model interpretability and offer practical tools for analyzing and controlling internal mechanisms in LLMs.

Title: Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Authors: Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01027
Pdf URL: https://arxiv.org/pdf/2412.01027
Copy Paste: [[2412.01027]] Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation(https://arxiv.org/abs/2412.01027)
Keywords: diffusion
Abstract: Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$, that can $\textbf{insta}$ntly learn a new image $\textbf{manip}$ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin ($\geq$19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.

Title: Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings

Authors: Razi Mahmood, Pingkun Yan, Diego Machado Reyes, Ge Wang, Mannudeep K. Kalra, Parisa Kaviani, Joy T. Wu, Tanveer Syeda-Mahmood
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.01031
Pdf URL: https://arxiv.org/pdf/2412.01031
Copy Paste: [[2412.01031]] Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings(https://arxiv.org/abs/2412.01031)
Keywords: robust, generative
Abstract: Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings. We then performed phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present results that compare this evaluation metric with other textual metrics on a gold standard dataset derived from the MIMIC collection and show its robustness and sensitivity to factual errors.

Title: SAUP: Situation Awareness Uncertainty Propagation on LLM Agent

Authors: Qiwei Zhao, Xujiang Zhao, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Haifeng Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01033
Pdf URL: https://arxiv.org/pdf/2412.01033
Copy Paste: [[2412.01033]] SAUP: Situation Awareness Uncertainty Propagation on LLM Agent(https://arxiv.org/abs/2412.01033)
Keywords: large language model
Abstract: Large language models (LLMs) integrated into multistep agent systems enable complex decision-making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multistep decision-making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM-based agent's reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step's uncertainty during the propagation. Our method, compatible with various one-step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state-of-the-art methods, achieving up to 20% improvement in AUROC.

Title: TruncFormer: Private LLM Inference Using Only Truncations

Authors: Patrick Yubeaton, Jianqiao Cambridge Mo, Karthik Garimella, Nandan Kumar Jha, Brandon Reagen, Chinmay Hegde, Siddharth Garg
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01042
Pdf URL: https://arxiv.org/pdf/2412.01042
Copy Paste: [[2412.01042]] TruncFormer: Private LLM Inference Using Only Truncations(https://arxiv.org/abs/2412.01042)
Keywords: privacy
Abstract: Private inference (PI) serves an important role in guaranteeing the privacy of user data when interfacing with proprietary machine learning models such as LLMs. However, PI remains practically intractable due to the massive latency costs associated with nonlinear functions present in LLMs. Existing works have focused on improving latency of specific LLM nonlinearities (such as the Softmax, or the GeLU) via approximations. However, new types of nonlinearities are regularly introduced with new LLM architectures, and this has led to a constant game of catch-up where PI researchers attempt to optimize the newest nonlinear function. We introduce TruncFormer, a framework for taking any LLM and transforming it into a plaintext emulation of PI. Our framework leverages the fact that nonlinearities in LLMs are differentiable and can be accurately approximated with a sequence of additions, multiplications, and truncations. Further, we decouple the add/multiply and truncation operations, and statically determine where truncations should be inserted based on a given field size and input representation size. This leads to latency improvements over existing cryptographic protocols that enforce truncation after every multiplication operation. We open source our code for community use.
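
The idea of decoupling multiplications from truncations can be illustrated with ordinary fixed-point arithmetic. The sketch below is not the TruncFormer implementation or any cryptographic protocol; it is a plaintext toy that assumes a hypothetical 8-bit fractional encoding (`FRAC_BITS`) and hypothetical helper names, and it only contrasts truncating after every multiplication with truncating once after accumulating a dot product.

```python
# Toy fixed-point arithmetic; FRAC_BITS and all function names are hypothetical.
FRAC_BITS = 8  # assumed number of fractional bits in the encoding


def encode(x: float) -> int:
    """Map a real number to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def decode(v: int) -> float:
    """Map a fixed-point integer back to a real number."""
    return v / (1 << FRAC_BITS)


def truncate(v: int) -> int:
    """Drop FRAC_BITS low bits, restoring the scale after one multiplication."""
    return v >> FRAC_BITS


def dot_truncate_each(a, b):
    """Dot product with one truncation per multiplication."""
    return sum(truncate(x * y) for x, y in zip(a, b))


def dot_truncate_once(a, b):
    """Dot product that accumulates at double scale and truncates a single time."""
    return truncate(sum(x * y for x, y in zip(a, b)))


a = [encode(0.5), encode(1.25)]
b = [encode(2.0), encode(0.75)]
print(decode(dot_truncate_each(a, b)), decode(dot_truncate_once(a, b)))
```

In this toy, the single-truncation variant performs one shift instead of one per product. Whether a truncation can be deferred in practice depends on the field size and input representation size, which is the static analysis the abstract describes.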

Title: Classifying Simulated Gait Impairments using Privacy-preserving Explainable Artificial Intelligence and Mobile Phone Videos

Authors: Lauhitya Reddy, Ketan Anand, Shoibolina Kaushik, Corey Rodrigo, J. Lucas McKay, Trisha M. Kesar, Hyeokhyen Kwon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01056
Pdf URL: https://arxiv.org/pdf/2412.01056
Copy Paste: [[2412.01056]] Classifying Simulated Gait Impairments using Privacy-preserving Explainable Artificial Intelligence and Mobile Phone Videos(https://arxiv.org/abs/2412.01056)
Keywords: privacy
Abstract: Accurate diagnosis of gait impairments is often hindered by subjective or costly assessment methods, with current solutions either requiring expensive multi-camera equipment or relying on subjective clinical observation. There is a critical need for accessible, objective tools that can aid in gait assessment while preserving patient privacy. In this work, we present a mobile phone-based, privacy-preserving artificial intelligence (AI) system for classifying gait impairments and introduce a novel dataset of 743 videos capturing seven distinct gait patterns. The dataset consists of frontal and sagittal views of trained subjects simulating normal gait and six types of pathological gait (circumduction, Trendelenburg, antalgic, crouch, Parkinsonian, and vaulting), recorded using standard mobile phone cameras. Our system achieved 86.5% accuracy using combined frontal and sagittal views, with sagittal views generally outperforming frontal views except for specific gait patterns like circumduction. Model feature importance analysis revealed that frequency-domain features and entropy measures were critical for classification performance; specifically, lower limb keypoints proved most important for classification, aligning with clinical understanding of gait assessment. These findings demonstrate that mobile phone-based systems can effectively classify diverse gait patterns while preserving privacy through on-device processing. The high accuracy achieved using simulated gait data suggests their potential for rapid prototyping of gait analysis systems, though clinical validation with patient data remains necessary. This work represents a significant step toward accessible, objective gait assessment tools for clinical, community, and tele-rehabilitation settings.

Title: Blindfold: Confidential Memory Management by Untrusted Operating System

Authors: Caihua Li, Seung-seob Lee, Lin Zhong
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2412.01059
Pdf URL: https://arxiv.org/pdf/2412.01059
Copy Paste: [[2412.01059]] Blindfold: Confidential Memory Management by Untrusted Operating System(https://arxiv.org/abs/2412.01059)
Keywords: secure, protect
Abstract: Confidential Computing (CC) has received increasing attention in recent years as a mechanism to protect user data from untrusted operating systems (OSes). Existing CC solutions hide confidential memory from the OS and/or encrypt it to achieve confidentiality. In doing so, they render OS memory optimization unusable or complicate the trusted computing base (TCB) required for optimization. This paper presents our results toward overcoming these limitations, synthesized in a CC design named Blindfold. Like many other CC solutions, Blindfold relies on a small trusted software component running at a higher privilege level than the kernel, called Guardian. It features three techniques that can enhance existing CC solutions. First, instead of nesting page tables, Guardian mediates how the OS accesses memory and handles exceptions by switching page and interrupt tables. Second, Blindfold employs a lightweight capability system to regulate the kernel's semantic access to user memory, unifying case-by-case approaches in previous work. Finally, Blindfold provides a carefully designed secure ABI for confidential memory management without encryption. We report an implementation of Blindfold that works on ARMv8-A/Linux. Using the Blindfold prototype, we are able to evaluate the cost of enabling confidential memory management by the untrusted Linux kernel. We show that Blindfold has a smaller runtime TCB than related systems and enjoys competitive performance. More importantly, we show that the Linux kernel, including all of its memory optimizations except memory compression, can function properly for confidential memory. This requires only about 400 lines of kernel modifications.

Title: Research on Optimizing Real-Time Data Processing in High-Frequency Trading Algorithms using Machine Learning

Authors: Yuxin Fan, Zhuohuan Hu, Lei Fu, Yu Cheng, Liyang Wang, Yuxiang Wang
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.01062
Pdf URL: https://arxiv.org/pdf/2412.01062
Copy Paste: [[2412.01062]] Research on Optimizing Real-Time Data Processing in High-Frequency Trading Algorithms using Machine Learning(https://arxiv.org/abs/2412.01062)
Keywords: extraction
Abstract: High-frequency trading (HFT) represents a pivotal and intensely competitive domain within the financial markets. The velocity and accuracy of data processing exert a direct influence on profitability, underscoring the significance of this field. The objective of this work is to optimise the real-time processing of data in high-frequency trading algorithms. The dynamic feature selection mechanism is responsible for monitoring and analysing market data in real time through clustering and feature weight analysis, with the objective of automatically selecting the most relevant features. This process employs an adaptive feature extraction method, which enables the system to respond and adjust its feature set in a timely manner when the data input changes, thus ensuring the efficient utilisation of data. The lightweight neural networks are designed in a modular fashion, comprising fast convolutional layers and pruning techniques that facilitate the expeditious completion of data processing and output prediction. In contrast to conventional deep learning models, the neural network architecture has been specifically designed to minimise the number of parameters and computational complexity, thereby markedly reducing the inference time. The experimental results demonstrate that the model is capable of maintaining consistent performance in the context of varying market conditions, thereby illustrating its advantages in terms of processing speed and revenue enhancement.

Title: FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Authors: Taekyung Ki, Dongchan Min, Gyoungsu Chae
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01064
Pdf URL: https://arxiv.org/pdf/2412.01064
Copy Paste: [[2412.01064]] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait(https://arxiv.org/abs/2412.01064)
Keywords: diffusion, transformer, generative
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

Title: Lookahead Counterfactual Fairness

Authors: Zhiqun Zuo, Tian Xie, Xuwei Tan, Xueru Zhang, Mohammad Mahdi Khalili
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01065
Pdf URL: https://arxiv.org/pdf/2412.01065
Copy Paste: [[2412.01065]] Lookahead Counterfactual Fairness(https://arxiv.org/abs/2412.01065)
Keywords: fair
Abstract: As machine learning (ML) algorithms are used in applications that involve humans, concerns have arisen that these algorithms may be biased against certain social groups. \textit{Counterfactual fairness} (CF) is a fairness notion proposed in Kusner et al. (2017) that measures the unfairness of ML predictions; it requires that the prediction perceived by an individual in the real world has the same marginal distribution as it would be in a counterfactual world, in which the individual belongs to a different group. Although CF ensures fair ML predictions, it fails to consider the downstream effects of ML predictions on individuals. Since humans are strategic and often adapt their behaviors in response to the ML system, predictions that satisfy CF may not lead to a fair future outcome for the individuals. In this paper, we introduce \textit{lookahead counterfactual fairness} (LCF), a fairness notion accounting for the downstream effects of ML models which requires the individual \textit{future status} to be counterfactually fair. We theoretically identify conditions under which LCF can be satisfied and propose an algorithm based on the theorems. We also extend the concept to path-dependent fairness. Experiments on both synthetic and real data validate the proposed method.

Title: TRUST: A Toolkit for TEE-Assisted Secure Outsourced Computation over Integers

Authors: Bowen Zhao, Jiuhui Li, Peiming Xu, Xiaoguo Li, Qingqi Pei, Yulong Shen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.01073
Pdf URL: https://arxiv.org/pdf/2412.01073
Copy Paste: [[2412.01073]] TRUST: A Toolkit for TEE-Assisted Secure Outsourced Computation over Integers(https://arxiv.org/abs/2412.01073)
Keywords: secure, security, privacy, protect, attack
Abstract: Secure outsourced computation (SOC) provides secure computing services by taking advantage of the computation power of cloud computing and the technology of privacy computing (e.g., homomorphic encryption). Expanding computational operations on encrypted data (e.g., enabling complex calculations directly over ciphertexts) and broadening the applicability of SOC across diverse use cases remain critical yet challenging research topics in the field. Nevertheless, previous SOC solutions frequently lack the computational efficiency and adaptability required to fully meet evolving demands. To this end, in this paper, we propose a toolkit for TEE-assisted (Trusted Execution Environment) SOC over integers, named TRUST. In terms of system architecture, TRUST runs on only a single TEE-equipped cloud server by seamlessly integrating the computation of the REE (Rich Execution Environment) and the TEE. Because a TEE has difficulty storing data permanently and is vulnerable to attacks, we introduce a (2, 2)-threshold homomorphic cryptosystem to fit the hybrid computation between REE and TEE. Additionally, we carefully design a suite of SOC protocols supporting unary, binary and ternary operations. As an application, we present \texttt{SEAT}, secure data trading based on TRUST. Security analysis demonstrates that TRUST enables SOC, avoids collusion attacks among multiple cloud servers, and mitigates potential secret leakage risks within TEE (e.g., from side-channel attacks). Experimental evaluations indicate that TRUST outperforms the state-of-the-art and requires neither data alignment nor any network communication. Furthermore, \texttt{SEAT} is as effective as the \texttt{Baseline} without any data protection.

Title: Multi-Agent Deep Reinforcement Learning for Distributed and Autonomous Platoon Coordination via Speed-regulation over Large-scale Transportation Networks

Authors: Dixiao Wei (1), Peng Yi (1 and 2), Jinlong Lei (1 and 2), Xingyi Zhu (3) ((1) Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, China, (2) Department of Control Science and Engineering, Tongji University, China, (3) Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, China)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01075
Pdf URL: https://arxiv.org/pdf/2412.01075
Copy Paste: [[2412.01075]] Multi-Agent Deep Reinforcement Learning for Distributed and Autonomous Platoon Coordination via Speed-regulation over Large-scale Transportation Networks(https://arxiv.org/abs/2412.01075)
Keywords: robust
Abstract: Truck platooning technology enables a group of trucks to travel closely together, which lets the platoon save fuel, improve traffic flow efficiency, and improve safety. In this paper, we consider the platoon coordination problem in a large-scale transportation network, to promote cooperation among trucks and optimize the overall efficiency. Involving the regulation of both speed and departure times at hubs, we formulate the coordination problem as a complicated dynamic stochastic integer programming problem under network and information constraints. To get an autonomous, distributed, and robust platoon coordination policy, we model the problem as a Decentralized Partially Observable Markov Decision Process. Then, we propose a Multi-Agent Deep Reinforcement Learning framework named Truck Attention-QMIX (TA-QMIX) to train an efficient online decision policy. TA-QMIX utilizes the attention mechanism to enhance the representation of truck fuel gains and delay times, and provides explicit truck cooperation information during the training process, promoting trucks' willingness to cooperate. The training framework adopts centralized training and distributed execution, thus training a policy for trucks to make decisions online using only nearby information. Hence, the policy can be autonomously executed on a large-scale network. Finally, we perform comparison experiments and ablation experiments in the transportation network of the Yangtze River Delta region in China to verify the effectiveness of the proposed framework. In a repeated comparative experiment with 5,000 trucks, our method saves an average of 19.17\% of fuel with an average delay of only 9.57 minutes per truck and a decision time of 0.001 seconds.

Title: Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data

Authors: Shuaijiang Zhao, Tingwei Guo, Bajian Xiang, Tongtang Wan, Qiang Niu, Wei Zou, Xiangang Li
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2412.01078
Pdf URL: https://arxiv.org/pdf/2412.01078
Copy Paste: [[2412.01078]] Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data(https://arxiv.org/abs/2412.01078)
Keywords: large language model
Abstract: GPT-4o represents a significant milestone in enabling real-time interaction with large language models (LLMs) through speech; its remarkably low latency and high fluency not only capture attention but also stimulate research interest in the field. This real-time speech interaction is particularly valuable in scenarios requiring rapid feedback and immediate responses, dramatically enhancing user experience. However, there is a notable lack of research focused on real-time large speech language models, particularly for Chinese. In this work, we present KE-Omni, a seamless large speech language model built upon Ke-SpeechChat, a large-scale high-quality synthetic speech interaction dataset consisting of 7 million Chinese and English conversations, featuring 42,002 speakers, and totaling over 60,000 hours. This contributes significantly to the advancement of research and development in this field. The model, dataset, code and demo can be accessed at \url{this https URL}.

Title: Federated Motor Imagery Classification for Privacy-Preserving Brain-Computer Interfaces

Authors: Tianwang Jia, Lubin Meng, Siyang Li, Jiajing Liu, Dongrui Wu
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2412.01079
Pdf URL: https://arxiv.org/pdf/2412.01079
Copy Paste: [[2412.01079]] Federated Motor Imagery Classification for Privacy-Preserving Brain-Computer Interfaces(https://arxiv.org/abs/2412.01079)
Keywords: privacy, protect, federate
Abstract: Training an accurate classifier for EEG-based brain-computer interface (BCI) requires EEG data from a large number of users, whereas protecting their data privacy is a critical consideration. Federated learning (FL) is a promising solution to this challenge. This paper proposes Federated classification with local Batch-specific batch normalization and Sharpness-aware minimization (FedBS) for privacy protection in EEG-based motor imagery (MI) classification. FedBS utilizes local batch-specific batch normalization to reduce data discrepancies among different clients, and sharpness-aware minimization optimizer in local training to improve model generalization. Experiments on three public MI datasets using three popular deep learning models demonstrated that FedBS outperformed six state-of-the-art FL approaches. Remarkably, it also outperformed centralized training, which does not consider privacy protection at all. In summary, FedBS protects user EEG data privacy, enabling multiple BCI users to participate in large-scale machine learning model training, which in turn improves the BCI decoding accuracy.

Title: STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

Authors: Sunghun Yang, Minhyeok Lee, Suhwan Cho, Jungho Lee, Sangyoun Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01090
Pdf URL: https://arxiv.org/pdf/2412.01090
Copy Paste: [[2412.01090]] STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation(https://arxiv.org/abs/2412.01090)
Keywords: transformer
Abstract: Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.

Title: DuoCast: Duo-Probabilistic Meteorology-Aware Model for Extended Precipitation Nowcasting

Authors: Penghui Wen, Lei Bai, Mengwei He, Patrick Filippi, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01091
Pdf URL: https://arxiv.org/pdf/2412.01091
Copy Paste: [[2412.01091]] DuoCast: Duo-Probabilistic Meteorology-Aware Model for Extended Precipitation Nowcasting(https://arxiv.org/abs/2412.01091)
Keywords: diffusion
Abstract: Extended short-term precipitation nowcasting currently struggles with decreasing precision because of insufficient consideration of meteorological knowledge, such as weather fronts, which significantly influence precipitation intensity, duration, and spatial distribution. Therefore, in this paper, we present DuoCast, a novel dual-probabilistic meteorology-aware model designed to address both broad weather evolution and micro-scale fluctuations using two diffusion models, PrecipFlow and MicroDynamic, respectively. Our PrecipFlow model captures evolution trends through an Extreme Precipitation-Aware Encoder (EPA-Encoder), which includes AirConvolution and FrontAttention blocks to process two levels of precipitation data: general and extreme. The output conditions a UNet-based diffusion model to produce prediction maps enriched with weather front information. The MicroDynamic model further refines the results to capture micro-scale variability. Extensive experiments on four public benchmarks demonstrate the effectiveness of DuoCast, which achieves superior performance over state-of-the-art methods. Our code is available at this https URL.

Title: Automated Extraction of Acronym-Expansion Pairs from Scientific Papers

Authors: Izhar Ali, Million Haileyesus, Serhiy Hnatyshyn, Jan-Lucas Ott, Vasil Hnatyshin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.01093
Pdf URL: https://arxiv.org/pdf/2412.01093
Copy Paste: [[2412.01093]] Automated Extraction of Acronym-Expansion Pairs from Scientific Papers(https://arxiv.org/abs/2412.01093)
Keywords: extraction, large language model
Abstract: This project addresses challenges posed by the widespread use of abbreviations and acronyms in digital texts. We propose a novel method that combines document preprocessing, regular expressions, and a large language model to identify abbreviations and map them to their corresponding expansions. The regular expressions alone are often insufficient to extract expansions, at which point our approach leverages GPT-4 to analyze the text surrounding the acronyms. By limiting the analysis to only a small portion of the surrounding text, we mitigate the risk of obtaining incorrect or multiple expansions for an acronym. There are several known challenges in processing text with acronyms, including polysemous acronyms, non-local and ambiguous acronyms. Our approach enhances the precision and efficiency of NLP techniques by addressing these issues with automated acronym identification and disambiguation. This study highlights the challenges of working with PDF files and the importance of document preprocessing. Furthermore, the results of this work show that neither regular expressions nor GPT-4 alone can perform well. Regular expressions are suitable for identifying acronyms but have limitations in finding their expansions within the paper due to a variety of formats used for expressing acronym-expansion pairs and the tendency of authors to omit expansions within the text. GPT-4, on the other hand, is an excellent tool for obtaining expansions but struggles with correctly identifying all relevant acronyms. Additionally, GPT-4 poses challenges due to its probabilistic nature, which may lead to slightly different results for the same input. Our algorithm employs preprocessing to eliminate irrelevant information from the text, regular expressions for identifying acronyms, and a large language model to help find acronym expansions to provide the most accurate and consistent results.
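
The division of labor described above, with regular expressions finding acronyms and a language model consulted only when no expansion is found locally, can be sketched as follows. This is an illustrative sketch, not the authors' algorithm: both patterns and all function names are assumptions, and the language model step is left out, so the code only reports which acronyms would still need it.

```python
import re

# Hypothetical pattern: a capital letter followed by 1-9 capitals or digits.
ACRONYM_RE = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")
# Hypothetical pattern for the common "Expansion Words (ACR)" convention.
PAREN_RE = re.compile(r"\(([A-Z][A-Z0-9]{1,9})\)")


def find_acronyms(text):
    """Return the sorted set of acronym-like tokens in the text."""
    return sorted(set(ACRONYM_RE.findall(text)))


def find_local_expansions(text):
    """Pair each parenthesized acronym with the words just before it
    when their initials spell the acronym."""
    pairs = {}
    for match in PAREN_RE.finditer(text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            pairs[acronym] = " ".join(candidate)
    return pairs


def unresolved(text):
    """Acronyms with no local expansion; these would need another method."""
    found = find_local_expansions(text)
    return [a for a in find_acronyms(text) if a not in found]


sample = ("Federated learning (FL) is applied to EEG data. "
          "Natural language processing (NLP) is not.")
print(find_local_expansions(sample))
print(unresolved(sample))
```

The initials check is deliberately strict, so it misses expansions that contain stop words or hyphenated terms. That matches the abstract's point that regular expressions alone are insufficient for finding expansions.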

Title: Hiding Faces in Plain Sight: Defending DeepFakes by Disrupting Face Detection

Authors: Delong Zhu, Yuezun Li, Baoyuan Wu, Jiaran Zhou, Zhibo Wang, Siwei Lyu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01101
Pdf URL: https://arxiv.org/pdf/2412.01101
Copy Paste: [[2412.01101]] Hiding Faces in Plain Sight: Defending DeepFakes by Disrupting Face Detection(https://arxiv.org/abs/2412.01101)
Keywords: defense, attack
Abstract: This paper investigates the feasibility of a proactive DeepFake defense framework, {\em FacePoison}, to prevent individuals from becoming victims of DeepFake videos by sabotaging face detection. The motivation stems from the reliance of most DeepFake methods on face detectors to automatically extract victim faces from videos for training or synthesis (testing). Once the face detectors malfunction, the extracted faces will be distorted or incorrect, subsequently disrupting the training or synthesis of the DeepFake model. To achieve this, we adapt various adversarial attacks with a dedicated design for this purpose and thoroughly analyze their feasibility. Based on FacePoison, we introduce {\em VideoFacePoison}, a strategy that propagates FacePoison across video frames rather than applying it individually to each frame. This strategy can largely reduce the computational overhead while retaining the favorable attack performance. Our method is validated on five face detectors, and extensive experiments against eleven different DeepFake models demonstrate the effectiveness of disrupting face detectors to hinder DeepFake generation.

Title: One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Authors: Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.01106
Pdf URL: https://arxiv.org/pdf/2412.01106
Copy Paste: [[2412.01106]] One Shot, One Talk: Whole-body Talking Avatar from a Single Image(https://arxiv.org/abs/2412.01106)
Keywords: diffusion
Abstract: Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.

Title: DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

Authors: Hao Wu, Zhihang Zhong, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01115
Pdf URL: https://arxiv.org/pdf/2412.01115
Copy Paste: [[2412.01115]] DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding(https://arxiv.org/abs/2412.01115)
Keywords: diffusion
Abstract: Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.

Title: Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM

Authors: Alejandro Fontan, Javier Civera, Tobias Fischer, Michael Milford
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01116
Pdf URL: https://arxiv.org/pdf/2412.01116
Copy Paste: [[2412.01116]] Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM(https://arxiv.org/abs/2412.01116)
Keywords: generative
Abstract: Evaluation is critical to both developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems, but is universally reliant on high-quality geometric ground truth -- a resource that is not only costly and time-intensive but, in many cases, entirely unobtainable. This dependency on ground truth restricts SfM and SLAM applications across diverse environments and limits scalability to real-world scenarios. In this work, we propose a novel ground-truth-free (GTF) evaluation methodology that eliminates the need for geometric ground truth, instead using sensitivity estimation via sampling from both original and noisy versions of input images. Our approach shows strong correlation with traditional ground-truth-based benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up new opportunities to leverage a much larger number of dataset sources, and for self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to what has occurred in generative AI.

Title: LoyalDiffusion: A Diffusion Model Guarding Against Data Replication

Authors: Chenghao Li, Yuke Zhang, Dake Chen, Jingqi Xu, Peter A. Beerel
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01118
Pdf URL: https://arxiv.org/pdf/2412.01118
Copy Paste: [[2412.01118]] LoyalDiffusion: A Diffusion Model Guarding Against Data Replication(https://arxiv.org/abs/2412.01118)
Keywords: privacy, diffusion
Abstract: Diffusion models have demonstrated significant potential in image generation. However, their ability to replicate training data presents a privacy risk, particularly when the training data includes confidential information. Existing mitigation strategies primarily focus on augmenting the training dataset, leaving the impact of diffusion model architecture underexplored. In this paper, we address this gap by examining and mitigating the impact of the model structure, specifically the skip connections in the diffusion model's U-Net. We first present our observation on a trade-off in the skip connections. While they enhance image generation quality, they also reinforce the memorization of training data, increasing the risk of replication. To address this, we propose a replication-aware U-Net (RAU-Net) architecture that incorporates information transfer blocks into skip connections that are less essential for image quality. Recognizing the potential impact of RAU-Net on generation quality, we further investigate and identify specific timesteps during which the impact on memorization is most pronounced. By applying RAU-Net selectively at these critical timesteps, we couple our novel diffusion model with a targeted training and inference strategy, forming a framework we refer to as LoyalDiffusion. Extensive experiments demonstrate that LoyalDiffusion outperforms the state-of-the-art replication mitigation method, achieving a 48.63% reduction in replication while maintaining comparable image quality.

Title: Object Tracking in a $360^o$ View: A Novel Perspective on Bridging the Gap to Biomedical Advancements

Authors: Mojtaba S. Fazli, Shannon Quinn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01119
Pdf URL: https://arxiv.org/pdf/2412.01119
Copy Paste: [[2412.01119]] Object Tracking in a $360^o$ View: A Novel Perspective on Bridging the Gap to Biomedical Advancements(https://arxiv.org/abs/2412.01119)
Keywords: defense, robust
Abstract: Object tracking is a fundamental tool in modern innovation, with applications in defense systems, autonomous vehicles, and biomedical research. It enables precise identification, monitoring, and spatiotemporal analysis of objects across sequential frames, providing insights into dynamic behaviors. In cell biology, object tracking is vital for uncovering cellular mechanisms, such as migration, interactions, and responses to drugs or pathogens. These insights drive breakthroughs in understanding disease progression and therapeutic interventions. Over time, object tracking methods have evolved from traditional feature-based approaches to advanced machine learning and deep learning frameworks. While classical methods are reliable in controlled settings, they struggle in complex environments with occlusions, variable lighting, and high object density. Deep learning models address these challenges by delivering greater accuracy, adaptability, and robustness. This review categorizes object tracking techniques into traditional, statistical, feature-based, and machine learning paradigms, with a focus on biomedical applications. These methods are essential for tracking cells and subcellular structures, advancing our understanding of health and disease. Key performance metrics, including accuracy, efficiency, and adaptability, are discussed. The paper explores limitations of current methods and highlights emerging trends to guide the development of next-generation tracking systems for biomedical research and broader scientific domains.

Title: RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

Authors: Geonho Lee, Janghwan Lee, Sukjin Hong, Minsoo Kim, Euijai Ahn, Du-Seong Chang, Jungwook Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01129
Pdf URL: https://arxiv.org/pdf/2412.01129
Copy Paste: [[2412.01129]] RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy(https://arxiv.org/abs/2412.01129)
Keywords: robust, large language model
Abstract: Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with no prior investigation into understanding this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Based on rank analysis revealing the rank-insensitive nature of the model-wise activation discrepancy loss, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.

Title: Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation

Authors: Yi-Chang Chen, Po-Chun Hsu, Chan-Jan Hsu, Da-shan Shiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01130
Pdf URL: https://arxiv.org/pdf/2412.01130
Copy Paste: [[2412.01130]] Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation(https://arxiv.org/abs/2412.01130)
Keywords: large language model
Abstract: Large language models (LLMs) have significantly advanced autonomous agents, particularly in zero-shot tool usage, also known as function calling. This research delves into enhancing the function-calling capabilities of LLMs by exploring different approaches, including prompt formats for integrating function descriptions, blending function-calling and instruction-following data, introducing a novel Decision Token for conditional prompts, leveraging chain-of-thought reasoning, and overcoming multilingual challenges with a translation pipeline. Our key findings and contributions are as follows: (1) Instruction-following data improves both function-calling accuracy and relevance detection. (2) The use of the newly proposed Decision Token, combined with synthetic non-function-call data, enhances relevance detection. (3) A tailored translation pipeline effectively overcomes multilingual limitations, demonstrating significant improvements in Traditional Chinese. These insights highlight the potential for improved function-calling capabilities and multilingual applications in LLMs.

Title: A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Authors: Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01131
Pdf URL: https://arxiv.org/pdf/2412.01131
Copy Paste: [[2412.01131]] A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans(https://arxiv.org/abs/2412.01131)
Keywords: fair
Abstract: Recently, much work has concerned itself with the enigma of what exactly PLMs (pretrained language models) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Only one relation was considered, namely hypernymy. Furthermore, previous work did not measure humans' performance on the same task as that solved by the PLMs. This means that at this point in time, there is only an incomplete view of models' semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use six metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability, and fairly compare humans and models on the same task. Our extensive experiments involve 16 PLMs, eight masked and eight causal language models. Up to now, only masked language models had been tested, although causal and masked language models treat context differently. Our results reveal a significant knowledge gap between humans and models for almost all semantic relations. Antonymy is the outlier relation where all models perform reasonably well. In general, masked language models perform significantly better than causal language models. Nonetheless, both masked and causal language models are likely to confuse non-antonymy relations with antonymy.

Title: Referring Video Object Segmentation via Language-aligned Track Selection

Authors: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01136
Pdf URL: https://arxiv.org/pdf/2412.01136
Copy Paste: [[2412.01136]] Referring Video Object Segmentation via Language-aligned Track Selection(https://arxiv.org/abs/2412.01136)
Keywords: robust, segmentation
Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at this https URL

Title: TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Authors: Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01137
Pdf URL: https://arxiv.org/pdf/2412.01137
Copy Paste: [[2412.01137]] TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition(https://arxiv.org/abs/2412.01137)
Keywords: diffusion
Abstract: Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information to create a text region, which is less complex than the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR scales well, and we construct an anagram-based TextSSR-F dataset of 0.4 million complex and realistic text instances. Experiments show that models trained with added TextSSR-F data exhibit better accuracy than models trained on 4 million existing synthetic samples. Moreover, its accuracy margin to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential in scene text image synthesis. Our code is available at this https URL.

Title: A2VIS: Amodal-Aware Approach to Video Instance Segmentation

Authors: Minh Tran, Thang Pham, Winston Bounsavy, Tri Nguyen, Ngan Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01147
Pdf URL: https://arxiv.org/pdf/2412.01147
Copy Paste: [[2412.01147]] A2VIS: Amodal-Aware Approach to Video Instance Segmentation(https://arxiv.org/abs/2412.01147)
Keywords: segmentation
Abstract: Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation through spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information intra clips while extracting amodal characteristics inter clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.

Title: R.I.P.: A Simple Black-box Attack on Continual Test-time Adaptation

Authors: Trung-Hieu Hoang, Duc Minh Vo, Minh N. Do
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01154
Pdf URL: https://arxiv.org/pdf/2412.01154
Copy Paste: [[2412.01154]] R.I.P.: A Simple Black-box Attack on Continual Test-time Adaptation(https://arxiv.org/abs/2412.01154)
Keywords: attack
Abstract: Test-time adaptation (TTA) has emerged as a promising solution to tackle the continual domain shift in machine learning by allowing model parameters to change at test time, via self-supervised learning on unlabeled testing data. At the same time, it unfortunately opens the door to unforeseen vulnerabilities that cause degradation over time. Through a simple theoretical continual TTA model, we successfully identify a risk in the sampling process of testing data that could easily degrade the performance of a continual TTA model. We name this risk Reusing of Incorrect Prediction (RIP); it can be exploited by TTA attackers or arise from unintended queries by general TTA users. The risk posed by RIP is also highly realistic, as it does not require prior knowledge of model parameters or modification of testing samples. This simple requirement makes RIP the first black-box TTA attack algorithm, setting it apart from existing white-box attempts. We extensively benchmark the performance of the most recent continual TTA approaches when facing the RIP attack, providing insights on its success, and laying out potential roadmaps that could enhance the resilience of future continual TTA systems.

Title: Graph Community Augmentation with GMM-based Modeling in Latent Space

Authors: Shintaro Fukushima, Kenji Yamanishi
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01163
Pdf URL: https://arxiv.org/pdf/2412.01163
Copy Paste: [[2412.01163]] Graph Community Augmentation with GMM-based Modeling in Latent Space(https://arxiv.org/abs/2412.01163)
Keywords: generative
Abstract: This study addresses the issue of graph generation with generative models. In particular, we are concerned with the graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. Graph community augmentation means that the generated graphs have a new community. There is a chance of discovering an unseen but important structure of graphs with a new community, for example, in a social network such as a purchaser network. Graph community augmentation may also be helpful for generalization of data mining models in cases where it is difficult to collect enough real graph data. In fact, there are many ways to generate a new community in an existing graph. It is desirable to discover a new graph with a new community beyond the given graph while keeping the structure of the original graphs to some extent so that the generated graphs are realistic. To this end, we propose an algorithm called graph community augmentation (GCA). The key ideas of GCA are (i) to fit a Gaussian mixture model (GMM) to data points in the latent space into which the nodes in the original graph are embedded, and (ii) to add data points in the new cluster in the latent space for generating a new community based on the minimum description length (MDL) principle. We empirically demonstrate the effectiveness of GCA for generating graphs with a new community structure on synthetic and real datasets.

Title: HumekaFL: Automated Detection of Neonatal Asphyxia Using Federated Learning

Authors: Pamely Zantou, Blessed Guda, Bereket Retta, Gladys Inabeza, Carlee Joe-Wong, Assane Gueye
Subjects: cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2412.01167
Pdf URL: https://arxiv.org/pdf/2412.01167
Copy Paste: [[2412.01167]] HumekaFL: Automated Detection of Neonatal Asphyxia Using Federated Learning(https://arxiv.org/abs/2412.01167)
Keywords: security, privacy, federate
Abstract: Birth Asphyxia (BA) is a severe condition characterized by insufficient supply of oxygen to a newborn during delivery. BA is one of the primary causes of neonatal death in the world. Although there has been a decline in neonatal deaths over the past two decades, the developing world, particularly sub-Saharan Africa, continues to experience the highest under-five (<5) mortality rates. While evidence-based methods are commonly used to detect BA in African healthcare settings, they can be subject to physician errors or delays in diagnosis, preventing timely interventions. Centralized Machine Learning (ML) methods demonstrated good performance in early detection of BA but require sensitive health data to leave their premises before training, which does not guarantee privacy and security. Healthcare institutions are therefore reluctant to adopt such solutions in Africa. To address this challenge, we suggest a federated learning (FL)-based software architecture, a distributed learning method that prioritizes privacy and security by design. We have developed a user-friendly and cost-effective mobile application embedding the FL pipeline for early detection of BA. Our Federated SVM model outperformed centralized SVM pipelines and Neural Networks (NN)-based methods in the existing literature.
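Editor's sketch (not from the paper): the abstract does not say how the Federated SVM is aggregated. The Python below assumes a linear SVM trained locally by hinge-loss subgradient descent and combined with sample-weighted federated averaging, a common choice. The function names, learning rate, and regularization value are all hypothetical. Only weight vectors leave a client; the raw samples stay local.

```python
def local_svm_step(w, samples, lr=0.1, reg=0.01):
    """One pass of hinge-loss subgradient descent on a client's private data.

    samples is a list of (features, label) pairs with label in {+1, -1}.
    """
    w = list(w)
    for x, y in samples:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        for j in range(len(w)):
            # The hinge term contributes only when the margin is violated.
            grad = reg * w[j] - (y * x[j] if margin < 1 else 0.0)
            w[j] -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def federated_round(global_w, client_datasets):
    """One round: each client trains locally, the server averages the results."""
    updates = [local_svm_step(global_w, data) for data in client_datasets]
    sizes = [len(data) for data in client_datasets]
    return federated_average(updates, sizes)
```

A caller would start from a zero weight vector and repeat `federated_round` until the weights stabilize.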

Title: Rectified Flow For Structure Based Drug Design

Authors: Daiheng Zhang, Chengyue Gong, Qiang Liu
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2412.01174
Pdf URL: https://arxiv.org/pdf/2412.01174
Copy Paste: [[2412.01174]] Rectified Flow For Structure Based Drug Design(https://arxiv.org/abs/2412.01174)
Keywords: diffusion, generative
Abstract: Deep generative models have achieved tremendous success in structure-based drug design in recent years, especially for generating 3D ligand molecules that bind to a specific protein pocket. Notably, diffusion models have transformed ligand generation by providing exceptional quality and creativity. However, traditional diffusion models are restricted by their conventional learning objectives, which limit their broader applicability. In this work, we propose a new framework, FlowSBDD, based on the rectified flow model, which allows us to flexibly incorporate additional losses to optimize specific targets and to introduce additional conditions, either as an extra input or by replacing the initial Gaussian distribution. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance on generating high-affinity molecules while maintaining proper molecular properties without specifically designing binding sites, with up to -8.50 Avg. Vina Dock score and 75.0% Diversity.
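Editor's sketch (not from the paper): the generic rectified flow recipe the abstract builds on can be written in a few lines of Python. A sample x0 from the initial distribution is joined to a data point x1 by a straight line, a model is trained to predict the constant velocity x1 - x0 along that line, and sampling integrates the learned velocity from t = 0 to t = 1. The function names and the plain-list vectors are hypothetical. FlowSBDD's additional losses and conditioning are not shown.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field: constant along the line."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Squared error between the predicted velocity at x_t and the target."""
    xt = interpolate(x0, x1, t)
    pred = predict_velocity(xt, t)
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the training target does not depend on t, a well-fit model yields nearly straight trajectories, which is why a few Euler steps can suffice at sampling time.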

Title: Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video

Authors: Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01179
Pdf URL: https://arxiv.org/pdf/2412.01179
Copy Paste: [[2412.01179]] Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video(https://arxiv.org/abs/2412.01179)
Keywords: transformer
Abstract: Human Mesh Reconstruction (HMR) from monocular video plays an important role in human-robot interaction and collaboration. However, existing video-based human mesh reconstruction methods face a trade-off between accurate reconstruction and smooth motion. These methods design networks based on either RNNs or attention mechanisms to extract local temporal correlations or global temporal dependencies, but the lack of complementary long-term information and local details limits their performance. To address this problem, we propose a Dual-branch Graph Transformer network for 3D human mesh Reconstruction from video, named DGTR. DGTR employs a dual-branch network including a Global Motion Attention (GMA) branch and a Local Details Refine (LDR) branch to extract long-term dependencies and local crucial information in parallel, helping model global human motion and local human details (e.g., local motion, tiny movement). Specifically, GMA utilizes a global transformer to model long-term human motion. LDR combines modulated graph convolutional networks and the transformer framework to aggregate local information in adjacent frames and extract crucial information of human details. Experiments demonstrate that our DGTR outperforms state-of-the-art video-based methods in reconstruction accuracy and maintains competitive motion smoothness. Moreover, DGTR utilizes fewer parameters and FLOPs, which validates the effectiveness and efficiency of the proposed DGTR. Code is publicly available at this https URL.

Title: MeasureNet: Measurement Based Celiac Disease Identification

Authors: Aayush Kumar Tyagi, Vaibhav Mishra, Ashok Tiwari, Lalita Mehra, Prasenjit Das, Govind Makharia, Prathosh AP, Mausam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01182
Pdf URL: https://arxiv.org/pdf/2412.01182
Copy Paste: [[2412.01182]] MeasureNet: Measurement Based Celiac Disease Identification(https://arxiv.org/abs/2412.01182)
Keywords: robust, generative, segmentation
Abstract: Celiac disease is an autoimmune disorder triggered by the consumption of gluten. It causes damage to the villi, the finger-like projections in the small intestine that are responsible for nutrient absorption. Additionally, the crypts, which form the base of the villi, are also affected, impairing the regenerative process. The deterioration in villi length, computed as the villi-to-crypt length ratio, indicates the severity of celiac disease. However, manual measurement of villi-crypt length can be both time-consuming and susceptible to inter-observer variability, leading to inconsistencies in diagnosis. While some methods can perform measurement as a post-hoc process, they are prone to errors in the initial stages. This gap underscores the need for pathologically driven solutions that enhance measurement accuracy and reduce human error in celiac disease assessments. Our proposed method, MeasureNet, is a pathologically driven polyline detection framework incorporating polyline localization and object-driven losses specifically designed for measurement tasks. Furthermore, we leverage a segmentation model to provide auxiliary guidance about crypt location when crypts are partially visible. To ensure that the model is not overdependent on the segmentation mask, we enhance model robustness through a mask feature mixup technique. Additionally, we introduce a novel dataset for grading celiac disease, consisting of 750 annotated duodenum biopsy images. MeasureNet achieves an 82.66% classification accuracy for binary classification and 81% accuracy for multi-class grading of celiac disease. Code: this https URL
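Editor's sketch (not from the paper): the villi-to-crypt length ratio mentioned in the abstract can be computed from two detected polylines as below. The sketch assumes each polyline is a list of (x, y) vertices in the same pixel units; the function names are hypothetical, and no grading thresholds are implied.

```python
import math


def polyline_length(points):
    """Sum of Euclidean segment lengths along an ordered list of vertices."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def villus_crypt_ratio(villus, crypt):
    """Ratio of villus polyline length to crypt polyline length."""
    crypt_len = polyline_length(crypt)
    if crypt_len == 0:
        raise ValueError("crypt polyline has zero length")
    return polyline_length(villus) / crypt_len
```

Since both lengths share the same units, the ratio is independent of image scale, so no pixel-to-micron calibration is needed for this step.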

Title: SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

Authors: Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01186
Pdf URL: https://arxiv.org/pdf/2412.01186
Copy Paste: [[2412.01186]] SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages(https://arxiv.org/abs/2412.01186)
Keywords: robust, large language model
Abstract: In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian Languages (SEA). SailCompass encompasses three main SEA languages, eight primary tasks including 14 datasets covering three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibrations to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) A balanced language distribution is important for developing better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public.

Title: MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry

Authors: Kurukulasooriya Fernando, Gianluca Demartini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01189
Pdf URL: https://arxiv.org/pdf/2412.01189
Copy Paste: [[2412.01189]] MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry(https://arxiv.org/abs/2412.01189)
Keywords: generative, large language model
Abstract: Recent advancements in generative LLMs (Large Language Models) have exhibited human-like language capabilities but have shown a lack of domain-specific understanding. Therefore, the research community has started the development of domain-specific LLMs for many domains. In this work we focus on discussing how to build mining domain-specific LLMs, as the global mining industry contributes significantly to the worldwide economy. We report on MiningGPT, a mining domain-specific instruction-following 7B parameter LLM which showed a 14% higher mining domain knowledge test score compared to its parent model, Mistral 7B instruct.

Title: TinyFusion: Diffusion Transformers Learned Shallow

Authors: Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01199
Pdf URL: https://arxiv.org/pdf/2412.01199
Copy Paste: [[2412.01199]] TinyFusion: Diffusion Transformers Learned Shallow(https://arxiv.org/abs/2412.01199)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at this https URL.

Title: Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Authors: Wenxin Su, Song Tang, Xiaofeng Liu, Xiaojing Yi, Mao Ye, Chunxiao Zu, Jiahao Li, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01203
Pdf URL: https://arxiv.org/pdf/2412.01203
Copy Paste: [[2412.01203]] Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data(https://arxiv.org/abs/2412.01203)
Keywords: privacy, attack, robust, generative
Abstract: Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way--learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as the pseudo-perturbation label because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling the identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with a small batch size.

Title: Siamese Machine Unlearning with Knowledge Vaporization and Concentration

Authors: Songjie Xie, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.01207
Pdf URL: https://arxiv.org/pdf/2412.01207
Copy Paste: [[2412.01207]] Siamese Machine Unlearning with Knowledge Vaporization and Concentration(https://arxiv.org/abs/2412.01207)
Keywords: attack, membership infer
Abstract: In response to the practical demands of the "right to be forgotten" and the removal of undesired data, machine unlearning emerges as an essential technique to remove the learned knowledge of a fraction of data points from trained models. However, existing methods suffer from limitations such as insufficient methodological support, high computational complexity, and significant memory demands. In this work, we propose the concepts of knowledge vaporization and concentration to selectively erase learned knowledge from specific data points while maintaining representations for the remaining data. Utilizing Siamese networks, we exemplify the proposed concepts and develop an efficient method for machine unlearning. Our proposed Siamese unlearning method requires neither additional memory overhead nor full access to the remaining dataset. Extensive experiments conducted across multiple unlearning scenarios showcase the superiority of Siamese unlearning over baseline methods, illustrating its ability to effectively remove knowledge from forgetting data, enhance model utility on remaining data, and reduce susceptibility to membership inference attacks.

Title: PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

Authors: Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01223
Pdf URL: https://arxiv.org/pdf/2412.01223
Copy Paste: [[2412.01223]] PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control(https://arxiv.org/abs/2412.01223)
Keywords: diffusion
Abstract: Recently, diffusion models have exhibited superior performance in the area of image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically encounter problems in terms of semantic consistency between images and text, and the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that highly aligns with the user input prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesign the MASK generation algorithm in the training and testing datasets to simulate the user's habit of applying MASK, and introduce a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.

Title: Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Authors: Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Huchuan Lu, Jinsong Ouyang, Georges El Fakhri, Xiaofeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01240
Pdf URL: https://arxiv.org/pdf/2412.01240
Copy Paste: [[2412.01240]] Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes(https://arxiv.org/abs/2412.01240)
Keywords: robust, segmentation
Abstract: As a foundational model, SAM has significantly influenced multiple fields within computer vision, and its upgraded version, SAM 2, enhances capabilities in video segmentation, poised to make a substantial impact once again. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, product defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough quantitative evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks. This work aims to provide valuable insights to guide future research in both context-independent and context-dependent concept segmentation, potentially informing the development of the next version, SAM 3.

Title: Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

Authors: Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, Guo-Jun Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01243
Pdf URL: https://arxiv.org/pdf/2412.01243
Copy Paste: [[2412.01243]] Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation(https://arxiv.org/abs/2412.01243)
Keywords: diffusion
Abstract: Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning, aiming to maximize a reward that discounts the final image quality by the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
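The reward described above, final image quality discounted by the number of denoising steps, can be sketched as follows. The geometric form, the function name, and the default discount factor are assumptions made for illustration; the abstract does not give the exact formula or hyperparameters.

```python
def discounted_reward(image_quality, num_steps, gamma=0.97):
    """Final-image quality discounted geometrically by the number of denoising steps.

    gamma is an assumed discount factor in (0, 1], not a value taken from the paper.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return (gamma ** num_steps) * image_quality

# With equal image quality, a trajectory with fewer steps earns the larger reward.
short_run = discounted_reward(image_quality=0.80, num_steps=15)
long_run = discounted_reward(image_quality=0.80, num_steps=30)
```

Under a reward of this shape, extra denoising steps pay off only if they raise image quality by more than the discount takes away, which is the trade-off the scheduler is trained to make.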

Title: Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization

Authors: Lingyun Zhang, Yu Xie, Yanwei Fu, Ping Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01244
Pdf URL: https://arxiv.org/pdf/2412.01244
Copy Paste: [[2412.01244]] Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization(https://arxiv.org/abs/2412.01244)
Keywords: diffusion
Abstract: As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacing in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.

Title: Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective

Authors: Jinouwen Zhang, Rongkun Xue, Yazhe Niu, Yun Chen, Jing Yang, Hongsheng Li, Yu Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01245
Pdf URL: https://arxiv.org/pdf/2412.01245
Copy Paste: [[2412.01245]] Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective(https://arxiv.org/abs/2412.01245)
Keywords: diffusion, generative
Abstract: Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.

Title: Multimodal Fusion Learning with Dual Attention for Medical Imaging

Authors: Joy Dhar, Nayyar Zaidi, Maryam Haghighat, Puneet Goyal, Sudipta Roy, Azadeh Alavi, Vikas Kumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01248
Pdf URL: https://arxiv.org/pdf/2412.01248
Copy Paste: [[2412.01248]] Multimodal Fusion Learning with Dual Attention for Medical Imaging(https://arxiv.org/abs/2412.01248)
Keywords: robust
Abstract: Multimodal fusion learning has shown significant promise in classifying various diseases such as skin cancer and brain tumors. However, existing methods face three key limitations. First, they often lack generalizability to other diagnosis tasks due to their focus on a particular disease. Second, they do not fully leverage multiple health records from diverse modalities to learn robust complementary information. Finally, they typically rely on a single attention mechanism, missing the benefits of multiple attention strategies within and across various modalities. To address these issues, this paper proposes a dual robust information fusion attention mechanism (DRIFA) that leverages two attention modules, i.e., the multi-branch fusion attention module and the multimodal information fusion attention module. DRIFA can be integrated with any deep neural network, forming a multimodal fusion learning framework denoted as DRIFA-Net. We show that the multi-branch fusion attention of DRIFA learns enhanced representations for each modality, such as dermoscopy, pap smear, MRI, and CT-scan, whereas the multimodal information fusion attention module learns more refined multimodal shared representations, improving the network's generalization across multiple tasks and enhancing overall performance. Additionally, to estimate the uncertainty of DRIFA-Net predictions, we have employed an ensemble Monte Carlo dropout strategy. Extensive experiments on five publicly available datasets with diverse modalities demonstrate that our approach consistently outperforms state-of-the-art methods. The code is available at this https URL.
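The idea behind Monte Carlo dropout for uncertainty estimation can be shown with a toy example. This is a sketch of plain Monte Carlo dropout on a hypothetical linear scorer, not DRIFA-Net or its ensemble variant: dropout stays active at inference, the same input is scored many times, and the spread of the outputs serves as the uncertainty estimate.

```python
import random
import statistics

def dropout_forward(features, weights, drop_prob, rng):
    """One stochastic forward pass of a linear scorer with inverted dropout on the inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:          # keep this input with probability `keep`
            total += (x / keep) * w      # rescale so the expected output is unchanged
    return total

def mc_dropout_predict(features, weights, drop_prob=0.2, passes=100, seed=0):
    """Return (mean, variance) of the outputs over repeated stochastic passes."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    outputs = [dropout_forward(features, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.fmean(outputs), statistics.pvariance(outputs)

# Toy input and weights, chosen only for illustration.
mean, variance = mc_dropout_predict([1.0, 2.0], [0.5, 0.25], drop_prob=0.5)
```

A larger variance across passes marks a prediction the model is less sure about; with `drop_prob=0.0` every pass is identical and the variance is zero.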

Title: Yi-Lightning Technical Report

Authors: 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01253
Pdf URL: https://arxiv.org/pdf/2412.01253
Copy Paste: [[2412.01253]] Yi-Lightning Technical Report(https://arxiv.org/abs/2412.01253)
Keywords: large language model, segmentation
Abstract: This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks' utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at this https URL.

Title: EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Authors: Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01254
Pdf URL: https://arxiv.org/pdf/2412.01254
Copy Paste: [[2412.01254]] EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation(https://arxiv.org/abs/2412.01254)
Keywords: diffusion
Abstract: This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike the conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation to filter out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.

Title: NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

Authors: Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01256
Pdf URL: https://arxiv.org/pdf/2412.01256
Copy Paste: [[2412.01256]] NLPrompt: Noise-Label Prompt Learning for Vision-Language Models(https://arxiv.org/abs/2412.01256)
Keywords: robust
Abstract: The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representation and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
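The split objective described above, cross-entropy on samples judged clean and MAE on samples judged noisy, can be sketched in a few lines. This is an illustrative sketch, not the NLPrompt implementation: the `is_clean` flag is simply given as input here, whereas in the paper it would come from the PromptOT partition, and all function names are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability of the labeled class; unbounded as probs[label] -> 0."""
    return -math.log(probs[label])

def mae(probs, label):
    """L1 distance between the predicted distribution and the one-hot target; at most 2."""
    return sum(abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs))

def mixed_loss(batch):
    """Average loss over (probs, label, is_clean) triples: CE if clean, MAE if noisy."""
    losses = [cross_entropy(p, y) if clean else mae(p, y) for p, y, clean in batch]
    return sum(losses) / len(losses)

# Toy batch: one sample flagged clean, one flagged noisy (flags are assumed inputs).
batch = [
    ([0.5, 0.5], 0, True),
    ([0.5, 0.5], 1, False),
]
loss = mixed_loss(batch)
```

Because MAE is bounded, a confidently mislabeled sample contributes at most 2 to the loss, while cross-entropy on the same sample can grow without limit. That boundedness is one common way to see why MAE tolerates label noise better.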

Title: Do Large Language Models with Reasoning and Acting Meet the Needs of Task-Oriented Dialogue?

Authors: Michelle Elizabeth, Morgan Veyret, Miguel Couceiro, Ondrej Dusek, Lina M. Rojas-Barahona
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2412.01262
Pdf URL: https://arxiv.org/pdf/2412.01262
Copy Paste: [[2412.01262]] Do Large Language Models with Reasoning and Acting Meet the Needs of Task-Oriented Dialogue?(https://arxiv.org/abs/2412.01262)
Keywords: large language model
Abstract: Large language models (LLMs) have gained immense popularity due to their impressive capabilities in unstructured conversations. However, they underperform compared to previous approaches in task-oriented dialogue (TOD), wherein reasoning and accessing external information are crucial. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs in performing TOD. We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs seem to underperform state-of-the-art approaches in simulation, human evaluation indicates a higher user satisfaction rate compared to handcrafted systems despite having a lower success rate.

Title: Towards Robust Interpretable Surrogates for Optimization

Authors: Marc Goerigk, Michael Hartisch, Sebastian Merten
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.01264
Pdf URL: https://arxiv.org/pdf/2412.01264
Copy Paste: [[2412.01264]] Towards Robust Interpretable Surrogates for Optimization(https://arxiv.org/abs/2412.01264)
Keywords: robust, interpretability
Abstract: An important factor in the practical implementation of optimization models is the acceptance by the intended users. This is influenced among other factors by the interpretability of the solution process. Decision rules that meet this requirement can be generated using the framework for inherently interpretable optimization models. In practice, there is often uncertainty about the parameters of an optimization problem. An established way to deal with this challenge is the concept of robust optimization. The goal of our work is to combine both concepts: to create decision trees as surrogates for the optimization process that are more robust to perturbations and still inherently interpretable. For this purpose we present suitable models based on different variants to model uncertainty, and solution methods. Furthermore, the applicability of heuristic methods to perform this task is evaluated. Both approaches are compared with the existing framework for inherently interpretable optimization models.

Title: Indexing Economic Fluctuation Narratives from Keiki Watchers Survey

Authors: Eriko Shigetsugu, Hiroki Sakaji, Itsuki Noda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01265
Pdf URL: https://arxiv.org/pdf/2412.01265
Copy Paste: [[2412.01265]] Indexing Economic Fluctuation Narratives from Keiki Watchers Survey(https://arxiv.org/abs/2412.01265)
Keywords: diffusion
Abstract: In this paper, we design indices of economic fluctuation narratives derived from economic surveys. Companies, governments, and investors rely on key metrics like GDP and industrial production indices to predict economic trends. However, they have yet to effectively leverage the wealth of information contained in economic text, such as causal relationships, in their economic forecasting. Therefore, we design indices of economic fluctuation from economic surveys by using our previously proposed narrative framework. The evaluation results show that the proposed indices had a stronger correlation with the cumulative lagging diffusion index than with other types of diffusion indices.

Title: EdgeOAR: Real-time Online Action Recognition On Edge Devices

Authors: Wei Luo, Deyu Zhang, Ying Tang, Fan Wu, Yaoxue Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01267
Pdf URL: https://arxiv.org/pdf/2412.01267
Copy Paste: [[2412.01267]] EdgeOAR: Real-time Online Action Recognition On Edge Devices(https://arxiv.org/abs/2412.01267)
Keywords: robust
Abstract: This paper addresses the challenges of Online Action Recognition (OAR), a framework that involves instantaneous analysis and classification of behaviors in video streams. OAR must operate under stringent latency constraints, making it an indispensable component for real-time feedback for edge computing. Existing methods, which typically rely on the processing of entire video clips, fall short in scenarios requiring immediate recognition. To address this, we present EdgeOAR, a novel framework designed specifically for OAR on edge devices. EdgeOAR includes the Early Exit-oriented Task-specific Feature Enhancement Module (TFEM), which comprises lightweight submodules to optimize features in both temporal and spatial dimensions. We design an iterative training method to enable TFEM to learn features from the beginning of the video. Additionally, EdgeOAR includes an Inverse Information Entropy (IIE) and Modality Consistency (MC)-driven fusion module to fuse features and make better exit decisions. This design overcomes the two main challenges: robust modeling of spatio-temporal action representations with limited initial frames in online video streams and balancing accuracy and efficiency on resource-constrained edge devices. Experiments show that on the UCF-101 dataset, EdgeOAR reduces latency by 99.23% and energy consumption by 99.28% compared to the state-of-the-art (SOTA) method, while achieving adequate accuracy on edge devices.

Title: Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

Authors: Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01268
Pdf URL: https://arxiv.org/pdf/2412.01268
Copy Paste: [[2412.01268]] Ponder & Press: Advancing Visual GUI Agent towards General Computer Control(https://arxiv.org/abs/2412.01268)
Keywords: large language model
Abstract: Most existing GUI agents depend on non-vision inputs like HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle with accurately localizing GUI elements -- a critical requirement for effective GUI automation -- due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By leveraging a purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments -- including web pages, desktop software, and mobile UIs -- demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. Refer to the project homepage at this https URL.

Title: MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Authors: Sen Xing, Muyan Zhong, Zeqiang Lai, Liangchen Li, Jiawen Liu, Yaohui Wang, Jifeng Dai, Wenhai Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01271
Pdf URL: https://arxiv.org/pdf/2412.01271
Copy Paste: [[2412.01271]] MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost(https://arxiv.org/abs/2412.01271)
Keywords: diffusion
Abstract: In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan, Multi-Language adapter, a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance. Achieving comparable generation capabilities in over 110 languages with CLIP similarity scores nearly matching those in English (38.61 for English vs. 37.61 for other languages); and (3) Broad applicability. Seamlessly integrating with compatible community tools like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.

Title: PASTA-4-PHT: A Pipeline for Automated Security and Technical Audits for the Personal Health Train

Authors: Sascha Welten, Karl Kindermann, Ahmet Polat, Martin Görz, Maximilian Jugl, Laurenz Neumann, Alexander Neumann, Johannes Lohmöller, Jan Pennekamp, Stefan Decker
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2412.01275
Pdf URL: https://arxiv.org/pdf/2412.01275
Copy Paste: [[2412.01275]] PASTA-4-PHT: A Pipeline for Automated Security and Technical Audits for the Personal Health Train(https://arxiv.org/abs/2412.01275)
Keywords: security, privacy, protect
Abstract: With the introduction of data protection regulations, the need for innovative privacy-preserving approaches to process and analyse sensitive data has become apparent. One approach is the Personal Health Train (PHT) that brings analysis code to the data and conducts the data processing at the data premises. However, despite its demonstrated success in various studies, the execution of external code in sensitive environments, such as hospitals, introduces new research challenges because the interactions of the code with sensitive data are often incomprehensible and lack transparency. These interactions raise concerns about potential effects on the data and increase the risk of data breaches. To address this issue, this work discusses a PHT-aligned security and audit pipeline inspired by DevSecOps principles. The automated pipeline incorporates multiple phases that detect vulnerabilities. To thoroughly study its versatility, we evaluate this pipeline in two ways. First, we deliberately introduce vulnerabilities into a PHT. Second, we apply our pipeline to five real-world PHTs, which have been utilised in real-world studies, to audit them for potential vulnerabilities. Our evaluation demonstrates that our designed pipeline successfully identifies potential vulnerabilities and can be applied to real-world studies. In compliance with the requirements of the GDPR for data management, documentation, and protection, our automated approach supports researchers in their data-intensive work and reduces manual overhead. It can be used as a decision-making tool to assess and document potential vulnerabilities in code for data processing. Ultimately, our work contributes to increased security and overall transparency of data processing activities within the PHT framework.

Title: MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

Authors: Shan Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01284
Pdf URL: https://arxiv.org/pdf/2412.01284
Copy Paste: [[2412.01284]] MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model(https://arxiv.org/abs/2412.01284)
Keywords: diffusion
Abstract: Text-to-image generation models have become transformative tools. However, diffusion-based vision language models still lack the ability to precisely control the shape, appearance, and positional placement of objects in generated images using text guidance alone. Global image editing models typically achieve global layout control by relying on additional masks or images as guidance, which often require model training. Although local object-editing models enable modification of object shapes, they do not provide control over the positional placement of these objects. To address these limitations, we propose the MFTF model, which enables precise control over object positioning without requiring additional masks or images. The MFTF model supports both single-object and multi-object positional control (such as translation, rotation, etc.) and allows for concurrent layout control and object semantic editing. This is achieved by controlling the denoising process of the diffusion model through parallel denoising. Attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from the self-attention layers to isolate objects. These queries are then modified according to layout control parameters and injected back into the self-attention layers of the target diffusion model to enable precise positional control.

Title: FedAH: Aggregated Head for Personalized Federated Learning

Authors: Pengzhan Zhou, Yuepeng He, Yijun Zhai, Kaixin Gao, Chao Chen, Zhida Qin, Chong Zhang, Songtao Guo
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2412.01295
Pdf URL: https://arxiv.org/pdf/2412.01295
Copy Paste: [[2412.01295]] FedAH: Aggregated Head for Personalized Federated Learning(https://arxiv.org/abs/2412.01295)
Keywords: privacy, federate
Abstract: Recently, Federated Learning (FL) has gained popularity for its privacy-preserving and collaborative learning capabilities. Personalized Federated Learning (PFL), building upon FL, aims to address the issue of statistical heterogeneity and achieve personalization. Personalized-head-based PFL is a common and effective PFL method that splits the model into a feature extractor and a head, where the feature extractor is collaboratively trained and shared, while the head is locally trained and not shared. However, retaining the head locally, although achieving personalization, prevents the model from learning global knowledge in the head, thus affecting the performance of the personalized model. To solve this problem, we propose a novel PFL method called Federated Learning with Aggregated Head (FedAH), which initializes the head with an Aggregated Head at each iteration. The key feature of FedAH is to perform element-level aggregation between the local model head and the global model head to introduce global information from the global model head. To evaluate the effectiveness of FedAH, we conduct extensive experiments on five benchmark datasets in the fields of computer vision and natural language processing. FedAH outperforms ten state-of-the-art FL methods in terms of test accuracy by 2.87%. Additionally, FedAH maintains its advantage even in scenarios where some clients drop out unexpectedly. Our code is open-accessed at this https URL.
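
The element-level aggregation described above can be illustrated with a minimal sketch. The abstract does not state the exact aggregation rule, so the per-element blend below, the `weights` vector, and the function name `aggregate_head` are assumptions made for illustration only, not the authors' implementation.

```python
def aggregate_head(local_head, global_head, weights):
    """Blend two flattened head parameter vectors element by element.

    Each weight w in [0, 1] is a hypothetical per-element mixing
    coefficient: w = 1 keeps the local value, w = 0 takes the global one.
    """
    if not (len(local_head) == len(global_head) == len(weights)):
        raise ValueError("all three vectors must have the same length")
    return [
        w * l + (1.0 - w) * g
        for l, g, w in zip(local_head, global_head, weights)
    ]


# Usage: initialise the head for the next round from both sources.
local = [1.0, 2.0]
shared = [3.0, 4.0]
print(aggregate_head(local, shared, [0.5, 0.5]))
```

With all weights at 1.0 this reduces to the purely local head of personalized-head-based PFL, which is the baseline the abstract contrasts against.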

Title: Cross-Modal Visual Relocalization in Prior LiDAR Maps Utilizing Intensity Textures

Authors: Qiyuan Shen, Hengwang Zhao, Weihao Yan, Chunxiang Wang, Tong Qin, Ming Yang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01299
Pdf URL: https://arxiv.org/pdf/2412.01299
Copy Paste: [[2412.01299]] Cross-Modal Visual Relocalization in Prior LiDAR Maps Utilizing Intensity Textures(https://arxiv.org/abs/2412.01299)
Keywords: robust
Abstract: Cross-modal localization has drawn increasing attention in recent years, while the visual relocalization in prior LiDAR maps is less studied. Related methods usually suffer from inconsistency between the 2D texture and 3D geometry, neglecting the intensity features in the LiDAR point cloud. In this paper, we propose a cross-modal visual relocalization system in prior LiDAR maps utilizing intensity textures, which consists of three main modules: map projection, coarse retrieval, and fine relocalization. In the map projection module, we construct the database of intensity channel map images leveraging the dense characteristic of panoramic projection. The coarse retrieval module retrieves the top-K most similar map images to the query image from the database, and retains the top-K' results by covisibility clustering. The fine relocalization module applies a two-stage 2D-3D association and a covisibility inlier selection method to obtain robust correspondences for 6DoF pose estimation. The experimental results on our self-collected datasets demonstrate the effectiveness in both place recognition and pose estimation tasks.

Title: Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Authors: Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01316
Pdf URL: https://arxiv.org/pdf/2412.01316
Copy Paste: [[2412.01316]] Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation(https://arxiv.org/abs/2412.01316)
Keywords: diffusion
Abstract: We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: this https URL.
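
To make the segmented cross-attention idea concrete, here is a minimal single-head sketch in plain Python. It assumes contiguous, roughly equal-length temporal segments and omits the query/key/value projections of a real DiT block; the function names and the segment-assignment rule are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Scaled dot-product attention of one query vector over token vectors."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens
    ]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]


def segment_index(t, num_frames, num_segments):
    """Map frame index t to one of num_segments contiguous temporal segments."""
    return min(t * num_segments // num_frames, num_segments - 1)


def segmented_cross_attention(frames, sub_captions):
    """Let each frame attend only to the sub-caption of its own segment."""
    n, s = len(frames), len(sub_captions)
    return [
        cross_attend(frame, sub_captions[segment_index(t, n, s)])
        for t, frame in enumerate(frames)
    ]
```

Because the sketch only changes which caption tokens each frame sees, it introduces no new weights, which is consistent with the abstract's statement that SCA requires no additional parameters.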

Title: Explainable fault and severity classification for rolling element bearings using Kolmogorov-Arnold networks

Authors: Spyros Rigas, Michalis Papachristou, Ioannis Sotiropoulos, Georgios Alexandridis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01322
Pdf URL: https://arxiv.org/pdf/2412.01322
Copy Paste: [[2412.01322]] Explainable fault and severity classification for rolling element bearings using Kolmogorov-Arnold networks(https://arxiv.org/abs/2412.01322)
Keywords: interpretability
Abstract: Rolling element bearings are critical components of rotating machinery, with their performance directly influencing the efficiency and reliability of industrial systems. At the same time, bearing faults are a leading cause of machinery failures, often resulting in costly downtime, reduced productivity, and, in extreme cases, catastrophic damage. This study presents a methodology that utilizes Kolmogorov-Arnold Networks to address these challenges through automatic feature selection, hyperparameter tuning and interpretable fault analysis within a unified framework. By training shallow network architectures and minimizing the number of selected features, the framework produces lightweight models that deliver explainable results through feature attribution and symbolic representations of their activation functions. Validated on two widely recognized datasets for bearing fault diagnosis, the framework achieved perfect F1-Scores for fault detection and high performance in fault and severity classification tasks, including 100% F1-Scores in most cases. Notably, it demonstrated adaptability by handling diverse fault types, such as imbalance and misalignment, within the same dataset. The symbolic representations enhanced model interpretability, while feature attribution offered insights into the optimal feature types or signals for each studied task. These results highlight the framework's potential for practical applications, such as real-time machinery monitoring, and for scientific research requiring efficient and explainable models.

Title: The "LLM World of Words" English free association norms generated by large language models

Authors: Katherine Abramski, Riccardo Improta, Giulio Rossetti, Massimo Stella
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01330
Pdf URL: https://arxiv.org/pdf/2412.01330
Copy Paste: [[2412.01330]] The "LLM World of Words" English free association norms generated by large language models(https://arxiv.org/abs/2412.01330)
Keywords: large language model
Abstract: Free associations have been extensively used in cognitive psychology and linguistics for studying how conceptual knowledge is organized. Recently, the potential of applying a similar approach for investigating the knowledge encoded in LLMs has emerged, specifically as a method for investigating LLM biases. However, the absence of large-scale LLM-generated free association norms that are comparable with human-generated norms is an obstacle to this new research direction. To address this limitation, we create a new dataset of LLM-generated free association norms modeled after the "Small World of Words" (SWOW) human-generated norms consisting of approximately 12,000 cue words. We prompt three LLMs, namely Mistral, Llama3, and Haiku, with the same cues as those in the SWOW norms to generate three novel comparable datasets, the "LLM World of Words" (LWOW). Using both SWOW and LWOW norms, we construct cognitive network models of semantic memory that represent the conceptual knowledge possessed by humans and LLMs. We demonstrate how these datasets can be used for investigating implicit biases in humans and LLMs, such as the harmful gender stereotypes that are prevalent both in society and LLM outputs.

Title: A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

Authors: Junwei Deng, Weijing Tang, Jiaqi W. Ma
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01335
Pdf URL: https://arxiv.org/pdf/2412.01335
Copy Paste: [[2412.01335]] A Versatile Influence Function for Data Attribution with Non-Decomposable Loss(https://arxiv.org/abs/2412.01335)
Keywords: robust
Abstract: The influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution -- quantifying how individual training data points affect a model's predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that can be decomposed into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of the influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to $10^3$ times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.

Title: Negative Token Merging: Image-based Adversarial Feature Guidance

Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01339
Pdf URL: https://arxiv.org/pdf/2412.01339
Copy Paste: [[2412.01339]] Negative Token Merging: Image-based Adversarial Feature Guidance(https://arxiv.org/abs/2412.01339)
Keywords: diffusion
Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement using just a few lines of code, incurs only marginally higher (<4%) inference times, and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is available at this https URL
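
The phrase "selectively pushing apart matching semantic features" can be read as the following rough sketch. The cosine-similarity matching, the `threshold` gate, the linear extrapolation with step `alpha`, and the function names are all assumptions chosen for illustration; the abstract does not specify these details, so this should not be taken as the NegToMe algorithm itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def push_apart(output_tokens, reference_tokens, alpha=0.1, threshold=0.5):
    """Move each output token away from its best-matching reference token.

    Only tokens whose best match reaches `threshold` are modified
    (the "selective" part); the rest are returned unchanged.
    """
    pushed = []
    for tok in output_tokens:
        best = max(reference_tokens, key=lambda ref: cosine(tok, ref))
        if cosine(tok, best) >= threshold:
            # Extrapolate along the direction pointing away from the match.
            tok = [x + alpha * (x - r) for x, r in zip(tok, best)]
        pushed.append(list(tok))
    return pushed
```

In a diffusion pipeline a step like this would presumably run on intermediate token features at each reverse-diffusion step, with the reference features coming either from a reference image or from other images in the batch.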

Title: MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Authors: Xiaomin Li, Xu Jia, Qinghe Wang, Haiwen Diao, Mengmeng Ge, Pengxiang Li, You He, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01343
Pdf URL: https://arxiv.org/pdf/2412.01343
Copy Paste: [[2412.01343]] MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models(https://arxiv.org/abs/2412.01343)
Keywords: diffusion, large language model
Abstract: Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt the appearance prior from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from single or multiple reference videos, performing favorably against existing methods in customized video generation.

Title: Integrative CAM: Adaptive Layer Fusion for Comprehensive Interpretation of CNNs

Authors: Aniket K. Singh, Debasis Chaudhuri, Manish P. Singh, Samiran Chattopadhyay
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01354
Pdf URL: https://arxiv.org/pdf/2412.01354
Copy Paste: [[2412.01354]] Integrative CAM: Adaptive Layer Fusion for Comprehensive Interpretation of CNNs(https://arxiv.org/abs/2412.01354)
Keywords: interpretability
Abstract: With the growing demand for interpretable deep learning models, this paper introduces Integrative CAM, an advanced Class Activation Mapping (CAM) technique aimed at providing a holistic view of feature importance across Convolutional Neural Networks (CNNs). Traditional gradient-based CAM methods, such as Grad-CAM and Grad-CAM++, primarily use final layer activations to highlight regions of interest, often neglecting critical features derived from intermediate layers. Integrative CAM addresses this limitation by fusing insights across all network layers, leveraging both gradient and activation scores to adaptively weight layer contributions, thus yielding a comprehensive interpretation of the model's internal representation. Our approach includes a novel bias term in the saliency map calculation, a factor frequently omitted in existing CAM techniques, but essential for capturing a more complete feature importance landscape, as modern CNNs rely on both weighted activations and biases to make predictions. Additionally, we generalize the alpha term from Grad-CAM++ to apply to any smooth function, expanding CAM applicability across a wider range of models. Through extensive experiments on diverse and complex datasets, Integrative CAM demonstrates superior fidelity in feature importance mapping, effectively enhancing interpretability for intricate fusion scenarios and complex decision-making tasks. By advancing interpretability methods to capture multi-layered model insights, Integrative CAM provides a valuable tool for fusion-driven applications, promoting the trustworthy and insightful deployment of deep learning models.

Title: Exploring the Robustness of AI-Driven Tools in Digital Forensics: A Preliminary Study

Authors: Silvia Lucia Sanna, Leonardo Regano, Davide Maiorca, Giorgio Giacinto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01363
Pdf URL: https://arxiv.org/pdf/2412.01363
Copy Paste: [[2412.01363]] Exploring the Robustness of AI-Driven Tools in Digital Forensics: A Preliminary Study(https://arxiv.org/abs/2412.01363)
Keywords: attack, robust, extraction
Abstract: Nowadays, many tools are used to facilitate forensic tasks such as data extraction and data analysis. In particular, some tools leverage Artificial Intelligence (AI) to automatically label examined data into specific categories (i.e., drugs, weapons, nudity). However, this raises a serious concern about the robustness of the employed AI algorithms against adversarial attacks. Indeed, some people may need to hide specific data from AI-based digital forensics tools, thus manipulating the content so that the AI system does not recognize the offensive/prohibited content and mark it as suspicious to the analyst. This could be seen as an anti-forensics attack scenario. For this reason, we analyzed two of the most important forensics tools employing AI for data classification: Magnet AI, used by Magnet Axiom, and Excire Photo AI, used by X-Ways Forensics. We made preliminary tests using about 200 images, and another 100 sent in 3 chats, about pornography and teenage nudity, drugs and weapons to understand how the tools label them. Moreover, we loaded some deepfake images (images generated by AI forging real ones) of some actors to understand if they would be classified in the same category as the original images. From our preliminary study, we saw that the AI algorithm is not robust enough, as we expected since these topics are still open research problems. For example, some sexual images were not categorized as nudity, and some deepfakes were categorized as the same real person, while the human eye can see the clear nudity image or catch the difference between the deepfakes. Building on these results and other state-of-the-art works, we provide some suggestions for improving how digital forensics analysis tools leverage AI and their robustness against adversarial attacks or scenarios different from those seen in training.

Title: Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability

Authors: Wen-Dong Jiang, Chih-Yung Chang, Show-Jane Yen, Diptendu Sinha Roy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01365
Pdf URL: https://arxiv.org/pdf/2412.01365
Copy Paste: [[2412.01365]] Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability(https://arxiv.org/abs/2412.01365)
Keywords: interpretability
Abstract: Deep learning has achieved remarkable success in processing and managing unstructured data. However, its "black box" nature imposes significant limitations, particularly in sensitive application domains. While existing interpretable machine learning methods address some of these issues, they often fail to adequately consider feature correlations and provide insufficient evaluation of model decision paths. To overcome these challenges, this paper introduces Real Explainer (RealExp), an interpretability computation method that decouples the Shapley Value into individual feature importance and feature correlation importance. By incorporating feature similarity computations, RealExp enhances interpretability by precisely quantifying both individual feature contributions and their interactions, leading to more reliable and nuanced explanations. Additionally, this paper proposes a novel interpretability evaluation criterion focused on elucidating the decision paths of deep learning models, going beyond traditional accuracy-based metrics. Experimental validations on two unstructured data tasks -- image classification and text sentiment analysis -- demonstrate that RealExp significantly outperforms existing methods in interpretability. Case studies further illustrate its practical value: in image classification, RealExp aids in selecting suitable pre-trained models for specific tasks from an interpretability perspective; in text classification, it enables the optimization of models and approximates the performance of a fine-tuned GPT-Ada model using traditional bag-of-words approaches.
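
For readers unfamiliar with the quantity RealExp builds on, the sketch below computes exact Shapley values for a tiny toy game. It shows only the standard Shapley value, not RealExp's decoupling into individual and correlation importance; the toy value function and all names are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a float. The cost grows
    factorially, so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def toy_value(coalition):
    """Hypothetical model output: two main effects plus one interaction."""
    v = 0.0
    if "a" in coalition:
        v += 1.0
    if "b" in coalition:
        v += 2.0
    if "a" in coalition and "b" in coalition:
        v += 4.0  # interaction term, present only when both features are
    return v


print(shapley_values(["a", "b"], toy_value))
```

In this toy game the interaction term of 4.0 is split evenly between the two features, so each attribution mixes an individual effect with a shared one. That mixing is the kind of entanglement that motivates separating individual importance from correlation importance.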

Title: Behavior Backdoor for Deep Learning Models

Authors: Jiakai Wang, Pengfei Zhang, Renshuai Tao, Jian Yang, Hao Liu, Xianglong Liu, Yunchao Wei, Yao Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01369
Pdf URL: https://arxiv.org/pdf/2412.01369
Copy Paste: [[2412.01369]] Behavior Backdoor for Deep Learning Models(https://arxiv.org/abs/2412.01369)
Keywords: security, attack
Abstract: The various post-processing methods for deep-learning-based models, such as quantification, pruning, and fine-tuning, play an increasingly important role in artificial intelligence technology, with pre-trained large models as one of the main development directions. However, this popular series of post-processing behaviors targeting pre-trained deep models has become a breeding ground for new adversarial security issues. In this study, we take the first step towards the "behavioral backdoor" attack, which is defined as a behavior-triggered backdoor model training procedure, to reveal a new paradigm of backdoor attacks. In practice, we propose the first pipeline for implementing a behavior backdoor, i.e., the Quantification Backdoor (QB) attack, which exploits the model quantification method as the set trigger. Specifically, to adapt the optimization goal of the behavior backdoor, we introduce a behavior-driven backdoor object optimizing method with a bi-target behavior backdoor training loss, so that we can guide the optimization direction of the poisoned model. To update the parameters across multiple models, we adopt address-shared backdoor model training, so that the gradient information can be utilized for multi-model collaborative optimization. Extensive experiments have been conducted on different models, datasets, and tasks, demonstrating the effectiveness of this novel backdoor attack and its potential application threats.

Title: Understanding the World's Museums through Vision-Language Reasoning

Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01370
Pdf URL: https://arxiv.org/pdf/2412.01370
Copy Paste: [[2412.01370]] Understanding the World's Museums through Vision-Language Reasoning(https://arxiv.org/abs/2412.01370)
Keywords: large language model
Abstract: Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights on the complex and fine-grained understanding of museum exhibits. In particular, we show that some questions whose answers can often be derived directly from visual features are well answered by both types of models. On the other hand, questions that require the grounding of the visual features in repositories of human knowledge are better answered by the large vision-language models, thus demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: this https URL

Title: An overview of diffusion models for generative artificial intelligence

Authors: Davide Gallon, Arnulf Jentzen, Philippe von Wurstemberger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01371
Pdf URL: https://arxiv.org/pdf/2412.01371
Copy Paste: [[2412.01371]] An overview of diffusion models for generative artificial intelligence(https://arxiv.org/abs/2412.01371)
Keywords: diffusion, generative
Abstract: This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.
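
As a small illustration of the DDPM framework the article covers, the sketch below implements the standard closed-form forward (noising) step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with eps ~ N(0, I), where abar_t is the cumulative product of (1 - beta_s). The linear beta schedule and its endpoints are a commonly used choice assumed here for illustration, and the function names are hypothetical rather than taken from the article.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0); t is a 0-based index into abar."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


# Usage: noise a 3-dimensional data point at an early and a late step.
abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
early = q_sample(x0, 0, abar, rng)
late = q_sample(x0, 999, abar, rng)
```

Since abar_t shrinks toward zero as t grows, late samples are close to pure Gaussian noise, which is what lets generation start from noise and run the learned reverse process.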

Title: Hierarchical VAE with a Diffusion-based VampPrior

Authors: Anna Kuzina, Jakub M. Tomczak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01373
Pdf URL: https://arxiv.org/pdf/2412.01373
Copy Paste: [[2412.01373]] Hierarchical VAE with a Diffusion-based VampPrior(https://arxiv.org/abs/2412.01373)
Keywords: diffusion, generative
Abstract: Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce Hierarchical VAE with Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach allows us to achieve better performance compared to the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR10) and demonstrate improved training stability and latent space utilization.

Title: Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge

Authors: Yuhe Ji, Yilun Liu, Feiyu Yao, Minggui He, Shimin Tao, Xiaofeng Zhao, Su Chang, Xinhua Yang, Weibin Meng, Yuming Xie, Boxing Chen, Hao Yang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2412.01377
Pdf URL: https://arxiv.org/pdf/2412.01377
Copy Paste: [[2412.01377]] Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge(https://arxiv.org/abs/2412.01377)
Keywords: large language model
Abstract: The increasing complexity of computer systems necessitates innovative approaches to fault and error management, going beyond traditional manual log analysis. While existing solutions using large language models (LLMs) show promise, they are limited by a gap between natural and domain-specific languages, which restricts their effectiveness in real-world applications. Our approach addresses these limitations by integrating interpretable domain knowledge into open-source LLMs through continual pre-training (CPT), enhancing performance on log tasks while retaining natural language processing capabilities. We created a comprehensive dataset, NLPLog, with over 250,000 question-answer pairs to facilitate this integration. Our model, SuperLog, trained with this dataset, achieves the best performance across four log analysis tasks, surpassing the second-best model by an average of 12.01%. Our contributions include a novel CPT paradigm that significantly improves model performance, the development of SuperLog with state-of-the-art results, and the release of a large-scale dataset to support further research in this domain.

Title: Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

Authors: Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01380
Pdf URL: https://arxiv.org/pdf/2412.01380
Copy Paste: [[2412.01380]] Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking(https://arxiv.org/abs/2412.01380)
Keywords: large language model
Abstract: While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46% reduction in memory and 40% increase in throughput with < 0.1 loss in perplexity.
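
To make the idea of magnitude-based activation pruning concrete, here is a minimal sketch of top-k pruning followed by a matrix-vector product that reads only the weight columns belonging to surviving inputs, which is where the bandwidth saving would come from. This is not the DIP algorithm from the paper; the function names, the keep ratio, and the toy weights are assumptions for illustration only.

```python
def prune_by_magnitude(activations, keep_ratio):
    """Zero all but the largest-magnitude entries.

    Returns (pruned_vector, sorted_kept_indices). keep_ratio is a
    hypothetical knob: the fraction of entries that survive.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(activations) * keep_ratio))
    ranked = sorted(range(len(activations)),
                    key=lambda i: abs(activations[i]), reverse=True)
    keep = set(ranked[:k])
    pruned = [v if i in keep else 0.0 for i, v in enumerate(activations)]
    return pruned, sorted(keep)


def sparse_matvec(weight_columns, pruned, kept_indices):
    """Compute W @ x while touching only the columns of W for kept inputs.

    weight_columns[j] is column j of W, i.e. the weights read when input j
    is non-zero. Skipped columns never need to be loaded from memory.
    """
    out = [0.0] * len(weight_columns[0])
    for j in kept_indices:
        for row, w in enumerate(weight_columns[j]):
            out[row] += pruned[j] * w
    return out


acts = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0]
pruned, kept = prune_by_magnitude(acts, keep_ratio=0.5)
cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
y = sparse_matvec(cols, pruned, kept)
```

In this toy case only three of the six weight columns are read. Because the kept set depends on the current input, it cannot be known ahead of time without a predictor, which is the difficulty the abstract describes.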

Title: Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data

Authors: Ivan DeAndres-Tame, Ruben Tolosana, Pietro Melzi, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Luis F. Gomez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zhizhou Zhong, Yuge Huang, Yuxi Mi, Shouhong Ding, Shuigeng Zhou, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Zhihong Xiao, Evgeny Smirnov, Anton Pimenov, Aleksei Grigorev, Denis Timoshenko, Kaleb Mesfin Asfaw, Cheng Yaw Low, Hao Liu, Chuyi Wang, Qing Zuo, Zhixiang He, Hatef Otroshi Shahreza, Anjith George, Alexander Unnervik, Parsa Rahimi, Sébastien Marcel, Pedro C. Neto, Marco Huber, Jan Niklas Kolf, Naser Damer, Fadi Boutros, Jaime S. Cardoso, Ana F. Sequeira, Andrea Atzori, Gianni Fenu, Mirko Marras, Vitomir Štruc, Jiang Yu, Zhangjie Li, Jichun Li, Weisong Zhao, Zhen Lei, Xiangyu Zhu, Xiao-Yu Zhang, Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01383
Pdf URL: https://arxiv.org/pdf/2412.01383
Copy Paste: [[2412.01383]] Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data(https://arxiv.org/abs/2412.01383)
Keywords: privacy, generative
Abstract: Synthetic data is gaining increasing popularity for face recognition technologies, mainly due to the privacy concerns and challenges associated with obtaining real data, including diverse scenarios, quality, and demographic groups, among others. It also offers some advantages over real data, such as the large amount of data that can be generated or the ability to customize it to adapt to specific problem-solving needs. To effectively use such data, face recognition models should also be specifically designed to exploit synthetic data to its fullest potential. In order to promote the proposal of novel Generative AI methods and synthetic data, and investigate the application of synthetic data to better train face recognition systems, we introduce the 2nd FRCSyn-onGoing challenge, based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. This is an ongoing challenge that provides researchers with an accessible platform to benchmark i) the proposal of novel Generative AI methods and synthetic data, and ii) novel face recognition systems that are specifically proposed to take advantage of synthetic data. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition such as demographic bias, domain adaptation, and performance constraints in demanding situations, such as age disparities between training and testing, changes in the pose, or occlusions. Very interesting findings are obtained in this second edition, including a direct comparison with the first one, in which synthetic databases were restricted to DCFace and GANDiffFace.

Title: Machine Learning Analysis of Anomalous Diffusion

Authors: Wenjie Cai, Yi Hu, Xiang Qu, Hui Zhao, Gongyi Wang, Jing Li, Zihan Huang
Subjects: cs.LG, cond-mat.soft, physics.bio-ph, physics.data-an
Abstract URL: https://arxiv.org/abs/2412.01393
Pdf URL: https://arxiv.org/pdf/2412.01393
Copy Paste: [[2412.01393]] Machine Learning Analysis of Anomalous Diffusion(https://arxiv.org/abs/2412.01393)
Keywords: diffusion, segmentation
Abstract: The rapid advancements in machine learning have made its application to anomalous diffusion analysis both essential and inevitable. This review systematically introduces the integration of machine learning techniques for enhanced analysis of anomalous diffusion, focusing on two pivotal aspects: single trajectory characterization via machine learning and representation learning of anomalous diffusion. We extensively compare various machine learning methods, including both classical machine learning and deep learning, used for the inference of diffusion parameters and trajectory segmentation. Additionally, platforms such as the Anomalous Diffusion Challenge that serve as benchmarks for evaluating these methods are highlighted. On the other hand, we outline three primary strategies for representing anomalous diffusion: the combination of predefined features, the feature vector from the penultimate layer of a neural network, and the latent representation from an autoencoder, analyzing their applicability across various scenarios. This investigation paves the way for future research, offering valuable perspectives that can further enrich the study of anomalous diffusion and advance the application of artificial intelligence in statistical physics and biophysics.

Title: Holistic Understanding of 3D Scenes as Universal Scene Description

Authors: Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01398
Pdf URL: https://arxiv.org/pdf/2412.01398
Copy Paste: [[2412.01398]] Holistic Understanding of 3D Scenes as Universal Scene Description(https://arxiv.org/abs/2412.01398)
Keywords: segmentation
Abstract: 3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered by current works. In this work, we address this shortcoming and introduce (1) an expertly curated dataset in the Universal Scene Description (USD) format, featuring high-quality manual annotations for instance segmentation and articulation on 280 indoor scenes; (2) a learning-based model together with a novel baseline capable of predicting part segmentation along with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark serving to compare upcoming methods for the task at hand. Overall, our dataset provides 8 types of annotations: object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass annotations. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models. All data is provided in the USD format, allowing interoperability and easy integration with downstream tasks. We provide open access to our dataset, benchmark, and method's source code.

Title: ULSR-GS: Ultra Large-scale Surface Reconstruction Gaussian Splatting with Multi-View Geometric Consistency

Authors: Zhuoxiao Li, Shanliang Yao, Qizhong Gao, Angel F. Garcia-Fernandez, Yong Yue, Xiaohui Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01402
Pdf URL: https://arxiv.org/pdf/2412.01402
Copy Paste: [[2412.01402]] ULSR-GS: Ultra Large-scale Surface Reconstruction Gaussian Splatting with Multi-View Geometric Consistency(https://arxiv.org/abs/2412.01402)
Keywords: extraction
Abstract: While Gaussian Splatting (GS) demonstrates efficient and high-quality scene rendering and small area surface extraction ability, it falls short in handling large-scale aerial image surface extraction tasks. To overcome this, we present ULSR-GS, a framework dedicated to high-fidelity surface extraction in ultra-large-scale scenes, addressing the limitations of existing GS-based mesh extraction methods. Specifically, we propose a point-to-photo partitioning approach combined with a multi-view optimal view matching principle to select the best training images for each sub-region. Additionally, during training, ULSR-GS employs a densification strategy based on multi-view geometric consistency to enhance surface extraction details. Experimental results demonstrate that ULSR-GS outperforms other state-of-the-art GS-based works on large-scale aerial photogrammetry benchmark datasets, significantly improving surface extraction accuracy in complex urban environments. Project page: this https URL.

Title: MambaU-Lite: A Lightweight Model based on Mamba and Integrated Channel-Spatial Attention for Skin Lesion Segmentation

Authors: Thi-Nhu-Quynh Nguyen, Quang-Huy Ho, Duy-Thai Nguyen, Hoang-Minh-Quang Le, Van-Truong Pham, Thi-Thao Tran
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01405
Pdf URL: https://arxiv.org/pdf/2412.01405
Copy Paste: [[2412.01405]] MambaU-Lite: A Lightweight Model based on Mamba and Integrated Channel-Spatial Attention for Skin Lesion Segmentation(https://arxiv.org/abs/2412.01405)
Keywords: extraction, segmentation
Abstract: Early detection of skin abnormalities plays a crucial role in diagnosing and treating skin cancer. Segmentation of affected skin regions using AI-powered devices is relatively common and supports the diagnostic process. However, achieving high performance remains a significant challenge due to the need for high-resolution images and the often unclear boundaries of individual lesions. At the same time, medical devices require segmentation models to have a small memory footprint and low computational cost. Based on these requirements, we introduce a novel lightweight model called MambaU-Lite, which combines the strengths of Mamba and CNN architectures, featuring just over 400K parameters and a computational cost of more than 1G FLOPs. To enhance both global context and local feature extraction, we propose the P-Mamba block, a novel component that incorporates VSS blocks alongside multiple pooling layers, enabling the model to effectively learn multiscale features and enhance segmentation performance. We evaluate the model's performance on two skin datasets, ISIC2018 and PH2, yielding promising results. Our source code will be made publicly available at: this https URL.

Title: HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Authors: Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01407
Pdf URL: https://arxiv.org/pdf/2412.01407
Copy Paste: [[2412.01407]] HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving(https://arxiv.org/abs/2412.01407)
Keywords: generative
Abstract: Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple input modalities, usually cameras and LiDARs, which contain complementary information for generation. Existing generation methods ignore this crucial feature, so their generated results cover only separate 2D or 3D information. To fill the gap in 2D-3D multi-modal joint generation for autonomous driving, we propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projection from image space to BEV space. We then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate that our method leads to significant performance gains over SOTA methods in terms of generation metrics.

Title: CellSeg1: Robust Cell Segmentation with One Training Image

Authors: Peilin Zhou, Bo Du, Yongchao Xu
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.01410
Pdf URL: https://arxiv.org/pdf/2412.01410
Copy Paste: [[2412.01410]] CellSeg1: Robust Cell Segmentation with One Training Image(https://arxiv.org/abs/2412.01410)
Keywords: robust, segmentation
Abstract: Recent trends in cell segmentation have shifted towards universal models to handle diverse cell morphologies and imaging modalities. However, for continuously emerging cell types and imaging techniques, these models still require hundreds or thousands of annotated cells for fine-tuning. We introduce CellSeg1, a practical solution for segmenting cells of arbitrary morphology and modality with a few dozen cell annotations in 1 image. By adopting Low-Rank Adaptation of the Segment Anything Model (SAM), we achieve robust cell segmentation. Tested on 19 diverse cell datasets, CellSeg1 trained on 1 image achieved 0.81 average mAP at 0.5 IoU, performing comparably to existing models trained on over 500 images. It also demonstrated superior generalization in cross-dataset tests on TissueNet. We found that high-quality annotation of a few dozen densely packed cells of varied sizes is key to effective segmentation. CellSeg1 provides an efficient solution for cell segmentation with minimal annotation effort.
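
The "mAP at 0.5 IoU" figure quoted above rests on an intersection-over-union test between predicted and annotated cell masks. The sketch below shows only that matching step at a single threshold, using greedy one-to-one matching. It is not the paper's evaluation code; representing masks as sets of (row, col) pixels and the function names are assumptions made for illustration.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def precision_at_iou(predictions, ground_truth, threshold=0.5):
    """Fraction of predictions matched to a distinct ground-truth mask.

    predictions is assumed to be sorted by descending confidence. Each
    ground-truth mask can be matched at most once (greedy matching).
    """
    unmatched = list(range(len(ground_truth)))
    true_positives = 0
    for pred in predictions:
        best_idx, best_iou = None, threshold
        for idx in unmatched:
            iou = mask_iou(pred, ground_truth[idx])
            if iou >= best_iou:
                best_idx, best_iou = idx, iou
        if best_idx is not None:
            unmatched.remove(best_idx)
            true_positives += 1
    return true_positives / len(predictions) if predictions else 0.0


# Two annotated 2x2 cells and two predictions of differing quality.
gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6), (6, 5), (6, 6)}]
preds = [{(0, 0), (0, 1), (1, 0)}, {(5, 5)}]
score = precision_at_iou(preds, gt)
```

Here the first prediction overlaps its cell with IoU 0.75 and counts as a match, while the second reaches only 0.25 and does not. A full mAP computation would additionally sweep confidence thresholds and average the resulting precision values.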

Title: Impromptu Cybercrime Euphemism Detection

Authors: Xiang Li, Yucheng Zhou, Laiping Zhao, Jing Li, Fangming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01413
Pdf URL: https://arxiv.org/pdf/2412.01413
Copy Paste: [[2412.01413]] Impromptu Cybercrime Euphemism Detection(https://arxiv.org/abs/2412.01413)
Keywords: security
Abstract: Detecting euphemisms is essential for content security on various social media platforms, but existing methods designed for detecting euphemisms are ineffective on impromptu euphemisms. In this work, we make a first attempt at exploring impromptu euphemism detection and introduce the Impromptu Cybercrime Euphemisms Detection (ICED) dataset. Moreover, we propose a detection framework tailored to this problem, which employs context augmentation modeling and multi-round iterative training. Our detection framework mainly consists of a coarse-grained and a fine-grained classification model. The coarse-grained classification model removes most of the harmless content in the corpus to be detected. The fine-grained model, an impromptu euphemism detector, integrates context augmentation and multi-round iterative training to better predict the actual meaning of a masked token. In addition, we leverage ChatGPT to evaluate the model's capability. Experimental results demonstrate that our approach achieves a remarkable 76-fold improvement compared to the previous state-of-the-art euphemism detector.

Title: Network Simulation with Complex Cyber-attack Scenarios

Authors: Tiago Dias, João Vitorino, Eva Maia, Isabel Praça
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.01421
Pdf URL: https://arxiv.org/pdf/2412.01421
Copy Paste: [[2412.01421]] Network Simulation with Complex Cyber-attack Scenarios(https://arxiv.org/abs/2412.01421)
Keywords: attack
Abstract: Network Intrusion Detection (NID) systems can benefit from Machine Learning (ML) models to detect complex cyber-attacks. However, to train them with a great amount of high-quality data, it is necessary to perform reliable simulations of multiple interacting machines. This paper presents a network simulation solution for the creation of NID datasets with complex attack scenarios. This solution was integrated in the Airbus CyberRange platform to benefit from its simulation capabilities of generating benign and malicious traffic patterns that represent realistic cyber-attacks targeting a computer network. A realistic vulnerable network topology was configured in the CyberRange and three different attack scenarios were implemented: Man-in-the-Middle (MitM), Denial-of-Service (DoS), and Brute-Force (BF).

Title: MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection

Authors: Yonghao Dang, Liyuan Liu, Hui Kang, Ping Ye, Jianqin Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01422
Pdf URL: https://arxiv.org/pdf/2412.01422
Copy Paste: [[2412.01422]] MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection(https://arxiv.org/abs/2412.01422)
Keywords: transformer
Abstract: Real-time 2D keypoint detection plays an essential role in computer vision. Although CNN-based and Transformer-based methods have achieved breakthrough progress, they often fail to deliver both superior performance and real-time speed. This paper introduces MamKPD, the first efficient yet effective Mamba-based pose estimation framework for 2D keypoint detection. The conventional Mamba module exhibits limited information interaction between patches. To address this, we propose a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill the pose cues within each patch. Subsequently, by combining Mamba for global modeling across all patches, MamKPD effectively extracts instances' pose information. We conduct extensive experiments on human and animal pose estimation datasets to validate the effectiveness of MamKPD. Our MamKPD-L achieves 77.3% AP on the COCO dataset with 1492 FPS on an NVIDIA RTX 4090 GPU. Moreover, MamKPD achieves state-of-the-art results on the MPII dataset and competitive results on the AP-10K dataset while saving 85% of the parameters compared to ViTPose. Our project page is available at this https URL.

Title: FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Authors: Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01427
Pdf URL: https://arxiv.org/pdf/2412.01427
Copy Paste: [[2412.01427]] FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration(https://arxiv.org/abs/2412.01427)
Keywords: robust, diffusion
Abstract: Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundation models for image restoration. To advance this field, we contribute a million-scale dataset with two notable advantages over existing training data: real-world samples at a larger scale, and degradation types with higher diversity. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.

Title: CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01429
Pdf URL: https://arxiv.org/pdf/2412.01429
Copy Paste: [[2412.01429]] CPA: Camera-pose-awareness Diffusion Transformer for Video Generation(https://arxiv.org/abs/2412.01429)
Keywords: diffusion, transformer
Abstract: Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap in controlling camera pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.

Title: MVImgNet2.0: A Larger-scale Dataset of Multi-view Images

Authors: Xiaoguang Han, Yushuang Wu, Luyue Shi, Haolin Liu, Hongjie Liao, Lingteng Qiu, Weihao Yuan, Xiaodong Gu, Zilong Dong, Shuguang Cui
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.01430
Pdf URL: https://arxiv.org/pdf/2412.01430
Copy Paste: [[2412.01430]] MVImgNet2.0: A Larger-scale Dataset of Multi-view Images(https://arxiv.org/abs/2412.01430)
Keywords: segmentation
Abstract: MVImgNet is a large-scale dataset that contains multi-view images of ~220k real-world objects in 238 classes. As a counterpart of ImageNet, it introduces 3D visual signals via multi-view shooting, making a soft bridge between 2D and 3D vision. This paper constructs the MVImgNet2.0 dataset that expands MVImgNet into a total of ~520k objects and 515 categories, yielding a 3D dataset whose scale is more comparable to that of datasets in the 2D domain. In addition to the expanded dataset scale and category range, MVImgNet2.0 is of a higher quality than MVImgNet owing to four new features: (i) most shoots capture 360-degree views of the objects, which can support the learning of complete object reconstruction; (ii) the segmentation method is improved to produce foreground object masks of higher accuracy; (iii) a more powerful structure-from-motion method is adopted to derive the camera pose for each frame with a lower estimation error; (iv) higher-quality dense point clouds are reconstructed via advanced methods for objects captured in 360-degree views, which can serve downstream applications. Extensive experiments confirm the value of the proposed MVImgNet2.0 in boosting the performance of large 3D reconstruction models. MVImgNet2.0 will be public at this http URL, including multi-view images of all 520k objects, the reconstructed high-quality point clouds, and data annotation codes, hoping to inspire the broader vision community.

Title: DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model

Authors: Zhixiang Wang, Guangnan Ye, Xiaosen Wang, Siheng Chen, Zhibo Wang, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01440
Pdf URL: https://arxiv.org/pdf/2412.01440
Copy Paste: [[2412.01440]] DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model(https://arxiv.org/abs/2412.01440)
Keywords: attack, steal, diffusion, generative
Abstract: Physical adversarial patches printed on clothing can easily allow individuals to evade person detectors. However, most existing adversarial patch generation methods prioritize attack effectiveness over stealthiness, resulting in patches that are aesthetically unpleasing. Although existing methods using generative adversarial networks or diffusion models can produce more natural-looking patches, they often struggle to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these challenges, we propose a novel diffusion-based customizable patch generation framework termed DiffPatch, specifically tailored for creating naturalistic and customizable adversarial patches. Our approach enables users to utilize a reference image as the source, rather than starting from random noise, and incorporates masks to craft naturalistic patches of various shapes, not limited to squares. To prevent the original semantics from being lost during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Notably, while maintaining a natural appearance, our method achieves a comparable attack performance to state-of-the-art non-naturalistic patches when using similarly sized attacks. Using DiffPatch, we have created a physical adversarial T-shirt dataset, AdvPatch-1K, specifically targeting YOLOv5s. This dataset includes over a thousand images across diverse scenarios, validating the effectiveness of our attack in real-world environments. Moreover, it provides a valuable resource for future research.

Title: Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization

Authors: Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, Junxin Wang, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01455
Pdf URL: https://arxiv.org/pdf/2412.01455
Copy Paste: [[2412.01455]] Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization(https://arxiv.org/abs/2412.01455)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) exhibit exceptional performance across various downstream tasks. However, they encounter limitations due to slow inference speeds stemming from their extensive parameters. Early exit (EE) is an approach that aims to accelerate auto-regressive decoding. EE generates outputs from intermediate layers instead of using the whole model, which offers a promising solution to this challenge. However, the additional output layers and joint optimization used in conventional EE hinder the application of EE in LLMs. In this paper, we explore the possibility of EE in LLMs without additional output layers and joint optimization. Our findings indicate that EE is a natural capability within transformer-based models. While joint optimization does not give the model EE capability, it must be employed to address challenges by improving the accuracy of locating the optimal EE layer through gating functions. Additionally, our study reveals patterns in EE behavior from a sub-word perspective based on the LLaMA model and the potential for EE based on sub-layers.

Title: Phaseformer: Phase-based Attention Mechanism for Underwater Image Restoration and Beyond

Authors: MD Raqib Khan, Anshul Negi, Ashutosh Kulkarni, Shruti S. Phutke, Santosh Kumar Vipparthi, Subrahmanyam Murala
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01456
Pdf URL: https://arxiv.org/pdf/2412.01456
Copy Paste: [[2412.01456]] Phaseformer: Phase-based Attention Mechanism for Underwater Image Restoration and Beyond(https://arxiv.org/abs/2412.01456)
Keywords: transformer
Abstract: Quality degradation is observed in underwater images due to the effects of light refraction and absorption by water, leading to issues like color cast, haziness, and limited visibility. This degradation negatively affects the performance of autonomous underwater vehicles used in marine applications. To address these challenges, we propose a lightweight phase-based transformer network with 1.77M parameters for underwater image restoration (UIR). Our approach focuses on effectively extracting non-contaminated features using a phase-based self-attention mechanism. We also introduce an optimized phase attention block to restore structural information by propagating prominent attentive features from the input. We evaluate our method on both synthetic (UIEB, UFO-120) and real-world (UIEB, U45, UCCS, SQUID) underwater image datasets. Additionally, we demonstrate its effectiveness for low-light image enhancement using the LOL dataset. Through extensive ablation studies and comparative analysis, it is clear that the proposed approach outperforms existing state-of-the-art (SOTA) methods.

Title: A comprehensive review of datasets and deep learning techniques for vision in Unmanned Surface Vehicles

Authors: Linh Trinh, Siegfried Mercelis, Ali Anwar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01461
Pdf URL: https://arxiv.org/pdf/2412.01461
Copy Paste: [[2412.01461]] A comprehensive review of datasets and deep learning techniques for vision in Unmanned Surface Vehicles(https://arxiv.org/abs/2412.01461)
Keywords: segmentation
Abstract: Unmanned Surface Vehicles (USVs) have emerged as a major platform in maritime operations, capable of supporting a wide range of applications. USVs can help reduce labor costs, increase safety, save energy, and allow for difficult unmanned tasks in harsh maritime environments. With the rapid development of USVs, many vision tasks such as detection and segmentation become increasingly important. Datasets play an important role in encouraging and improving the research and development of reliable vision algorithms for USVs. In this regard, a large number of recent studies have focused on the release of vision datasets for USVs. Along with the development of datasets, a variety of deep learning techniques have also been studied, with a focus on USVs. However, there is a lack of a systematic review of recent studies in both datasets and vision techniques to provide a comprehensive picture of the current development of vision on USVs, including limitations and trends. In this study, we provide a comprehensive review of both USV datasets and deep learning techniques for vision tasks. Our review was conducted using a large number of vision datasets from USVs. We elaborate several challenges and potential opportunities for research and development in USV vision based on a thorough analysis of current datasets and deep learning techniques.

Title: Multi-Granularity Video Object Segmentation

Authors: Sangbeom Lim, Seongchan Kim, Seungjun An, Seokju Cho, Paul Hongsuck Seo, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01471
Pdf URL: https://arxiv.org/pdf/2412.01471
Copy Paste: [[2412.01471]] Multi-Granularity Video Object Segmentation(https://arxiv.org/abs/2412.01471)
Keywords: segmentation
Abstract: Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real-world scenarios. Thus, developing a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in the video scene is necessary. In this work, we aim to generate a multi-granularity video segmentation dataset that is annotated for both salient and non-salient masks. To achieve this, we propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset, which leads to the best performance among the existing video object segmentation methods and SAM-based video segmentation methods. Project page is available at this https URL.

Title: Improving Object Detection by Modifying Synthetic Data with Explainable AI

Authors: Nitish Mital, Simon Malzard, Richard Walters, Celso M. De Melo, Raghuveer Rao, Victoria Nockles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01477
Pdf URL: https://arxiv.org/pdf/2412.01477
Copy Paste: [[2412.01477]] Improving Object Detection by Modifying Synthetic Data with Explainable AI(https://arxiv.org/abs/2412.01477)
Keywords: robust
Abstract: In many computer vision domains the collection of sufficient real-world data is challenging and can severely impact model performance, particularly when running inference on samples that are unseen or underrepresented in training. Synthetically generated images provide a promising solution, but it remains unclear how to design synthetic data to optimally improve model performance, for example whether to introduce more realism or more abstraction in such datasets. Here we propose a novel conceptual approach to improve the performance of computer vision models trained on synthetic images, by using robust Explainable AI (XAI) techniques to guide the modification of 3D models used to generate these images. Importantly, this framework allows both modifications that increase and decrease realism in synthetic data, which can both improve model performance. We illustrate this concept using a real-world example where data are sparse; the detection of vehicles in infrared imagery. We fine-tune an initial YOLOv8 model on the ATR DSIAC infrared dataset and synthetic images generated from 3D mesh models in the Unity gaming engine, and then use XAI saliency maps to guide modification of our Unity models. We show that synthetic data can improve detection of vehicles in orientations unseen in training by 4.6% (to mAP50 scores of 94.6%). We further improve performance by an additional 1.5% (to 96.1%) through our new XAI-guided approach, which reduces misclassifications through both increasing and decreasing the realism of different parts of the synthetic data. These proof-of-concept results pave the way for fine, XAI-controlled curation of synthetic datasets through detailed feature modifications, tailored to improve object detection performance.

Title: Adversarial Attacks on Hyperbolic Networks

Authors: Max van Spengler, Jan Zahálka, Pascal Mettes
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01495
Pdf URL: https://arxiv.org/pdf/2412.01495
Copy Paste: [[2412.01495]] Adversarial Attacks on Hyperbolic Networks(https://arxiv.org/abs/2412.01495)
Keywords: attack, robust
Abstract: As hyperbolic deep learning grows in popularity, so does the need for adversarial robustness in the context of such a non-Euclidean geometry. To this end, this paper proposes hyperbolic alternatives to the commonly used FGM and PGD adversarial attacks. Through interpretable synthetic benchmarks and experiments on existing datasets, we show how the existing and newly proposed attacks differ. Moreover, we investigate the differences in adversarial robustness between Euclidean and fully hyperbolic networks. We find that these networks suffer from different types of vulnerabilities and that the newly proposed hyperbolic attacks cannot address these differences. Therefore, we conclude that the shifts in adversarial robustness are due to the models learning distinct patterns resulting from their different geometries.

Title: RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications

Authors: Nicholas Konz, Yuwen Chen, Hanxue Gu, Haoyu Dong, Yaqian Chen, Maciej A. Mazurowski
Subjects: cs.CV, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01496
Pdf URL: https://arxiv.org/pdf/2412.01496
Copy Paste: [[2412.01496]] RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications(https://arxiv.org/abs/2412.01496)
Keywords: interpretability, generative, segmentation
Abstract: Determining whether two sets of images belong to the same or different domain is a crucial task in modern medical image analysis and deep learning, where domain shift is a common problem that commonly results in decreased model performance. This determination is also important to evaluate the output quality of generative models, e.g., image-to-image translation models used to mitigate domain shift. Current metrics for this either rely on the (potentially biased) choice of some downstream task such as segmentation, or adopt task-independent perceptual metrics (e.g., FID) from natural imaging which insufficiently capture anatomical consistency and realism in medical images. We introduce a new perceptual metric tailored for medical images: Radiomic Feature Distance (RaD), which utilizes standardized, clinically meaningful and interpretable image features. We show that RaD is superior to other metrics for out-of-domain (OOD) detection in a variety of experiments. Furthermore, RaD outperforms previous perceptual metrics (FID, KID, etc.) for image-to-image translation by correlating more strongly with downstream task performance as well as anatomical consistency and realism, and shows similar utility for evaluating unconditional image generation. RaD also offers additional benefits such as interpretability, as well as stability and computational efficiency at low sample sizes. Our results are supported by broad experiments spanning four multi-domain medical image datasets, nine downstream tasks, six image translation models, and other factors, highlighting the broad potential of RaD for medical image analysis.

Title: Scaling Law for Language Models Training Considering Batch Size

Authors: Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01505
Pdf URL: https://arxiv.org/pdf/2412.01505
Copy Paste: [[2412.01505]] Scaling Law for Language Models Training Considering Batch Size(https://arxiv.org/abs/2412.01505)
Keywords: large language model
Abstract: Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, which provides guidance for optimizing LLM training strategies under specific resource constraints.

Title: Structured 3D Latents for Scalable and Versatile 3D Generation

Authors: Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01506
Pdf URL: https://arxiv.org/pdf/2412.01506
Copy Paste: [[2412.01506]] Structured 3D Latents for Scalable and Versatile 3D Generation(https://arxiv.org/abs/2412.01506)
Keywords: transformer
Abstract: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.

Title: HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition

Authors: Anton Nuzhdin, Alexander Nagaev, Alexander Sautin, Alexander Kapitanov, Karina Kvanchiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01508
Pdf URL: https://arxiv.org/pdf/2412.01508
Copy Paste: [[2412.01508]] HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition(https://arxiv.org/abs/2412.01508)
Keywords: diffusion
Abstract: This paper proposes the second version of the widespread Hand Gesture Recognition dataset HaGRID -- HaGRIDv2. We cover 15 new gestures with conversation and control functions, including two-handed ones. Building on the foundational concepts proposed by HaGRID's authors, we implemented the dynamic gesture recognition algorithm and further enhanced it by adding three new groups of manipulation gestures. The "no gesture" class was diversified by adding samples of natural hand movements, which allowed us to reduce false positives by a factor of 6. Combining extra samples with HaGRID, the resulting version outperforms the original in pre-training models for gesture-related tasks. Besides, we achieved the best generalization ability among gesture and hand detection datasets. In addition, the second version enhances the quality of the gestures generated by the diffusion model. HaGRIDv2, pre-trained models, and a dynamic gesture recognition algorithm are publicly available.

Title: ReHub: Linear Complexity Graph Transformers with Adaptive Hub-Spoke Reassignment

Authors: Tomer Borreda, Daniel Freedman, Or Litany
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.01519
Pdf URL: https://arxiv.org/pdf/2412.01519
Copy Paste: [[2412.01519]] ReHub: Linear Complexity Graph Transformers with Adaptive Hub-Spoke Reassignment(https://arxiv.org/abs/2412.01519)
Keywords: transformer
Abstract: We present ReHub, a novel graph transformer architecture that achieves linear complexity through an efficient reassignment technique between nodes and virtual nodes. Graph transformers have become increasingly important in graph learning for their ability to utilize long-range node communication explicitly, addressing limitations such as oversmoothing and oversquashing found in message-passing graph networks. However, their dense attention mechanism scales quadratically with the number of nodes, limiting their applicability to large-scale graphs. ReHub draws inspiration from the airline industry's hub-and-spoke model, where flights are assigned to optimize operational efficiency. In our approach, graph nodes (spokes) are dynamically reassigned to a fixed number of virtual nodes (hubs) at each model layer. Recent work, Neural Atoms (Li et al., 2024), has demonstrated impressive and consistent improvements over GNN baselines by utilizing such virtual nodes; their findings suggest that the number of hubs strongly influences performance. However, increasing the number of hubs typically raises complexity, requiring a trade-off to maintain linear complexity. Our key insight is that each node only needs to interact with a small subset of hubs to achieve linear complexity, even when the total number of hubs is large. To leverage all hubs without incurring additional computational costs, we propose a simple yet effective adaptive reassignment technique based on hub-hub similarity scores, eliminating the need for expensive node-hub computations. Our experiments on LRGB indicate a consistent improvement in results over the base method, Neural Atoms, while maintaining a linear complexity. Remarkably, our sparse model achieves performance on par with its non-sparse counterpart. Furthermore, ReHub outperforms competitive baselines and consistently ranks among top performers across various benchmarks.

Title: Traversing the Subspace of Adversarial Patches

Authors: Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01527
Pdf URL: https://arxiv.org/pdf/2412.01527
Copy Paste: [[2412.01527]] Traversing the Subspace of Adversarial Patches(https://arxiv.org/abs/2412.01527)
Keywords: attack
Abstract: Despite ongoing research on the topic of adversarial examples in deep learning for computer vision, some fundamentals of the nature of these attacks remain unclear. As the manifold hypothesis posits, high-dimensional data tends to be part of a low-dimensional manifold. To verify the thesis with adversarial patches, this paper provides an analysis of a set of adversarial patches and investigates the reconstruction abilities of three different dimensionality reduction methods. Quantitatively, the performance of reconstructed patches in an attack setting is measured and the impact of sampled patches from the latent space during adversarial training is investigated. The evaluation is performed on two publicly available datasets for person detection. The results indicate that more sophisticated dimensionality reduction methods offer no advantages over a simple principal component analysis.

Title: The Future of Document Verification: Leveraging Blockchain and Self-Sovereign Identity for Enhanced Security and Transparency

Authors: Swapna Krishnakumar Radha, Andrey Kuehlkamp, Jarek Nabrzyski
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.01531
Pdf URL: https://arxiv.org/pdf/2412.01531
Copy Paste: [[2412.01531]] The Future of Document Verification: Leveraging Blockchain and Self-Sovereign Identity for Enhanced Security and Transparency(https://arxiv.org/abs/2412.01531)
Keywords: secure, security, privacy
Abstract: Attestation of documents like legal papers, professional qualifications, medical records, and commercial documents is crucial in global transactions, ensuring their authenticity, integrity, and trustworthiness. Companies expanding operations internationally need to submit attested financial statements and incorporation documents to foreign governments or business partners to prove their businesses and operations' authenticity, legal validity, and regulatory compliance. Attestation also plays a critical role in education, overseas employment, and authentication of legal documents such as testaments and medical records. The traditional attestation process is plagued by several challenges, including time-consuming procedures, the circulation of counterfeit documents, and concerns over data privacy in the attested records. The COVID-19 pandemic brought into light another challenge: ensuring physical presence for attestation, which caused a significant delay in the attestation process. Traditional methods also lack real-time tracking capabilities for attesting entities and requesters. This paper aims to propose a new strategy using decentralized technologies such as blockchain and self-sovereign identity to overcome the identified hurdles and provide an efficient, secure, and user-friendly attestation ecosystem.

Title: The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs

Authors: Christina Kassab, Matías Mattamala, Sacha Morin, Martin Büchner, Abhinav Valada, Liam Paull, Maurice Fallon
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01539
Pdf URL: https://arxiv.org/pdf/2412.01539
Copy Paste: [[2412.01539]] The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs(https://arxiv.org/abs/2412.01539)
Keywords: segmentation
Abstract: 3D open-vocabulary scene graph methods are a promising map representation for embodied agents, however many current approaches are computationally expensive. In this paper, we reexamine the critical design choices established in previous works to optimize both efficiency and performance. We propose a general scene graph framework and conduct three studies that focus on image pre-processing, feature fusion, and feature selection. Our findings reveal that commonly used image pre-processing techniques provide minimal performance improvement while tripling computation (on a per object view basis). We also show that averaging feature labels across different views significantly degrades performance. We study alternative feature selection strategies that enhance performance without adding unnecessary computational costs. Based on our findings, we introduce a computationally balanced approach for 3D point cloud segmentation with per-object features. The approach matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.

Title: Effectiveness of L2 Regularization in Privacy-Preserving Machine Learning

Authors: Nikolaos Chandrinos (1), Iliana Loi (2), Panagiotis Zachos (2), Ioannis Symeonidis (1), Aristotelis Spiliotis (1), Maria Panou (1), Konstantinos Moustakas (2) ((1) Human Factors and Vehicle Technology, Hellenic Institute of Transport, Centre for Research and Technology Hellas, Thermi, Greece, (2) Wire Communications and Information Technology Laboratory, Dept. of Electrical and Computer Engineering, University of Patras, Patras, Greece)
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01541
Pdf URL: https://arxiv.org/pdf/2412.01541
Copy Paste: [[2412.01541]] Effectiveness of L2 Regularization in Privacy-Preserving Machine Learning(https://arxiv.org/abs/2412.01541)
Keywords: privacy, attack, membership infer
Abstract: Artificial intelligence, machine learning, and deep learning as a service have become the status quo for many industries, leading to the widespread deployment of models that handle sensitive data. Well-performing models, the industry seeks, usually rely on a large volume of training data. However, the use of such data raises serious privacy concerns due to the potential risks of leaks of highly sensitive information. One prominent threat is the Membership Inference Attack, where adversaries attempt to deduce whether a specific data point was used in a model's training process. An adversary's ability to determine an individual's presence represents a significant privacy threat, especially when related to a group of users sharing sensitive information. Hence, well-designed privacy-preserving machine learning solutions are critically needed in the industry. In this work, we compare the effectiveness of L2 regularization and differential privacy in mitigating Membership Inference Attack risks. Even though regularization techniques like L2 regularization are commonly employed to reduce overfitting, a condition that enhances the effectiveness of Membership Inference Attacks, their impact on mitigating these attacks has not been systematically explored.
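
For readers unfamiliar with the regularizer this abstract compares against differential privacy, the following minimal sketch shows how an L2 penalty enters gradient descent. It uses a toy logistic regression on synthetic data; the function names, the data, and the penalty strength `l2=0.5` are illustrative assumptions and are not taken from the paper. It only demonstrates that the penalty shrinks the weights, not that it mitigates membership inference.

```python
import math
import random


def train_logreg(xs, ys, l2=0.0, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression with an L2 penalty.

    Minimizes mean log-loss + (l2 / 2) * ||w||^2. The bias is not penalized.
    """
    n_features = len(xs[0])
    n = len(xs)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            for j in range(n_features):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        # Gradient of (l2 / 2) * ||w||^2 with respect to w is l2 * w.
        w = [wi - lr * (gw + l2 * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def l2_norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def make_toy_data(n=40, seed=0):
    """Two Gaussian blobs around (-1, -1) and (+1, +1); purely synthetic."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        center = 1.0 if label else -1.0
        xs.append([center + rng.gauss(0, 0.5), center + rng.gauss(0, 0.5)])
        ys.append(label)
    return xs, ys


xs, ys = make_toy_data()
w_plain, _ = train_logreg(xs, ys, l2=0.0)
w_reg, _ = train_logreg(xs, ys, l2=0.5)
print("weight norm without L2:", l2_norm(w_plain))
print("weight norm with L2:   ", l2_norm(w_reg))
```

Smaller weights generally mean a smoother decision function that fits individual training points less tightly, which is the overfitting link the abstract refers to.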

Title: Towards Type Agnostic Cyber Defense Agents

Authors: Erick Galinkin, Emmanouil Pountrourakis, Spiros Mancoridis
Subjects: cs.CR, cs.AI, cs.GT, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01542
Pdf URL: https://arxiv.org/pdf/2412.01542
Copy Paste: [[2412.01542]] Towards Type Agnostic Cyber Defense Agents(https://arxiv.org/abs/2412.01542)
Keywords: security, defense, attack
Abstract: With computing now ubiquitous across government, industry, and education, cybersecurity has become a critical component for every organization on the planet. Due to this ubiquity of computing, cyber threats have continued to grow year over year, leading to labor shortages and a skills gap in cybersecurity. As a result, many cybersecurity product vendors and security organizations have looked to artificial intelligence to shore up their defenses. This work considers how to characterize attackers and defenders in one approach to the automation of cyber defense -- the application of reinforcement learning. Specifically, we characterize the types of attackers and defenders in the sense of Bayesian games and, using reinforcement learning, derive empirical findings about how to best train agents that defend against multiple types of attackers.

Title: Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Authors: Erick Galinkin, Martin Sablotny
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01547
Pdf URL: https://arxiv.org/pdf/2412.01547
Copy Paste: [[2412.01547]] Improved Large Language Model Jailbreak Detection via Pretrained Embeddings(https://arxiv.org/abs/2412.01547)
Keywords: secure, security, privacy, attack, large language model
Abstract: The adoption of large language models (LLMs) in many applications, from customer service chat bots and software development assistants to more capable agentic systems necessitates research into how to secure these systems. Attacks like prompt injection and jailbreaking attempt to elicit responses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations using the model in their application. In order to counter abuse of LLMs for generating potentially harmful replies or taking undesirable actions, LLM owners must apply safeguards during training and integrate additional tools to block the LLM from generating text that abuses the model. Jailbreaking prompts play a vital role in convincing an LLM to generate potentially harmful content, making it important to identify jailbreaking attempts to block any further steps. In this work, we propose a novel approach to detect jailbreak prompts based on pairing text embeddings well-suited for retrieval with traditional machine learning classification algorithms. Our approach outperforms all publicly available methods from open source LLM security applications.

Title: SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model

Authors: Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01550
Pdf URL: https://arxiv.org/pdf/2412.01550
Copy Paste: [[2412.01550]] SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model(https://arxiv.org/abs/2412.01550)
Keywords: large language model, segmentation
Abstract: 3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region and are unable to handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from cumbersome user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, to unlock the 3D multi-modal large language model with additional affordance segmentation abilities, which ensures reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to endow 3D dense prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization with sequential reasoning abilities.

Title: Optimizing Domain-Specific Image Retrieval: A Benchmark of FAISS and Annoy with Fine-Tuned Features

Authors: MD Shaikh Rahman, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01555
Pdf URL: https://arxiv.org/pdf/2412.01555
Copy Paste: [[2412.01555]] Optimizing Domain-Specific Image Retrieval: A Benchmark of FAISS and Annoy with Fine-Tuned Features(https://arxiv.org/abs/2412.01555)
Keywords: extraction
Abstract: Approximate Nearest Neighbor search is one of the keys to high-scale data retrieval performance in many applications. The work is a bridge between feature extraction and ANN indexing through fine-tuning a ResNet50 model with various ANN methods: FAISS and Annoy. We evaluate the systems with respect to indexing time, memory usage, query time, precision, recall, F1-score, and Recall@5 on a custom image dataset. FAISS's Product Quantization can achieve a precision of 98.40% with low memory usage at 0.24 MB index size, and Annoy is the fastest, with average query times of 0.00015 seconds, at a slight cost to accuracy. These results reveal trade-offs among speed, accuracy, and memory efficiency and offer actionable insights into the optimization of feature-based image retrieval systems. This study will serve as a blueprint for constructing actual retrieval pipelines and be built on fine-tuned deep learning networks and associated ANN methods.
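
The retrieval metrics named in this abstract can be made concrete with a small exact-search baseline. This sketch deliberately uses brute-force search rather than FAISS or Annoy, so it calls neither library. The definition of Recall@k used here (a query counts as a hit if any of its top-k neighbours shares its label) is one common convention and an assumption, since the abstract does not state the paper's definition. All vectors and labels are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, index_vectors, k):
    """Exact (brute-force) k nearest neighbours; returns indices, closest first."""
    order = sorted(range(len(index_vectors)),
                   key=lambda i: euclidean(query, index_vectors[i]))
    return order[:k]


def recall_at_k(queries, query_labels, index_vectors, index_labels, k=5):
    """Fraction of queries with at least one same-label item in the top k."""
    hits = 0
    for q, label in zip(queries, query_labels):
        neighbours = top_k(q, index_vectors, k)
        if any(index_labels[i] == label for i in neighbours):
            hits += 1
    return hits / len(queries)


def precision_at_k(queries, query_labels, index_vectors, index_labels, k=5):
    """Mean fraction of the top-k results that share the query's label."""
    total = 0.0
    for q, label in zip(queries, query_labels):
        neighbours = top_k(q, index_vectors, k)
        total += sum(index_labels[i] == label for i in neighbours) / k
    return total / len(queries)


# Toy 2-D "features": three items per class, far apart.
index_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
index_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
queries = [[0.2, 0.1], [10.5, 10.2]]
query_labels = ["cat", "dog"]

print(recall_at_k(queries, query_labels, index_vectors, index_labels, k=5))
print(precision_at_k(queries, query_labels, index_vectors, index_labels, k=3))
print(precision_at_k(queries, query_labels, index_vectors, index_labels, k=5))
```

An approximate index would replace `top_k`; comparing its results against this exact baseline is a simple way to measure the accuracy an index gives up for speed or memory.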

Title: Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

Authors: Hao Tang, Zechao Li, Dong Zhang, Shengfeng He, Jinhui Tang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01556
Pdf URL: https://arxiv.org/pdf/2412.01556
Copy Paste: [[2412.01556]] Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection(https://arxiv.org/abs/2412.01556)
Keywords: robust
Abstract: RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities. Inspired by hierarchical human visual systems, we propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios.

Title: VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01558
Pdf URL: https://arxiv.org/pdf/2412.01558
Copy Paste: [[2412.01558]] VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval(https://arxiv.org/abs/2412.01558)
Keywords: transformer
Abstract: Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at this https URL .

Title: Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

Authors: Miroslav Purkrabek, Jiri Matas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01562
Pdf URL: https://arxiv.org/pdf/2412.01562
Copy Paste: [[2412.01562]] Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle(https://arxiv.org/abs/2412.01562)
Keywords: segmentation
Abstract: Human pose estimation methods work well on separated people but struggle with multi-body scenarios. Recent work has addressed this problem by conditioning pose estimation with detected bounding boxes or bottom-up-estimated poses. Unfortunately, all of these approaches overlooked segmentation masks and their connection to estimated keypoints. We condition the pose estimation model on segmentation masks instead of bounding boxes to improve instance separation. This improves top-down pose estimation in multi-body scenarios but does not fix detection errors. Consequently, we develop BBox-Mask-Pose (BMP), integrating detection, segmentation and pose estimation into a self-improving feedback loop. We adapt the detector and pose estimation model for conditioning by instance masks and use Segment Anything as a pose-to-mask model to close the circle. With only small models, BMP is superior to top-down methods on the OCHuman dataset and to detector-free methods on the COCO dataset, combining the best from both approaches and matching state-of-the-art performance in both settings. Code is available at this https URL.

Title: Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2412.01564
Pdf URL: https://arxiv.org/pdf/2412.01564
Copy Paste: [[2412.01564]] Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates(https://arxiv.org/abs/2412.01564)
Keywords: robust
Abstract: The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES has been well-established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, when our learned discrete representations are integrated into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok yields consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
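
As a rough illustration of the tokenization problem described in the abstract above, the sketch below converts a Cartesian offset to spherical coordinates and maps each component to a discrete bin. Uniform binning is a stand-in chosen for brevity, not the paper's learned VQ-VAE codebook, and the local-frame construction needed for SE(3) invariance is omitted. The function names, `r_max`, and `n_bins` are assumptions made for this example.

```python
import math

def to_spherical(x, y, z):
    """Convert a Cartesian offset to (r, theta, phi).

    theta is the polar angle in [0, pi]; phi is the azimuth in [-pi, pi].
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0
    # Clamp guards against rounding pushing z / r slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi

def quantize(value, low, high, n_bins):
    """Map a value in [low, high] to an integer bin index in [0, n_bins - 1]."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))

def tokenize_offset(x, y, z, r_max=4.0, n_bins=16):
    """Hypothetical tokenizer: one discrete token per spherical component."""
    r, theta, phi = to_spherical(x, y, z)
    return (
        quantize(r, 0.0, r_max, n_bins),
        quantize(theta, 0.0, math.pi, n_bins),
        quantize(phi, -math.pi, math.pi, n_bins),
    )
```

Calling `tokenize_offset(0.0, 0.0, 2.0)` returns three integer bin indices, which is the kind of discrete input a language model can consume; a learned codebook would replace the fixed bins.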

Title: Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art

Authors: Sebastian Peitz, Sedjro Salomon Hotegni
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.01566
Pdf URL: https://arxiv.org/pdf/2412.01566
Copy Paste: [[2412.01566]] Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art(https://arxiv.org/abs/2412.01566)
Keywords: generative
Abstract: Simultaneously considering multiple objectives in machine learning has been a popular approach for several decades, with various benefits for multi-task learning, the consideration of secondary goals such as sparsity, or multicriteria hyperparameter tuning. However - as multi-objective optimization is significantly more costly than single-objective optimization - the recent focus on deep learning architectures poses considerable additional challenges due to the very large number of parameters, strong nonlinearities and stochasticity. This survey covers recent advancements in the area of multi-objective deep learning. We introduce a taxonomy of existing methods - based on the type of training algorithm as well as the decision maker's needs - before listing recent advancements and successful applications. All three main learning paradigms (supervised learning, reinforcement learning and unsupervised learning) are covered, and we also address the recently very popular area of generative modeling.

Title: 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Authors: Ziyang Yan, Lei Li, Yihua Shao, Siyu Chen, Wuzong Kai, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01583
Pdf URL: https://arxiv.org/pdf/2412.01583
Copy Paste: [[2412.01583]] 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting(https://arxiv.org/abs/2412.01583)
Keywords: diffusion, generative, segmentation
Abstract: The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive multi-step, 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper real-time performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits based on input prompts. The proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacing, and deletion directly on Gaussians. Extensive experimental results show that 3DSceneEditor achieves superior editing precision and speed with respect to current SOTA 3D scene editing approaches, establishing a new benchmark for efficient and interactive 3D scene customization.

Title: FairML: A Julia Package for Fair Classification

Authors: Jan Pablo Burgard, João Vitor Pamplona
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.01585
Pdf URL: https://arxiv.org/pdf/2412.01585
Copy Paste: [[2412.01585]] FairML: A Julia Package for Fair Classification(https://arxiv.org/abs/2412.01585)
Keywords: fair
Abstract: In this paper, we propose FairML.jl, a Julia package providing a framework for fair classification in machine learning. In this framework, the fair learning process is divided into three stages. Each stage aims to reduce unfairness, such as disparate impact and disparate mistreatment, in the final prediction. For the preprocessing stage, we present a resampling method that addresses unfairness coming from data imbalances. The in-processing phase consists of a classification method. This can be either one coming from the this http URL package, or a user-defined one. For this phase, we incorporate fair ML methods that can handle unfairness to a certain degree through their optimization process. In the post-processing phase, we discuss the choice of the cut-off value for fair prediction. With simulations, we show the performance of the single phases and their combinations.

Title: Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation

Authors: Christian Witte, Jens Behley, Cyrill Stachniss, Marvin Raaijmakers
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01595
Pdf URL: https://arxiv.org/pdf/2412.01595
Copy Paste: [[2412.01595]] Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation(https://arxiv.org/abs/2412.01595)
Keywords: transformer, segmentation
Abstract: Spatial understanding of the semantics of the surroundings is a key capability needed by autonomous cars to enable safe driving decisions. Recently, purely vision-based solutions have gained increasing research interest. In particular, approaches extracting a bird's eye view (BEV) from multiple cameras have demonstrated great performance for spatial understanding. This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods. We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields. They are incorporated into the attention mechanism as a novel attribution term, serving as an alternative to learned positional encodings. Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation and exhibits superior generalization capabilities compared to implicitly learning the camera configuration.

Title: FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection

Authors: Brian K.S. Isaac-Medina, Mauricio Che, Yona F.A. Gaus, Samet Akcay, Toby P. Breckon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01596
Pdf URL: https://arxiv.org/pdf/2412.01596
Copy Paste: [[2412.01596]] FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection(https://arxiv.org/abs/2412.01596)
Keywords: robust
Abstract: Modern machine learning models that excel at computer vision tasks such as classification and object detection are often overconfident in their predictions for Out-of-Distribution (OOD) examples, resulting in unpredictable behaviour in open-set environments. Recent works have demonstrated that the free energy score is an effective measure of uncertainty for OOD detection given its close relationship to the data distribution. However, despite free energy-based methods representing a significant empirical advance in OOD detection, our theoretical analysis reveals previously unexplored and inherent vulnerabilities within the free energy score formulation such that in-distribution and OOD instances can have distinct feature representations yet identical free energy scores. This phenomenon occurs when the vector direction representing the feature space difference between the in-distribution and OOD sample lies within the null space of the last layer of a neural-based classifier. To mitigate these issues, we explore lower-dimensional feature spaces to reduce the null space footprint and introduce novel regularisation to maximize the least singular value of the final linear layer, hence enhancing inter-sample free energy separation. We refer to these techniques as Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection (FEVER-OOD). Our experiments show that FEVER-OOD techniques achieve state-of-the-art OOD detection on ImageNet-100, with an average OOD false positive rate (at 95% true positive rate) of 35.83% when used with the baseline Dream-OOD model.
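
The null-space vulnerability described in the abstract above can be reproduced with a toy example. The sketch below assumes the common form of the free energy score, the negative log-sum-exp of the logits with temperature 1, and a bias-free final linear layer. The weight matrix, feature values, and function names are illustrative and do not come from the paper.

```python
import math

def logits(weights, features):
    """Apply a bias-free linear layer: one logit per row of weights."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def free_energy(weights, features):
    """Free energy score: negative log-sum-exp of the logits (temperature 1)."""
    z = logits(weights, features)
    m = max(z)  # subtract the max for numerical stability
    return -(m + math.log(sum(math.exp(v - m) for v in z)))

# Toy 2-class layer acting on 3-dimensional features (illustrative values).
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
null_direction = [1.0, 1.0, -1.0]  # W maps this direction to the zero vector

h_in = [0.5, -1.0, 2.0]  # stand-in for an in-distribution feature
# Stand-in for an OOD feature: the same point shifted along the null space.
h_ood = [h + 3.0 * d for h, d in zip(h_in, null_direction)]

print(h_in, free_energy(W, h_in))
print(h_ood, free_energy(W, h_ood))
```

The two feature vectors differ, yet they produce the same logits and therefore the same free energy. Shrinking the null space of the final layer, as the abstract proposes, leaves fewer directions along which this can happen.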

Title: Arabic Handwritten Document OCR Solution with Binarization and Adaptive Scale Fusion Detection

Authors: Alhossien Waly, Bassant Tarek, Ali Feteha, Rewan Yehia, Gasser Amr, Ahmed Fares
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01601
Pdf URL: https://arxiv.org/pdf/2412.01601
Copy Paste: [[2412.01601]] Arabic Handwritten Document OCR Solution with Binarization and Adaptive Scale Fusion Detection(https://arxiv.org/abs/2412.01601)
Keywords: segmentation
Abstract: The problem of converting images of text into plain text is a widely researched topic in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting styles and limited labeled data. In this paper we present a complete OCR pipeline that starts with line segmentation using Differentiable Binarization and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Following segmentation, a CNN-BiLSTM-CTC architecture is applied to recognize characters. Our system, trained on the Arabic Multi-Fonts Dataset (AMFDS), achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters, along with a CRR of 83.76% for sentences. These results demonstrate the system's strong performance in handling Arabic scripts, establishing a new benchmark for AHTR systems.

Title: Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking

Authors: Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01605
Pdf URL: https://arxiv.org/pdf/2412.01605
Copy Paste: [[2412.01605]] Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking(https://arxiv.org/abs/2412.01605)
Keywords: large language model
Abstract: Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance on CDM in real-world scenarios is limited due to the lack of comprehensive testing datasets that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches. The relevant dataset and code will be released upon acceptance of this paper.

Title: OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

Authors: Xuanyu Zhang, Zecheng Tang, Zhipei Xu, Runyi Li, Youmin Xu, Bin Chen, Feng Gao, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01615
Pdf URL: https://arxiv.org/pdf/2412.01615
Copy Paste: [[2412.01615]] OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking(https://arxiv.org/abs/2412.01615)
Keywords: protect, robust, extraction, watermark, generative
Abstract: With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.

Title: If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World

Authors: Adrian de Wynter
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2412.01617
Pdf URL: https://arxiv.org/pdf/2412.01617
Copy Paste: [[2412.01617]] If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World(https://arxiv.org/abs/2412.01617)
Keywords: large language model
Abstract: Loneliness, or the lack of fulfilling relationships, significantly impacts a person's mental and physical well-being and is prevalent worldwide. Previous research suggests that large language models (LLMs) may help mitigate loneliness. However, we argue that the use of widespread LLMs like ChatGPT is more prevalent--and riskier, as they are not designed for this purpose. To explore this, we analysed user interactions with ChatGPT, particularly those outside of its marketed use as a task-oriented assistant. In dialogues classified as lonely, users frequently (37%) sought advice or validation, and received good engagement. However, ChatGPT failed in sensitive scenarios, like responding appropriately to suicidal ideation or trauma. We also observed a 35% higher incidence of toxic content, with women being 22 times more likely to be targeted than men. Our findings underscore ethical and legal questions about this technology, and note risks like radicalisation or further isolation. We conclude with recommendations for research and industry to address loneliness.

Title: NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers

Authors: Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01621
Pdf URL: https://arxiv.org/pdf/2412.01621
Copy Paste: [[2412.01621]] NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers(https://arxiv.org/abs/2412.01621)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.

Title: Image Forgery Localization via Guided Noise and Multi-Scale Feature Aggregation

Authors: Yakun Niu, Pei Chen, Lei Zhang, Lei Tan, Yingjian Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01622
Pdf URL: https://arxiv.org/pdf/2412.01622
Copy Paste: [[2412.01622]] Image Forgery Localization via Guided Noise and Multi-Scale Feature Aggregation(https://arxiv.org/abs/2412.01622)
Keywords: robust, extraction
Abstract: Image Forgery Localization (IFL) technology aims to detect and locate the forged areas in an image, which is very important in the field of digital forensics. However, existing IFL methods suffer from feature degradation during training using multi-layer convolutions or the self-attention mechanism, and perform poorly in detecting small forged regions and in robustness against post-processing. To tackle these, we propose a guided and multi-scale feature aggregated network for IFL. Specifically, in order to comprehensively learn the noise feature under different types of forgery, we develop an effective noise extraction module in a guided way. Then, we design a Feature Aggregation Module (FAM) that uses dynamic convolution to adaptively aggregate RGB and noise features over multiple scales. Moreover, we propose an Atrous Residual Pyramid Module (ARPM) to enhance feature representation and capture both global and local features using different receptive fields to improve the accuracy and robustness of forgery localization. Extensive experiments on 5 public datasets have shown that our proposed model outperforms several state-of-the-art methods, especially on images with small forged regions.

Title: Using Large Language Models in Automatic Hint Ranking and Generation Tasks

Authors: Jamshid Mozafari, Florian Gerhold, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.01626
Pdf URL: https://arxiv.org/pdf/2412.01626
Copy Paste: [[2412.01626]] Using Large Language Models in Automatic Hint Ranking and Generation Tasks(https://arxiv.org/abs/2412.01626)
Keywords: large language model
Abstract: The use of Large Language Models (LLMs) has increased significantly recently, with individuals frequently interacting with chatbots to receive answers to a wide range of questions. In an era where information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and maintain strong reasoning skills. This paper addresses such challenges by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000 hints created for 1,000 questions. We then finetune open-source LLMs such as LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who try to answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HINTRANK, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves hint quality, and (c) encoder-based models perform better than decoder-based models in hint ranking.

Title: Review of Mathematical Optimization in Federated Learning

Authors: Shusen Yang, Fangyuan Zhao, Zihao Zhou, Liang Shi, Xuebin Ren, Zongben Xu
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.01630
Pdf URL: https://arxiv.org/pdf/2412.01630
Copy Paste: [[2412.01630]] Review of Mathematical Optimization in Federated Learning(https://arxiv.org/abs/2412.01630)
Keywords: privacy, federate
Abstract: Federated Learning (FL) has become a popular interdisciplinary research area in both applied mathematics and information sciences. Mathematically, FL aims to collaboratively optimize aggregate objective functions over distributed datasets while satisfying a variety of privacy and system constraints. Different from conventional distributed optimization methods, FL needs to address several specific issues (e.g., non-i.i.d. data distributions and differentially private noise), which pose a set of new challenges in the problem formulation, algorithm design, and convergence analysis. In this paper, we will systematically review existing FL optimization research including their assumptions, formulations, methods, and theoretical results. Potential future directions are also discussed.
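
For concreteness, one commonly used form of the aggregate objective mentioned in the abstract above is the sample-weighted finite sum below. The abstract does not state a formulation, so this is an assumption for illustration: $K$ is the number of clients, $\mathcal{D}_k$ the local dataset of client $k$ with $n_k$ samples, $\ell$ a per-sample loss, and $w$ the shared model parameters.

```latex
\min_{w \in \mathbb{R}^d} \; F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w),
\qquad
F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{D}_k} \ell(w; x_i, y_i),
\qquad
n = \sum_{k=1}^{K} n_k .
```

Non-i.i.d. data means the local objectives $F_k$ can differ substantially across clients, which is one reason the convergence analysis departs from conventional distributed optimization.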

Title: Linearly Homomorphic Signature with Tight Security on Lattice

Authors: Heng Guo, Kun Tian, Feng Liu, Zhiyong Zheng
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2412.01641
Pdf URL: https://arxiv.org/pdf/2412.01641
Copy Paste: [[2412.01641]] Linearly Homomorphic Signature with Tight Security on Lattice(https://arxiv.org/abs/2412.01641)
Keywords: security, attack
Abstract: At present, in lattice-based linearly homomorphic signature schemes, especially under the standard model, there are very few schemes with tight security. This paper constructs the first lattice-based linearly homomorphic signature scheme that achieves tight security against existential unforgeability under chosen-message attacks (EUF-CMA) in the standard model. Furthermore, among existing schemes, the scheme proposed in this paper also offers certain advantages in terms of public key size, signature length, and computational cost.

Title: Robust and Transferable Backdoor Attacks Against Deep Image Compression With Selective Frequency Prior

Authors: Yi Yu, Yufei Wang, Wenhan Yang, Lanqing Guo, Shijian Lu, Ling-Yu Duan, Yap-Peng Tan, Alex C. Kot
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01646
Pdf URL: https://arxiv.org/pdf/2412.01646
Copy Paste: [[2412.01646]] Robust and Transferable Backdoor Attacks Against Deep Image Compression With Selective Frequency Prior(https://arxiv.org/abs/2412.01646)
Keywords: attack, robust, segmentation
Abstract: Recent advancements in deep learning-based compression techniques have surpassed traditional methods. However, deep neural networks remain vulnerable to backdoor attacks, where pre-defined triggers induce malicious behaviors. This paper introduces a novel frequency-based trigger injection model for launching backdoor attacks with multiple triggers on learned image compression models. Inspired by the widely used DCT in compression codecs, triggers are embedded in the DCT domain. We design attack objectives tailored to diverse scenarios, including: 1) degrading compression quality in terms of bit-rate and reconstruction accuracy; 2) targeting task-driven measures like face recognition and semantic segmentation. To improve training efficiency, we propose a dynamic loss function that balances loss terms with fewer hyper-parameters, optimizing attack objectives effectively. For advanced scenarios, we evaluate the attack's resistance to defensive preprocessing and propose a two-stage training schedule with robust frequency selection to enhance resilience. To improve cross-model and cross-domain transferability for downstream tasks, we adjust the classification boundary in the attack loss during training. Experiments show that our trigger injection models, combined with minor modifications to encoder parameters, successfully inject multiple backdoors and their triggers into a single compression model, demonstrating strong performance and versatility. (*Due to the notification of arXiv "The Abstract field cannot be longer than 1,920 characters", the appeared Abstract is shortened. For the full Abstract, please download the Article.)

Title: Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks

Authors: Wenhan Dong, Chao Lin, Xinlei He, Xinyi Huang, Shengmin Xu
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01650
Pdf URL: https://arxiv.org/pdf/2412.01650
Copy Paste: [[2412.01650]] Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks(https://arxiv.org/abs/2412.01650)
Keywords: privacy, attack, robust, federate
Abstract: Privacy-preserving federated learning (PPFL) aims to train a global model for multiple clients while maintaining their data privacy. However, current PPFL protocols exhibit one or more of the following insufficiencies: considerable degradation in accuracy, the requirement for sharing keys, and cooperation during the key generation or decryption processes. As a mitigation, we develop the first protocol that utilizes neural networks to implement PPFL, as well as incorporating an Aggregatable Hybrid Encryption scheme tailored to the needs of PPFL. We name these networks as Homomorphic Adversarial Networks (HANs) which demonstrate that neural networks are capable of performing tasks similar to multi-key homomorphic encryption (MK-HE) while solving the problems of key distribution and collaborative decryption. Our experiments show that HANs are robust against privacy attacks. Compared with non-private federated learning, experiments conducted on multiple datasets demonstrate that HANs exhibit a negligible accuracy loss (at most 1.35%). Compared to traditional MK-HE schemes, HANs increase encryption aggregation speed by 6,075 times while incurring a 29.2 times increase in communication overhead.

Title: Verified Foundations for Differential Privacy

Authors: Markus de Medeiros, Muhammad Naveed, Tancrede Lepoint, Temesghen Kahsai, Tristan Ravitch, Stefan Zetzsche, Anjali Joshi, Joseph Tassarotti, Aws Albarghouthi, Jean-Baptiste Tristan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.01671
Pdf URL: https://arxiv.org/pdf/2412.01671
Copy Paste: [[2412.01671]] Verified Foundations for Differential Privacy(https://arxiv.org/abs/2412.01671)
Keywords: privacy
Abstract: Differential privacy (DP) has become the gold standard for privacy-preserving data analysis, but implementing it correctly has proven challenging. Prior work has focused on verifying DP at a high level, assuming the foundations are correct and a perfect source of randomness is available. However, the underlying theory of differential privacy can be very complex and subtle. Flaws in basic mechanisms and random number generation have been a critical source of vulnerabilities in real-world DP systems. In this paper, we present SampCert, the first comprehensive, mechanized foundation for differential privacy. SampCert is written in Lean with over 12,000 lines of proof. It offers a generic and extensible notion of DP, a framework for constructing and composing DP mechanisms, and formally verified implementations of Laplace and Gaussian sampling algorithms. SampCert provides (1) a mechanized foundation for developing the next generation of differentially private algorithms, and (2) mechanically verified primitives that can be deployed in production systems. Indeed, SampCert's verified algorithms power the DP offerings of Amazon Web Services (AWS), demonstrating its real-world impact. SampCert's key innovations include: (1) A generic DP foundation that can be instantiated for various DP definitions (e.g., pure, concentrated, Rényi DP); (2) formally verified discrete Laplace and Gaussian sampling algorithms that avoid the pitfalls of floating-point implementations; and (3) a simple probability monad and novel proof techniques that streamline the formalization. To enable proving complex correctness properties of DP and random number generation, SampCert makes heavy use of Lean's extensive Mathlib library, leveraging theorems in Fourier analysis, measure and probability theory, number theory, and topology.

Title: Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

Authors: Varun Belagali, Srikar Yellapragada, Alexandros Graikos, Saarthak Kapse, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01672
Pdf URL: https://arxiv.org/pdf/2412.01672
Copy Paste: [[2412.01672]] Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning(https://arxiv.org/abs/2412.01672)
Keywords: diffusion, generative
Abstract: Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these 'self-augmentations', i.e., generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.

Title: Causal Discovery by Interventions via Integer Programming

Authors: Abdelmonem Elrefaey, Rong Pan
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01674
Pdf URL: https://arxiv.org/pdf/2412.01674
Copy Paste: [[2412.01674]] Causal Discovery by Interventions via Integer Programming(https://arxiv.org/abs/2412.01674)
Keywords: robust
Abstract: Causal discovery is essential across various scientific fields to uncover causal structures within data. Traditional methods relying on observational data have limitations due to confounding variables. This paper presents an optimization-based approach using integer programming (IP) to design minimal intervention sets that ensure causal structure identifiability. Our method provides exact and modular solutions that can be adjusted to different experimental settings and constraints. Comparative analysis across different settings demonstrates its effectiveness, applicability, and robustness.

Title: Diffusion Models with Anisotropic Gaussian Splatting for Image Inpainting

Authors: Jacob Fein-Ashley, Benjamin Fein-Ashley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01682
Pdf URL: https://arxiv.org/pdf/2412.01682
Copy Paste: [[2412.01682]] Diffusion Models with Anisotropic Gaussian Splatting for Image Inpainting(https://arxiv.org/abs/2412.01682)
Keywords: diffusion
Abstract: Image inpainting is a fundamental task in computer vision, aiming to restore missing or corrupted regions in images realistically. While recent deep learning approaches have significantly advanced the state-of-the-art, challenges remain in maintaining structural continuity and generating coherent textures, particularly in large missing areas. Diffusion models have shown promise in generating high-fidelity images but often lack the structural guidance necessary for realistic inpainting. We propose a novel inpainting method that combines diffusion models with anisotropic Gaussian splatting to capture both local structures and global context effectively. By modeling missing regions using anisotropic Gaussian functions that adapt to local image gradients, our approach provides structural guidance to the diffusion-based inpainting network. The Gaussian splat maps are integrated into the diffusion process, enhancing the model's ability to generate high-fidelity and structurally coherent inpainting results. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques, producing visually plausible results with enhanced structural integrity and texture realism.

Title: Unlocking Video-LLM via Agent-of-Thoughts Distillation

Authors: Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01694
Pdf URL: https://arxiv.org/pdf/2412.01694
Copy Paste: [[2412.01694]] Unlocking Video-LLM via Agent-of-Thoughts Distillation(https://arxiv.org/abs/2412.01694)
Keywords: explainability, large language model
Abstract: This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.

Title: Uncertainty-Aware Regularization for Image-to-Image Translation

Authors: Anuja Vats, Ivar Farup, Marius Pedersen, Kiran Raja
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01705
Pdf URL: https://arxiv.org/pdf/2412.01705
Copy Paste: [[2412.01705]] Uncertainty-Aware Regularization for Image-to-Image Translation(https://arxiv.org/abs/2412.01705)
Keywords: robust
Abstract: The importance of quantifying uncertainty in deep networks has become paramount for reliable real-world applications. In this paper, we propose a method to improve uncertainty estimation in medical Image-to-Image (I2I) translation. Our model integrates aleatoric uncertainty and employs Uncertainty-Aware Regularization (UAR) inspired by simple priors to refine uncertainty estimates and enhance reconstruction quality. We show that by leveraging simple priors on parameters, our approach captures more robust uncertainty maps, effectively refining them to indicate precisely where the network encounters difficulties, while being less affected by noise. Our experiments demonstrate that UAR not only improves translation performance, but also provides better uncertainty estimations, particularly in the presence of noise and artifacts. We validate our approach using two medical imaging datasets, showcasing its effectiveness in maintaining high confidence in familiar regions while accurately identifying areas of uncertainty in novel/ambiguous scenarios.

Title: Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Authors: Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, Siheng Chen
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01708
Pdf URL: https://arxiv.org/pdf/2412.01708
Copy Paste: [[2412.01708]] Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review(https://arxiv.org/abs/2412.01708)
Keywords: robust, large language model
Abstract: Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlaps between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers compared to full papers and favoring well-known authors in a single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.

Title: Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Authors: Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.01711
Pdf URL: https://arxiv.org/pdf/2412.01711
Copy Paste: [[2412.01711]] Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models(https://arxiv.org/abs/2412.01711)
Keywords: interpretability, large language model
Abstract: Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the LLM output at decoding-time. This approach combines resource efficiency with interpretability and can be optimized for mitigating specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics while preserving language model performance.
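
The decoding-time idea described above can be sketched as follows. The abstract does not give the exact combination rule, so the additive form used here (base logits plus a scaled difference between the anti-biased and biased expert logits) is an assumption, as are the function names and the `alpha` strength parameter.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debiased_distribution(
    base_logits: List[float],
    anti_biased_logits: List[float],
    biased_logits: List[float],
    alpha: float = 1.0,
) -> List[float]:
    """Next-token distribution after adding a debiasing signal to the base logits.

    Assumed rule: signal = anti_biased - biased, scaled by alpha.
    alpha = 0 recovers the unmodified base model.
    """
    combined = [
        base + alpha * (anti - biased)
        for base, anti, biased in zip(base_logits, anti_biased_logits, biased_logits)
    ]
    return softmax(combined)
```

Under this assumed rule, the per-token signal `anti - biased` can be inspected directly, which shows which candidate tokens the experts push up or down at each decoding step.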

Title: Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Authors: Zeyu Yang, Zijie Pan, Yuankun Yang, Xiatian Zhu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01717
Pdf URL: https://arxiv.org/pdf/2412.01717
Copy Paste: [[2412.01717]] Driving Scene Synthesis on Free-form Trajectories with Generative Prior(https://arxiv.org/abs/2412.01717)
Keywords: diffusion, generative
Abstract: Driving scene synthesis along free-form trajectories is essential for driving simulations to enable closed-loop evaluation of end-to-end driving policies. While existing methods excel at novel view synthesis on recorded trajectories, they face challenges with novel trajectories due to limited views of driving videos and the vastness of driving environments. To tackle this challenge, we propose a novel free-form driving view synthesis approach, dubbed DriveX, by leveraging video generative prior to optimize a 3D model across a variety of trajectories. Concretely, we crafted an inverse problem that enables a video diffusion model to be utilized as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). To seamlessly use the generative prior, we iteratively conduct this process during optimization. Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.

Title: HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving

Authors: Hongyu Zhou, Longzhong Lin, Jiabao Wang, Yichong Lu, Dongfeng Bai, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01718
Pdf URL: https://arxiv.org/pdf/2412.01718
Copy Paste: [[2412.01718]] HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving(https://arxiv.org/abs/2412.01718)
Keywords: fair
Abstract: In the past few decades, autonomous driving algorithms have made significant progress in perception, planning, and control. However, evaluating individual components does not fully reflect the performance of entire systems, highlighting the need for more holistic assessment methods. This motivates the development of HUGSIM, a closed-loop, photo-realistic, and real-time simulator for evaluating autonomous driving algorithms. We achieve this by lifting captured 2D RGB images into the 3D space via 3D Gaussian Splatting, improving the rendering quality for closed-loop scenarios, and building the closed-loop environment. In terms of rendering, we tackle challenges of novel view synthesis in closed-loop scenarios, including viewpoint extrapolation and 360-degree vehicle rendering. Beyond novel view synthesis, HUGSIM further enables the full closed simulation loop, dynamically updating the ego and actor states and observations based on control commands. Moreover, HUGSIM offers a comprehensive benchmark across more than 70 sequences from KITTI-360, Waymo, nuScenes, and PandaSet, along with over 400 varying scenarios, providing a fair and realistic evaluation platform for existing autonomous driving algorithms. HUGSIM not only serves as an intuitive evaluation benchmark but also unlocks the potential for fine-tuning autonomous driving algorithms in a photorealistic closed-loop setting.

Title: LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Authors: Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01720
Pdf URL: https://arxiv.org/pdf/2412.01720
Copy Paste: [[2412.01720]] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant(https://arxiv.org/abs/2412.01720)
Keywords: robust, generative
Abstract: With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominately rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.

Title: BroadTrack: Broadcast Camera Tracking for Soccer

Authors: Floriane Magera, Thomas Hoyoux, Olivier Barnich, Marc Van Droogenbroeck
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01721
Pdf URL: https://arxiv.org/pdf/2412.01721
Copy Paste: [[2412.01721]] BroadTrack: Broadcast Camera Tracking for Soccer(https://arxiv.org/abs/2412.01721)
Keywords: robust
Abstract: Camera calibration and localization, sometimes simply named camera calibration, enables many applications in the context of soccer broadcasting, for instance regarding the interpretation and analysis of the game, or the insertion of augmented reality graphics for storytelling or refereeing purposes. To contribute to such applications, the research community has typically focused on single-view calibration methods, leveraging the near-omnipresence of soccer field markings in wide-angle broadcast views, but leaving all temporal aspects, if considered at all, to general-purpose tracking or filtering techniques. Only a few contributions have been made to leverage any domain-specific knowledge for this tracking task, and, as a result, there is still no truly performant, off-the-shelf camera tracking system tailored for soccer broadcasting, specifically for elevated tripod-mounted cameras around the stadium. In this work, we present such a system capable of addressing the task of soccer broadcast camera tracking efficiently, robustly, and accurately, outperforming the most precise state-of-the-art methods by a wide margin. By combining the available open-source soccer field detectors with carefully designed camera and tripod models, our tracking system, BroadTrack, halves the mean reprojection error rate and gains more than 15% in terms of Jaccard index for camera calibration on the SoccerNet dataset. Furthermore, as the SoccerNet dataset videos are relatively short (30 seconds), we also present qualitative results on a 20-minute broadcast clip to showcase the robustness and the soundness of our system.

Title: Attacks on multimodal models

Authors: Viacheslav Iablochnikov, Alexander Rogachev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01725
Pdf URL: https://arxiv.org/pdf/2412.01725
Copy Paste: [[2412.01725]] Attacks on multimodal models(https://arxiv.org/abs/2412.01725)
Keywords: attack
Abstract: Today, models capable of working with various modalities simultaneously in a chat format are gaining increasing popularity. Despite this, there is an issue of potential attacks on these models, especially considering that many of them include open-source components. It is important to study whether the vulnerabilities of these components are inherited and how dangerous this can be when using such models in the industry. This work is dedicated to researching various types of attacks on such models and evaluating their generalization capabilities. Modern VLM models (LLaVA, BLIP, etc.) often use pre-trained parts from other models, so the main part of this research focuses on them, specifically on the CLIP architecture and its image encoder (CLIP-ViT) and various patch attack variations for it.

Title: Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Authors: Sangyeon Yoon, Wonje Jeung, Albert No
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01756
Pdf URL: https://arxiv.org/pdf/2412.01756
Copy Paste: [[2412.01756]] Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios(https://arxiv.org/abs/2412.01756)
Keywords: privacy
Abstract: Auditing Differentially Private Stochastic Gradient Descent (DP-SGD) in the final model setting is challenging and often results in empirical lower bounds that are significantly looser than theoretical privacy guarantees. We introduce a novel auditing method that achieves tighter empirical lower bounds without additional assumptions by crafting worst-case adversarial samples through loss-based input-space auditing. Our approach surpasses traditional canary-based heuristics and is effective in both white-box and black-box scenarios. Specifically, with a theoretical privacy budget of $\varepsilon = 10.0$, our method achieves empirical lower bounds of $6.68$ in white-box settings and $4.51$ in black-box settings, compared to the baseline of $4.11$ for MNIST. Moreover, we demonstrate that significant privacy auditing results can be achieved using in-distribution (ID) samples as canaries, obtaining an empirical lower bound of $4.33$ where traditional methods produce near-zero leakage detection. Our work offers a practical framework for reliable and accurate privacy auditing in differentially private machine learning.
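
For context on what an empirical lower bound means here, the sketch below shows the standard hypothesis-testing conversion from a membership-inference attack's error rates to a lower bound on $\varepsilon$. The abstract does not state which estimator the authors use, so this is an assumption. The function name is hypothetical, and a real audit would replace the point estimates of the error rates with confidence bounds.

```python
import math


def empirical_epsilon_lower_bound(fpr: float, fnr: float, delta: float = 0.0) -> float:
    """Lower bound on epsilon implied by an attack's false-positive and false-negative rates.

    Any (epsilon, delta)-DP mechanism forces every distinguishing attack to satisfy
        fpr + exp(epsilon) * fnr >= 1 - delta
        fnr + exp(epsilon) * fpr >= 1 - delta
    Solving each inequality for epsilon and taking the larger value gives the bound.
    """

    def one_side(err_a: float, err_b: float) -> float:
        numerator = 1.0 - delta - err_a
        if numerator <= 0.0:
            return 0.0  # this inequality carries no information
        if err_b == 0.0:
            return math.inf  # point estimate of zero error gives an unbounded value
        return math.log(numerator / err_b)

    return max(0.0, one_side(fpr, fnr), one_side(fnr, fpr))
```

A stronger attack (lower error rates) yields a larger bound, which is why crafting worst-case samples can tighten the audit relative to weaker canaries.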

Title: XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01762
Pdf URL: https://arxiv.org/pdf/2412.01762
Copy Paste: [[2412.01762]] XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation(https://arxiv.org/abs/2412.01762)
Keywords: generative
Abstract: Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers so that the community can train subsequent generative models on them directly or fine-tune them for specialized tasks.
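
As a reminder of what the two simplest schemes in that list do, here is a small dependency-free sketch of vector quantization (nearest-codebook lookup) and residual quantization (quantizing the leftover error with successive codebooks). It is a generic illustration, not XQ-GAN's implementation. The function names and the toy codebooks are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Vector quantization: index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def residual_quantize(
    vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Residual quantization: each stage quantizes what the previous stages missed.

    Returns the chosen index per stage and the summed reconstruction.
    """
    residual = list(vec)
    reconstruction = [0.0] * len(vec)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, reconstruction


# Toy usage: a coarse codebook followed by a fine one.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
tokens, approx = residual_quantize([1.2, 0.8], [coarse, fine])
```

With a single codebook, `residual_quantize` reduces to plain VQ. Adding stages trades more tokens per position for lower reconstruction error.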

Title: HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Authors: Lajos Muzsai, David Imolai, András Lukács
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01778
Pdf URL: https://arxiv.org/pdf/2412.01778
Copy Paste: [[2412.01778]] HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing(https://arxiv.org/abs/2412.01778)
Keywords: security, robust, large language model
Abstract: We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth's dual-module architecture includes a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets utilizing the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM-based penetration testing agents. Based on these benchmarks, extensive experiments are presented, analyzing the core parameters of HackSynth, including creativity (temperature and top-p) and token utilization. Multiple open source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT-4o model, better than what GPT-4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.

Title: Identifying Reliable Predictions in Detection Transformers

Authors: Young-Jin Park, Carson Sobolewski, Navid Azizan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01782
Pdf URL: https://arxiv.org/pdf/2412.01782
Copy Paste: [[2412.01782]] Identifying Reliable Predictions in Detection Transformers(https://arxiv.org/abs/2412.01782)
Keywords: transformer
Abstract: DEtection TRansformer (DETR) has emerged as a promising architecture for object detection, offering an end-to-end prediction pipeline. In practice, however, DETR generates hundreds of predictions that far outnumber the actual number of objects present in an image. This raises the question: can we trust and use all of these predictions? Addressing this concern, we present empirical evidence highlighting how different predictions within the same image play distinct roles, resulting in varying reliability levels across those predictions. More specifically, while multiple predictions are often made for a single object, our findings show that most often one such prediction is well-calibrated, and the others are poorly calibrated. Based on these insights, we demonstrate that identifying a reliable subset of DETR's predictions is crucial for accurately assessing the reliability of the model at both object and image levels. Building on this viewpoint, we first tackle the shortcomings of widely used performance and calibration metrics, such as average precision and various forms of expected calibration error. Specifically, they are inadequate for determining which subset of DETR's predictions should be trusted and utilized. In response, we present Object-level Calibration Error (OCE), which is capable of assessing the calibration quality both across different models and among various configurations within a specific model. As a final contribution, we introduce a post hoc Uncertainty Quantification (UQ) framework that predicts the accuracy of the model on a per-image basis. By contrasting the average confidence scores of positive (i.e., likely to be matched) and negative predictions determined by OCE, the framework assesses the reliability of the DETR model for each test image.

Title: Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions

Authors: Chaoran Cheng, Boran Han, Danielle C. Maddix, Abdul Fatir Ansari, Andrew Stuart, Michael W. Mahoney, Yuyang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.01786
Pdf URL: https://arxiv.org/pdf/2412.01786
Copy Paste: [[2412.01786]] Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions(https://arxiv.org/abs/2412.01786)
Keywords: generative
Abstract: Generative models that satisfy hard constraints are crucial in many scientific and engineering applications where physical laws or system requirements must be strictly respected. However, many existing constrained generative models, especially those developed for computer vision, rely heavily on gradient information, which is often sparse or computationally expensive in fields like partial differential equations (PDEs). In this work, we introduce a novel framework for adapting pre-trained, unconstrained flow-matching models to satisfy constraints exactly in a zero-shot manner without requiring expensive gradient computations or fine-tuning. Our framework, ECI sampling, alternates between extrapolation (E), correction (C), and interpolation (I) stages during each iterative sampling step of flow matching sampling to ensure accurate integration of constraint information while preserving the validity of the generation. We demonstrate the effectiveness of our approach across various PDE systems, showing that ECI-guided generation strictly adheres to physical constraints and accurately captures complex distribution shifts induced by these constraints. Empirical results demonstrate that our framework consistently outperforms baseline approaches in various zero-shot constrained generation tasks and also achieves competitive results in the regression tasks without additional fine-tuning.
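
The extrapolation, correction, and interpolation stages can be pictured with the toy loop below. It is a sketch under several assumptions that the abstract does not confirm: a straight-line flow-matching path from noise (t = 0) to data (t = 1), reuse of the initial noise sample in the interpolation stage, and a simple overwrite as the correction. The velocity field, the constraint, and all names are hypothetical stand-ins for a pre-trained model and a PDE boundary condition.

```python
import random
from typing import Callable, List

Vec = List[float]


def eci_sample(
    velocity: Callable[[Vec, float], Vec],
    correct: Callable[[Vec], Vec],
    dim: int,
    steps: int = 10,
    seed: int = 0,
) -> Vec:
    """Gradient-free constrained sampling loop (toy version)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = list(noise)
    for k in range(steps):
        t, t_next = k / steps, (k + 1) / steps
        v = velocity(x, t)
        # E: extrapolate the current state to an estimate of the final sample.
        x1_hat = [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]
        # C: overwrite the estimate so it satisfies the hard constraint exactly.
        x1_hat = correct(x1_hat)
        # I: interpolate back to the next time step along the straight path.
        x = [(1.0 - t_next) * n + t_next * h for n, h in zip(noise, x1_hat)]
    return x


# Hypothetical stand-ins: a velocity field that transports everything to TARGET,
# and a constraint that pins the first coordinate (like a boundary value).
TARGET = [1.0, 2.0, 3.0]


def toy_velocity(x: Vec, t: float) -> Vec:
    return [(m - xi) / (1.0 - t) for m, xi in zip(TARGET, x)]


def pin_first(x: Vec) -> Vec:
    return [5.0] + list(x[1:])


sample = eci_sample(toy_velocity, pin_first, dim=3)
```

Because the last interpolation step lands at t = 1, the returned sample equals the corrected estimate, so the constraint holds exactly without any gradient of the constraint or of the model.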

Title: Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Authors: Rongkun Xue, Jinouwen Zhang, Yazhe Niu, Dazhong Shen, Bingqi Ma, Yu Liu, Jing Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01787
Pdf URL: https://arxiv.org/pdf/2412.01787
Copy Paste: [[2412.01787]] Pretrained Reversible Generation as Unsupervised Visual Representation Learning(https://arxiv.org/abs/2412.01787)
Keywords: robust, generative
Abstract: Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous flow model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model-based methods, including 78% top-1 accuracy on ImageNet. Extensive ablation studies further validate the effectiveness of our approach.

Title: CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion

Authors: Kai He, Chin-Hsuan Wu, Igor Gilitschenski
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.01792
Pdf URL: https://arxiv.org/pdf/2412.01792
Copy Paste: [[2412.01792]] CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion(https://arxiv.org/abs/2412.01792)
Keywords: diffusion
Abstract: Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to "learn" the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.

Title: IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models

Authors: Khaled Abud, Sergey Lavrushkin, Alexey Kirillov, Dmitriy Vatolin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01794
Pdf URL: https://arxiv.org/pdf/2412.01794
Copy Paste: [[2412.01794]] IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models(https://arxiv.org/abs/2412.01794)
Keywords: robust, diffusion, generative
Abstract: Diffusion-based models have recently transformed conditional image generation, achieving unprecedented fidelity in generating photorealistic and semantically accurate images. However, consistently generating high-quality images remains challenging, partly due to the lack of mechanisms for conditioning outputs on perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. First, we experiment with gradient-based guidance to optimize image quality directly and show this approach has limited generalizability. To address this, we introduce IQA-Adapter, a novel architecture that conditions generation on target quality levels by learning the relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter shifts the distribution of generated images towards a higher-quality subdomain. This approach achieves up to a 10% improvement across multiple objective metrics, as confirmed by a subjective study, while preserving generative diversity and content. Additionally, IQA-Adapter can be used inversely as a degradation model, generating progressively more distorted images when conditioned on lower quality scores. Our quality-aware methods also provide insights into the adversarial robustness of IQA models, underscoring the potential of quality conditioning in generative modeling and the importance of robust IQA methods.

Title: PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01800
Pdf URL: https://arxiv.org/pdf/2412.01800
Copy Paste: [[2412.01800]] PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos(https://arxiv.org/abs/2412.01800)
Keywords: large language model
Abstract: Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and across 12 distinct physical commonsense. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset PhysInstruct with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset PhysDPO with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking) and lower spatial resolutions (i.e., spatial hacking). Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both physical-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.

Title: SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Authors: Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01801
Pdf URL: https://arxiv.org/pdf/2412.01801
Copy Paste: [[2412.01801]] SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation(https://arxiv.org/abs/2412.01801)
Keywords: diffusion
Abstract: We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, or changing the size of the semantic 3D proxy boxes, which guide high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.

Title: V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

Authors: Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z. Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, Jacob Pham, Xin Xia, Zhiyu Huang, Bolei Zhou, Jiaqi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01812
Pdf URL: https://arxiv.org/pdf/2412.01812
Copy Paste: [[2412.01812]] V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction(https://arxiv.org/abs/2412.01812)
Keywords: transformer
Abstract: Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on temporal perception and prediction tasks in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with various fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatiotemporal relationships across temporal per-frame, spatial per-agent, and high-definition map. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X cooperation modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate our framework outperforms state-of-the-art methods in both perception and prediction tasks.

Title: COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Authors: Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01814
Pdf URL: https://arxiv.org/pdf/2412.01814
Copy Paste: [[2412.01814]] COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training(https://arxiv.org/abs/2412.01814)
Keywords: segmentation
Abstract: Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

Title: Efficient Semantic Communication Through Transformer-Aided Compression

Authors: Matin Mortaheb, Mohammad A. Amir Khojastepour, Sennur Ulukus
Subjects: cs.LG, cs.CV, cs.IT, eess.SP
Abstract URL: https://arxiv.org/abs/2412.01817
Pdf URL: https://arxiv.org/pdf/2412.01817
Copy Paste: [[2412.01817]] Efficient Semantic Communication Through Transformer-Aided Compression(https://arxiv.org/abs/2412.01817)
Keywords: transformer
Abstract: Transformers, known for their attention mechanisms, have proven highly effective in focusing on critical elements within complex data. This feature can effectively be used to address the time-varying channels in wireless communication systems. In this work, we introduce a channel-aware adaptive framework for semantic communication, where different regions of the image are encoded and compressed based on their semantic content. By employing vision transformers, we interpret the attention mask as a measure of the semantic contents of the patches and dynamically categorize the patches to be compressed at various rates as a function of the instantaneous channel bandwidth. Our method enhances communication efficiency by adapting the encoding resolution to the content's relevance, ensuring that even in highly constrained environments, critical information is preserved. We evaluate the proposed adaptive transmission framework using the TinyImageNet dataset, measuring both reconstruction quality and accuracy. The results demonstrate that our approach maintains high semantic fidelity while optimizing bandwidth, providing an effective solution for transmitting multi-resolution data in limited bandwidth conditions.

Title: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

Authors: Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01818
Pdf URL: https://arxiv.org/pdf/2412.01818
Copy Paste: [[2412.01818]] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster(https://arxiv.org/abs/2412.01818)
Keywords: large language model
Abstract: Large vision-language models (VLMs) often rely on a substantial number of visual tokens when interacting with large language models (LLMs), which has proven to be inefficient. Recent efforts have aimed to accelerate VLM inference by pruning visual tokens. Most existing methods assess the importance of visual tokens based on the text-visual cross-attentions in LLMs. In this study, we find that the cross-attentions between text and visual tokens in LLMs are inaccurate. Pruning tokens based on these inaccurate attentions leads to significant performance degradation, especially at high reduction ratios. To this end, we introduce FasterVLM, a simple yet effective training-free visual token pruning method that evaluates the importance of visual tokens more accurately by utilizing attentions between the [CLS] token and image tokens from the visual encoder. FasterVLM eliminates redundant visual tokens immediately after the visual encoder, ensuring they do not interact with LLMs and resulting in faster VLM inference. It is worth noting that, benefiting from the accuracy of [CLS] cross-attentions, FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B. We apply FasterVLM to various VLMs, including LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, to demonstrate its effectiveness. Experimental results show that our FasterVLM maintains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing text-visual attention-based methods. Our code is available at this https URL.
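
The pruning step described in this abstract can be illustrated with a minimal sketch. It assumes the [CLS]-to-image-token attention scores are already available as a flat list; the function name `prune_visual_tokens` and the `keep_ratio` parameter are hypothetical and are not taken from the FasterVLM code.

```python
def prune_visual_tokens(cls_attention, keep_ratio):
    """Return the indices of the visual tokens to keep, in original order.

    cls_attention: one attention score per visual token (assumed input).
    keep_ratio: fraction of tokens to keep, in (0, 1] (hypothetical parameter).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, round(len(cls_attention) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(cls_attention)),
                    key=lambda i: cls_attention[i],
                    reverse=True)
    # Re-sort the survivors so they stay in their original spatial order.
    return sorted(ranked[:num_keep])


scores = [0.02, 0.30, 0.05, 0.25, 0.01, 0.10, 0.07, 0.20]
print(prune_visual_tokens(scores, keep_ratio=0.25))  # [1, 3]
```

Because the selection depends only on scores from the visual encoder, the dropped tokens never need to be passed to the language model, which is the property the abstract credits for the speedup.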

Title: Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Authors: Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01819
Pdf URL: https://arxiv.org/pdf/2412.01819
Copy Paste: [[2412.01819]] Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis(https://arxiv.org/abs/2412.01819)
Keywords: diffusion, transformer
Abstract: This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
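
As a rough illustration of disabling guidance at high-resolution scales, the sketch below uses the standard classifier-free guidance combination, uncond + w * (cond - uncond). The abstract does not say how Switti combines predictions or where the cutoff lies, so the function name, the `last_guided_scale` threshold, and the callables standing in for model passes are all assumptions.

```python
def predict_at_scale(cond_fn, uncond_fn, scale_index,
                     guidance_scale, last_guided_scale):
    """Return the prediction for one scale, with guidance only at early scales.

    cond_fn / uncond_fn: zero-argument callables standing in for the
    conditional and unconditional model passes (hypothetical stand-ins).
    last_guided_scale: highest scale index that still uses guidance
    (hypothetical threshold).
    """
    cond = cond_fn()
    if scale_index > last_guided_scale:
        # Guidance disabled: the unconditional pass is never run.
        return cond
    uncond = uncond_fn()
    # Standard classifier-free guidance combination.
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


low = predict_at_scale(lambda: [1.0, 2.0], lambda: [0.0, 1.0],
                       scale_index=0, guidance_scale=3.0, last_guided_scale=5)
print(low)  # [3.0, 4.0]
```

In this sketch the saving at unguided scales comes from skipping the unconditional pass; whether that is the sole source of the reported acceleration is not stated in the abstract.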

Title: World-consistent Video Diffusion with Explicit 3D Modeling

Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01821
Pdf URL: https://arxiv.org/pdf/2412.01821
Copy Paste: [[2412.01821]] World-consistent Video Diffusion with Explicit 3D Modeling(https://arxiv.org/abs/2412.01821)
Keywords: diffusion, transformer
Abstract: Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.

Title: X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01824
Pdf URL: https://arxiv.org/pdf/2412.01824
Copy Paste: [[2412.01824]] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models(https://arxiv.org/abs/2412.01824)
Keywords: large language model
Abstract: In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large-vision language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.

Title: RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Authors: Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01827
Pdf URL: https://arxiv.org/pdf/2412.01827
Copy Paste: [[2412.01827]] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders(https://arxiv.org/abs/2412.01827)
Keywords: transformer
Abstract: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences, a more challenging task than fixed-order generation, RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at this https URL.
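
The "position instruction token" idea can be sketched as a sequence-building step: shuffle the raster positions, then emit a position token before each image token. The tuple encoding and both function names below are hypothetical and only illustrate the interleaving; they do not reflect RandAR's actual tokenization.

```python
import random


def build_random_order_sequence(image_tokens, rng):
    """Interleave position instruction tokens with image tokens in random order.

    image_tokens: image token ids in raster order (assumed input).
    rng: a random.Random instance, passed in for reproducibility.
    """
    order = list(range(len(image_tokens)))
    rng.shuffle(order)
    sequence = []
    for pos in order:
        sequence.append(("pos", pos))                # where the next token goes
        sequence.append(("img", image_tokens[pos]))  # the token at that location
    return sequence


def restore_raster_order(sequence):
    """Place each image token back at the location named by its position token."""
    tokens = [None] * (len(sequence) // 2)
    for i in range(0, len(sequence), 2):
        (_, pos), (_, tok) = sequence[i], sequence[i + 1]
        tokens[pos] = tok
    return tokens


seq = build_random_order_sequence([10, 11, 12, 13], random.Random(0))
print(restore_raster_order(seq))  # [10, 11, 12, 13]
```

Since every image token carries its own location, the image can be recovered from any permutation, which is what makes arbitrary generation orders possible in this sketch.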