2024-12-06

Title: Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of Language Models

Authors: Hyegang Son, Yonglak Son, Changhoon Kim, Young Geun Kim
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03587
Pdf URL: https://arxiv.org/pdf/2412.03587
Copy Paste: [[2412.03587]] Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of Language Models(https://arxiv.org/abs/2412.03587)
Keywords: transformer
Abstract: Transformer-based large-scale pre-trained models achieve great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard practice to utilize these models for downstream tasks. Recent work has developed adapter-tuning, but these approaches still require relatively high resource usage. Through our investigation, we show that each adapter in adapter-tuning does not have the same impact on task performance and resource usage. Based on our findings, we propose SAFE, which gradually freezes less-important adapters that do not contribute to adaptation during the early training steps. In our experiments, SAFE reduces memory usage, computation amount, and training time by 42.85%, 34.59%, and 11.82%, respectively, while achieving comparable or better performance compared to the baseline. We also demonstrate that SAFE induces a regularization effect, thereby smoothing the loss landscape.
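
The gradual freezing idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how adapter importance is measured, so the importance scores, the threshold, and the linear schedule (`freeze_end_step`) below are hypothetical stand-ins.

```python
def adapters_to_freeze(importance, step, freeze_end_step, threshold=0.1):
    """Return the names of adapters that should be frozen at `step`.

    importance: dict mapping adapter name -> hypothetical score in [0, 1].
    Adapters scoring below `threshold` are freezing candidates. They are
    frozen gradually, least important first, over the early training steps
    (from step 0 up to `freeze_end_step`).
    """
    # Candidates ordered from least to most important.
    candidates = sorted(
        (name for name, score in importance.items() if score < threshold),
        key=lambda name: importance[name],
    )
    # Fraction of the freezing window that has elapsed, clamped to [0, 1].
    progress = min(max(step / freeze_end_step, 0.0), 1.0)
    n_freeze = int(progress * len(candidates))
    return candidates[:n_freeze]


# Hypothetical scores for four adapters.
scores = {"adapter_0": 0.02, "adapter_1": 0.50, "adapter_2": 0.05, "adapter_3": 0.09}
frozen_midway = adapters_to_freeze(scores, step=50, freeze_end_step=100)
```

In a real training loop, the returned adapters would have their parameters excluded from gradient computation, which is where the memory and compute savings would come from.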

Title: Enhancing Document AI Data Generation Through Graph-Based Synthetic Layouts

Authors: Amit Agarwal, Hitesh Patel, Priyaranjan Pattnayak, Srikant Panda, Bhargava Kumar, Tejaswini Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03590
Pdf URL: https://arxiv.org/pdf/2412.03590
Copy Paste: [[2412.03590]] Enhancing Document AI Data Generation Through Graph-Based Synthetic Layouts(https://arxiv.org/abs/2412.03590)
Keywords: privacy, robust, extraction
Abstract: The development of robust Document AI models has been constrained by limited access to high-quality, labeled datasets, primarily due to data privacy concerns, scarcity, and the high cost of manual annotation. Traditional methods of synthetic data generation, such as text and image augmentation, have proven effective for increasing data diversity but often fail to capture the complex layout structures present in real-world documents. This paper proposes a novel approach to synthetic document layout generation using Graph Neural Networks (GNNs). By representing document elements (e.g., text blocks, images, tables) as nodes in a graph and their spatial relationships as edges, GNNs are trained to generate realistic and diverse document layouts. This method leverages graph-based learning to ensure structural coherence and semantic consistency, addressing the limitations of traditional augmentation techniques. The proposed framework is evaluated on tasks such as document classification, named entity recognition (NER), and information extraction, demonstrating significant performance improvements. Furthermore, we address the computational challenges of GNN-based synthetic data generation and propose solutions to mitigate domain adaptation issues between synthetic and real-world datasets. Our experimental results show that graph-augmented document layouts outperform existing augmentation techniques, offering a scalable and flexible solution for training Document AI models.

Title: CovidLLM: A Robust Large Language Model with Missing Value Adaptation and Multi-Objective Learning Strategy for Predicting Disease Severity and Clinical Outcomes in COVID-19 Patients

Authors: Shengjun Zhu (1), Siyu Liu (2), Yang Li (3), Qing Lei, Hongyan Hou, Hewei Jiang, Shujuan Guo, Feng Wang, Rongshang Chen, Xionglin Fan, Shengce Tao, Jiaxin Cai ((1) School of Mathematics and Statistics, Xiamen University of Technology, Xiamen, China, (2) School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, China, (3) Shanghai Center for Systems Biomedicine, Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Jiao Tong University, Shanghai, China)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03593
Pdf URL: https://arxiv.org/pdf/2412.03593
Copy Paste: [[2412.03593]] CovidLLM: A Robust Large Language Model with Missing Value Adaptation and Multi-Objective Learning Strategy for Predicting Disease Severity and Clinical Outcomes in COVID-19 Patients(https://arxiv.org/abs/2412.03593)
Keywords: robust, large language model
Abstract: Coronavirus Disease 2019 (COVID-19), which emerged in 2019, has caused millions of deaths worldwide. Although effective vaccines have been developed to mitigate severe symptoms, certain populations, particularly the elderly and those with comorbidities, remain at high risk for severe outcomes and increased mortality. Consequently, early identification of the severity and clinical outcomes of the disease in these patients is vital to prevent adverse prognoses. Although traditional machine learning and deep learning models have been widely employed in this area, the potential of large language models (LLMs) remains largely unexplored. Our research focuses primarily on constructing specialized prompts and adopting multi-objective learning strategies. We started by selecting serological indicators that significantly correlate with clinical outcomes and disease severity to serve as input data for the model. Blood test samples often contain numerous missing values, and traditional models generally rely on imputation to handle these gaps in the data. In contrast, LLMs offer the advantage of robust semantic understanding. By setting prompts, we can explicitly inform the model when a feature's value is missing, without the need for imputation. For the multi-objective learning strategy, the model is designed to first predict disease severity and then predict clinical outcomes. Given that LLMs utilize both the input text and the generated tokens as input for generating the next token, the predicted severity is used as a basis for generating the clinical outcome. During the fine-tuning of the LLM, the two objectives influence and improve each other. Our experiments were implemented based on the ChatGLM model. The results demonstrate the effectiveness of LLMs in this task, suggesting promising potential for further development.

Title: The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

Authors: Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.03597
Pdf URL: https://arxiv.org/pdf/2412.03597
Copy Paste: [[2412.03597]] The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?(https://arxiv.org/abs/2412.03597)
Keywords: large language model
Abstract: The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and MMLU. These vulnerabilities manifest through benchmark exploitation, dataset contamination, and evaluation bias, creating a false perception of progress in language understanding capabilities. Through extensive review of contemporary evaluation approaches, we identify significant limitations in static benchmark designs, human evaluation protocols, and LLM-as-judge frameworks, all of which compromise the reliability of current performance assessments. As LLM capabilities evolve and existing benchmarks become redundant, we lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This requires frameworks that are adapted dynamically, addressing current limitations and providing a more accurate reflection of LLM performance.

Title: CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models

Authors: Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03599
Pdf URL: https://arxiv.org/pdf/2412.03599
Copy Paste: [[2412.03599]] CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models(https://arxiv.org/abs/2412.03599)
Keywords: robust, large language model
Abstract: Large language models have transformed the comprehension and generation of natural language tasks, but they come with substantial memory and computational requirements. Quantization techniques have emerged as a promising avenue for addressing these challenges while preserving accuracy and improving energy efficiency. We propose CPTQuant, a comprehensive strategy that introduces correlation-based (CMPQ), pruning-based (PMPQ), and Taylor decomposition-based (TDMPQ) mixed precision techniques. CMPQ adapts the precision level based on canonical correlation analysis of different layers. PMPQ optimizes precision layer-wise based on their sensitivity to sparsity. TDMPQ modifies precision using Taylor decomposition to assess each layer's sensitivity to input perturbation. These strategies allocate higher precision to more sensitive layers while diminishing precision to robust layers. CPTQuant assesses the performance across BERT, OPT-125M, OPT-350M, OPT-1.3B, and OPT-2.7B. We demonstrate up to 4x compression and a 2-fold increase in efficiency with minimal accuracy drop compared to Hugging Face FP16. PMPQ stands out for achieving a considerably higher model compression. Sensitivity analyses across various LLMs show that the initial and final 30% of layers exhibit higher sensitivities than the remaining layers. PMPQ demonstrates an 11% higher compression ratio than other methods for classification tasks, while TDMPQ achieves a 30% greater compression ratio for language modeling tasks.
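
The shared principle of the three strategies (more bits for sensitive layers, fewer for robust ones) can be sketched as follows. This is an illustration under assumptions, not CPTQuant itself: the sensitivity scores are made up, and the bit widths and `top_fraction` split are hypothetical parameters.

```python
def assign_bit_widths(sensitivity, high_bits=8, low_bits=4, top_fraction=0.5):
    """Map each layer name to a bit width based on its sensitivity score.

    sensitivity: dict mapping layer name -> hypothetical sensitivity score
    (higher means the layer is hurt more by low precision).
    The most sensitive `top_fraction` of layers get `high_bits`;
    the remaining, more robust layers get `low_bits`.
    """
    # Layers ordered from most to least sensitive.
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = int(top_fraction * len(ranked))
    return {
        name: (high_bits if rank < n_high else low_bits)
        for rank, name in enumerate(ranked)
    }


# Hypothetical scores in which the first and last layers are most sensitive.
scores = {"layer_0": 0.9, "layer_1": 0.1, "layer_2": 0.2, "layer_3": 0.8}
plan = assign_bit_widths(scores)
average_bits = sum(plan.values()) / len(plan)
```

How the sensitivity score is computed is what distinguishes the three methods in the abstract; the allocation step shown here is deliberately generic.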

Title: HunyuanVideo: A Systematic Framework For Large Video Generative Models

Authors: Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Aladdin Wang, Andong Wang, Bai Jiawang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Junkun Yuan, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yanxin Long, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Daquan Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Caesar Zhong (Refer to the report for detailed contributions)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03603
Pdf URL: https://arxiv.org/pdf/2412.03603
Copy Paste: [[2412.03603]] HunyuanVideo: A Systematic Framework For Large Video Generative Models(https://arxiv.org/abs/2412.03603)
Keywords: generative
Abstract: Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.

Title: CBEval: A framework for evaluating and interpreting cognitive biases in LLMs

Authors: Ammar Shaikh, Raj Abhijit Dandekar, Sreedath Panat, Rajat Dandekar
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2412.03605
Pdf URL: https://arxiv.org/pdf/2412.03605
Copy Paste: [[2412.03605]] CBEval: A framework for evaluating and interpreting cognitive biases in LLMs(https://arxiv.org/abs/2412.03605)
Keywords: large language model
Abstract: Rapid advancements in Large Language Models (LLMs) have significantly enhanced their reasoning capabilities. Despite improved performance on benchmarks, LLMs exhibit notable gaps in their cognitive processes. Additionally, as reflections of human-generated data, these models have the potential to inherit cognitive biases, raising concerns about their reasoning and decision-making capabilities. In this paper we present a framework to interpret, understand and provide insights into a host of cognitive biases in LLMs. Conducting our research on frontier language models, we are able to elucidate reasoning limitations and biases, and provide reasoning behind these biases by constructing influence graphs that identify phrases and words most responsible for biases manifested in LLMs. We further investigate biases such as round number bias and cognitive bias barrier revealed when noting framing effect in language models.

Title: Multimodal Sentiment Analysis Based on BERT and ResNet

Authors: JiaLe Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.03625
Pdf URL: https://arxiv.org/pdf/2412.03625
Copy Paste: [[2412.03625]] Multimodal Sentiment Analysis Based on BERT and ResNet(https://arxiv.org/abs/2412.03625)
Keywords: extraction
Abstract: With the rapid development of the Internet and social media, multi-modal data (text and image) is increasingly important in sentiment analysis tasks. However, existing methods struggle to fuse text and image features effectively, which limits the accuracy of analysis. To solve this problem, a multimodal sentiment analysis framework combining BERT and ResNet is proposed. BERT has shown strong text representation ability in natural language processing, and ResNet has excellent image feature extraction performance in the field of computer vision. First, BERT is used to extract the text feature vector, and ResNet is used to extract the image feature representation. Then, a variety of feature fusion strategies are explored, and finally the fusion model based on the attention mechanism is selected to make full use of the complementary information between text and image. Experimental results on the public dataset MAVA-single show that, compared with the single-modal models that only use BERT or ResNet, the proposed multi-modal model improves the accuracy and F1 score, reaching the best accuracy of 74.5%. This study not only provides new ideas and methods for multimodal sentiment analysis, but also demonstrates the application potential of BERT and ResNet in cross-domain fusion. In the future, more advanced feature fusion techniques and optimization strategies will be explored to further improve the accuracy and generalization ability of multimodal sentiment analysis.
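
One simple form of attention-based fusion is sketched below. The abstract does not specify the paper's fusion architecture, so this is an assumption: each modality vector is scored against a `query` vector (learned in practice, fixed here), the scores are normalized with a softmax, and the fused feature is the weighted sum. The short vectors stand in for BERT and ResNet outputs projected to a common dimension.

```python
import math


def attention_fuse(text_vec, image_vec, query):
    """Fuse two same-length feature vectors with softmax attention weights.

    Returns (fused_vector, [text_weight, image_weight]).
    """
    # Dot-product relevance score of each modality against the query.
    scores = [
        sum(q * x for q, x in zip(query, vec))
        for vec in (text_vec, image_vec)
    ]
    # Numerically stable softmax over the two scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the two modality vectors.
    fused = [
        weights[0] * t + weights[1] * i
        for t, i in zip(text_vec, image_vec)
    ]
    return fused, weights


# Toy 2-dimensional features; a zero query weights both modalities equally.
fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], query=[0.0, 0.0])
```

The weights let the model lean on whichever modality is more informative for a given sample, which is the complementary-information effect the abstract refers to.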

Title: Evaluating Single Event Upsets in Deep Neural Networks for Semantic Segmentation: an embedded system perspective

Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Subjects: cs.CV, cs.AI, cs.AR, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03630
Pdf URL: https://arxiv.org/pdf/2412.03630
Copy Paste: [[2412.03630]] Evaluating Single Event Upsets in Deep Neural Networks for Semantic Segmentation: an embedded system perspective(https://arxiv.org/abs/2412.03630)
Keywords: robust, segmentation
Abstract: As the deployment of artificial intelligence (AI) algorithms at edge devices becomes increasingly prevalent, enhancing the robustness and reliability of autonomous AI-based perception and decision systems is becoming as relevant as precision and performance, especially in application areas considered safety-critical such as autonomous driving and aerospace. This paper delves into the robustness assessment in embedded Deep Neural Networks (DNNs), particularly focusing on the impact of parameter perturbations produced by single event upsets (SEUs) on convolutional neural networks (CNNs) for image semantic segmentation. By scrutinizing the layer-by-layer and bit-by-bit sensitivity of various encoder-decoder models to soft errors, this study thoroughly investigates the vulnerability of segmentation DNNs to SEUs and evaluates the consequences of techniques like model pruning and parameter quantization on the robustness of compressed models aimed at embedded implementations. The findings offer valuable insights into the mechanisms underlying SEU-induced failures that allow for evaluating the robustness of DNNs once trained in advance. Moreover, based on the collected data, we propose a set of practical lightweight error mitigation techniques with no memory or computational cost suitable for resource-constrained deployments. The code used to perform the fault injection (FI) campaign is available at this https URL, while the code to implement proposed techniques is available at this https URL.

Title: MV-Adapter: Multi-view Consistent Image Generation Made Easy

Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03632
Pdf URL: https://arxiv.org/pdf/2412.03632
Copy Paste: [[2412.03632]] MV-Adapter: Multi-view Consistent Image Generation Made Easy(https://arxiv.org/abs/2412.03632)
Keywords: diffusion
Abstract: Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.

Title: Explainable Malware Detection through Integrated Graph Reduction and Learning Techniques

Authors: Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03634
Pdf URL: https://arxiv.org/pdf/2412.03634
Copy Paste: [[2412.03634]] Explainable Malware Detection through Integrated Graph Reduction and Learning Techniques(https://arxiv.org/abs/2412.03634)
Keywords: interpretability
Abstract: Control Flow Graphs and Function Call Graphs have become pivotal in providing a detailed understanding of program execution and effectively characterizing the behavior of malware. These graph-based representations, when combined with Graph Neural Networks (GNN), have shown promise in developing high-performance malware detectors. However, challenges remain due to the large size of these graphs and the inherent opacity in the decision-making process of GNNs. This paper addresses these issues by developing several graph reduction techniques to reduce graph size and applying the state-of-the-art GNNExplainer to enhance the interpretability of GNN outputs. The analysis demonstrates that integrating our proposed graph reduction technique along with GNNExplainer in the malware detection framework significantly reduces graph size while preserving high performance, providing an effective balance between efficiency and transparency in malware detection.

Title: Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Authors: Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.03665
Pdf URL: https://arxiv.org/pdf/2412.03665
Copy Paste: [[2412.03665]] Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis(https://arxiv.org/abs/2412.03665)
Keywords: large language model
Abstract: The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.

Title: Hyperparameter Tuning Through Pessimistic Bilevel Optimization

Authors: Meltem Apaydin Ustun, Liang Xu, Bo Zeng, Xiaoning Qian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.03666
Pdf URL: https://arxiv.org/pdf/2412.03666
Copy Paste: [[2412.03666]] Hyperparameter Tuning Through Pessimistic Bilevel Optimization(https://arxiv.org/abs/2412.03666)
Keywords: robust
Abstract: Automated hyperparameter search in machine learning, especially for deep learning models, is typically formulated as a bilevel optimization problem, with hyperparameter values determined by the upper level and the model learning achieved by the lower-level problem. Most of the existing bilevel optimization solutions either assume the uniqueness of the optimal training model given hyperparameters or adopt an optimistic view when the non-uniqueness issue emerges. Potential model uncertainty may arise when training complex models with limited data, especially when the uniqueness assumption is violated. Thus, the suitability of the optimistic view underlying current bilevel hyperparameter optimization solutions is questionable. In this paper, we propose pessimistic bilevel hyperparameter optimization to assure appropriate outer-level hyperparameters to better generalize the inner-level learned models, by explicitly incorporating potential uncertainty of the inner-level solution set. To solve the resulting computationally challenging pessimistic bilevel optimization problem, we develop a novel relaxation-based approximation method. It derives pessimistic solutions with more robust prediction models. In our empirical studies of automated hyperparameter search for binary linear classifiers, pessimistic solutions have demonstrated better prediction performances than optimistic counterparts when we have limited training data or perturbed testing data, showing the necessity of considering pessimistic solutions besides existing optimistic ones.
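
The difference between the optimistic and pessimistic views can be written out explicitly. The notation below is an assumption for illustration and is not taken from the abstract: \lambda denotes the hyperparameters, w the model parameters, f the lower-level (training) objective, F the upper-level (validation) objective, and S(\lambda) the set of optimal lower-level solutions.

```latex
% Lower-level solution set: all models that are optimal for the training objective
S(\lambda) = \operatorname*{arg\,min}_{w} f(\lambda, w)

% Optimistic bilevel problem: assume the most favorable model in S(\lambda)
\min_{\lambda} \; \min_{w \in S(\lambda)} F(\lambda, w)

% Pessimistic bilevel problem: guard against the least favorable model in S(\lambda)
\min_{\lambda} \; \max_{w \in S(\lambda)} F(\lambda, w)
```

When S(\lambda) contains a single model the two problems coincide; they differ only when the lower-level solution is not unique, which is the case the abstract targets.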

Title: Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings

Authors: Guy Barel, Oren Tsur, Dan Volenchik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.03681
Pdf URL: https://arxiv.org/pdf/2412.03681
Copy Paste: [[2412.03681]] Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings(https://arxiv.org/abs/2412.03681)
Keywords: transformer
Abstract: Stance detection plays a pivotal role in enabling an extensive range of downstream applications, from discourse parsing to tracing the spread of fake news and the denial of scientific facts. While most stance classification models rely on textual representation of the utterance in question, prior work has demonstrated the importance of the conversational context in stance detection. In this work we introduce TASTE -- a multimodal architecture for stance detection that harmoniously fuses Transformer-based content embedding with unsupervised structural embedding. Through the fine-tuning of a pretrained transformer and the amalgamation with social embedding via a Gated Residual Network (GRN) layer, our model adeptly captures the complex interplay between content and conversational structure in determining stance. TASTE achieves state-of-the-art results on common benchmarks, significantly outperforming an array of strong baselines. Comparative evaluations underscore the benefits of social grounding -- emphasizing the criticality of concurrently harnessing both content and structure for enhanced stance detection.

Title: Designing DNNs for a trade-off between robustness and processing performance in embedded devices

Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Subjects: cs.LG, cs.AI, cs.AR, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03682
Pdf URL: https://arxiv.org/pdf/2412.03682
Copy Paste: [[2412.03682]] Designing DNNs for a trade-off between robustness and processing performance in embedded devices(https://arxiv.org/abs/2412.03682)
Keywords: robust, segmentation
Abstract: Machine learning-based embedded systems employed in safety-critical applications such as aerospace and autonomous driving need to be robust against perturbations produced by soft errors. Soft errors are an increasing concern in modern digital processors since smaller transistor geometries and lower voltages give electronic devices a higher sensitivity to background radiation. The resilience of deep neural network (DNN) models to perturbations in their parameters is determined, to a large extent, by the structure of the model itself, and also by the chosen numerical representation and arithmetic precision. When compression techniques such as model pruning and model quantization are applied to reduce memory footprint and computational complexity for deployment, both model structure and numerical representation are modified and thus, soft error robustness also changes. In this sense, although the choice of activation functions (AFs) in DNN models is frequently ignored, it conditions not only their accuracy and trainability, but also compressibility rates and numerical robustness. This paper investigates the suitability of using bounded AFs to improve model robustness against DNN parameter perturbations, assessing at the same time the impact of this choice on deployment in terms of model accuracy, compressibility, and computational burden. In particular, we analyze encoder-decoder fully convolutional models aimed at performing semantic segmentation tasks on hyperspectral images for scene understanding in autonomous driving. Deployment characterization is performed experimentally on an AMD-Xilinx KV260 SoM.

Title: Interpretable Hierarchical Attention Network for Medical Condition Identification

Authors: Dongping Fang, Lian Duan, Xiaojing Yuan, Allyn Klunder, Kevin Tan, Suiting Cao, Yeqing Ji, Mike Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.03701
Pdf URL: https://arxiv.org/pdf/2412.03701
Copy Paste: [[2412.03701]] Interpretable Hierarchical Attention Network for Medical Condition Identification(https://arxiv.org/abs/2412.03701)
Keywords: interpretability
Abstract: Accurate prediction of medical conditions directly from past clinical evidence is a long-sought goal in the medical management and health insurance field. Although great progress has been made with machine learning algorithms, the medical community is still skeptical about the model accuracy and interpretability. This paper presents an innovative hierarchical attention deep learning model to achieve better prediction and clear interpretability that can be easily understood by medical professionals. This paper develops an Interpretable Hierarchical Attention Network (IHAN). IHAN uses a hierarchical attention structure that matches naturally with the medical history data structure and reflects the patient's encounter (date of service) sequence. The model attention structure consists of 3 levels: (1) attention on the medical code types (diagnosis codes, procedure codes, lab test results, and prescription drugs), (2) attention on the sequential medical encounters within a type, (3) attention on the individual medical codes within an encounter and type. This model is applied to predict the occurrence of stage 3 chronic kidney disease (CKD), using three years of medical history of Medicare Advantage (MA) members from an American nationwide health insurance company. The model takes members' medical events, both claims and Electronic Medical Records (EMR) data, as input, makes a prediction of stage 3 CKD and calculates the contribution from individual events to the predicted outcome.

Title: Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Authors: Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching Lin, Lin Kevin, Huang Furong, Wang Lijuan
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03704
Pdf URL: https://arxiv.org/pdf/2412.03704
Copy Paste: [[2412.03704]] Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension(https://arxiv.org/abs/2412.03704)
Keywords: large language model
Abstract: Despite significant advancements in vision-language models (VLMs), effective approaches to enhance response quality by scaling inference-time computation are lacking. This capability is known to be a core step towards the self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at this https URL.
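
Sentence-level, value-guided search of the kind described can be sketched generically. This is an assumed outline rather than the released VisVM code: `propose` and `value` are hypothetical callables standing in for the VLM's candidate sampler and the value model, and the toy versions below exist only to make the loop runnable.

```python
def value_guided_search(propose, value, max_steps):
    """Build a response sentence by sentence, keeping the best candidate.

    propose(response) -> list of candidate next sentences (empty list to stop).
    value(response, sentence) -> hypothetical long-term value of appending
    `sentence` to the partial `response` (higher is better).
    """
    response = []
    for _ in range(max_steps):
        candidates = propose(response)
        if not candidates:
            break
        # Keep the candidate that the value function scores highest.
        best = max(candidates, key=lambda sentence: value(response, sentence))
        response.append(best)
    return response


# Toy stand-ins: two fixed candidates per step, value = sentence length
# as a crude proxy for "richer detail".
def toy_propose(response):
    return ["A cat.", "A cat sleeps on a red sofa."] if len(response) < 2 else []


def toy_value(response, sentence):
    return len(sentence)


caption = value_guided_search(toy_propose, toy_value, max_steps=5)
```

Greedy decoding corresponds to proposing a single candidate per step; the search only helps when several candidates are sampled and the value function can tell them apart.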

Title: Fairness without Demographics through Learning Graph of Gradients

Authors: Yingtao Luo, Zhixun Li, Qiang Liu, Jun Zhu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.03706
Pdf URL: https://arxiv.org/pdf/2412.03706
Copy Paste: [[2412.03706]] Fairness without Demographics through Learning Graph of Gradients(https://arxiv.org/abs/2412.03706)
Keywords: privacy, robust, fair
Abstract: Machine learning systems are notoriously prone to biased predictions about certain demographic groups, leading to algorithmic fairness issues. Due to privacy concerns and data quality problems, some demographic information may not be available in the training data, and the complex interaction of different demographics can lead to many unknown minority subpopulations, all of which limit the applicability of group fairness. Many existing works on fairness without demographics assume a correlation between groups and features. However, we argue that the model gradients are also valuable for fairness without demographics. In this paper, we show that the correlation between gradients and groups can help identify and improve group fairness. With an adversarial weighting architecture, we construct a graph where samples with similar gradients are connected and learn the weights of different samples from it. Unlike surrogate grouping methods that cluster groups from features and labels as a proxy sensitive attribute, our method leverages the graph structure as a soft grouping mechanism, which is much more robust to noise. The results show that our method is robust to noise and can improve fairness significantly without decreasing the overall accuracy too much.

Title: Securing RC Based P2P Networks: A Blockchain-based Access Control Framework utilizing Ethereum Smart Contracts for IoT and Web 3.0

Authors: Saurav Ghosh, Reshmi Mitra, Indranil Roy, Bidyut Gupta
Subjects: cs.CR, cs.DC, cs.NI, eess.SY
Abstract URL: https://arxiv.org/abs/2412.03709
Pdf URL: https://arxiv.org/pdf/2412.03709
Copy Paste: [[2412.03709]] Securing RC Based P2P Networks: A Blockchain-based Access Control Framework utilizing Ethereum Smart Contracts for IoT and Web 3.0(https://arxiv.org/abs/2412.03709)
Keywords: security
Abstract: Ensuring security for highly dynamic peer-to-peer (P2P) networks has always been a challenge, especially for services like online transactions and smart devices. These networks experience high churn rates, making it difficult to maintain appropriate access control. Traditional systems, particularly Role-Based Access Control (RBAC), often fail to meet the needs of a P2P environment. This paper presents a blockchain-based access control framework that uses Ethereum smart contracts to address these challenges. Our framework aims to close the gaps in existing access control systems by providing flexible, transparent, and decentralized security solutions. The proposed framework includes access control contracts (ACC) that manage access based on static and dynamic policies, a Judge Contract (JC) to handle misbehavior, and a Register Contract (RC) to record and manage the interactions between ACCs and JC. The security model combines impact and severity-based threat assessments using the CIA (Confidentiality, Integrity, Availability) and STRIDE principles, ensuring responses are tailored to different threat levels. This system not only stabilizes the fundamental issues of peer membership but also offers a scalable solution, particularly valuable in areas such as the Internet of Things (IoT) and Web 3.0 technologies.

Title: PathletRL++: Optimizing Trajectory Pathlet Extraction and Dictionary Formation via Reinforcement Learning

Authors: Gian Alix, Arian Haghparast, Manos Papagelis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03715
Pdf URL: https://arxiv.org/pdf/2412.03715
Copy Paste: [[2412.03715]] PathletRL++: Optimizing Trajectory Pathlet Extraction and Dictionary Formation via Reinforcement Learning(https://arxiv.org/abs/2412.03715)
Keywords: extraction
Abstract: Advances in tracking technologies have spurred the rapid growth of large-scale trajectory data. Building a compact collection of pathlets, referred to as a trajectory pathlet dictionary, is essential for supporting mobility-related applications. Existing methods typically adopt a top-down approach, generating numerous candidate pathlets and selecting a subset, leading to high memory usage and redundant storage from overlapping pathlets. To overcome these limitations, we propose a bottom-up strategy that incrementally merges basic pathlets to build the dictionary, reducing memory requirements by up to 24,000 times compared to baseline methods. The approach begins with unit-length pathlets and iteratively merges them while optimizing utility, which is defined using newly introduced metrics of trajectory loss and representability. We develop a deep reinforcement learning framework, PathletRL, which utilizes Deep Q-Networks (DQN) to approximate the utility function, resulting in a compact and efficient pathlet dictionary. Experiments on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art techniques, reducing the size of the constructed dictionary by up to 65.8%. Additionally, our results show that only half of the dictionary pathlets are needed to reconstruct 85% of the original trajectory data. Building on PathletRL, we introduce PathletRL++, which extends the original model by incorporating a richer state representation and an improved reward function to optimize decision-making during pathlet merging. These enhancements enable the agent to gain a more nuanced understanding of the environment, leading to higher-quality pathlet dictionaries. PathletRL++ achieves even greater dictionary size reduction, surpassing the performance of PathletRL, while maintaining high trajectory representability.

Title: A Water Efficiency Dataset for African Data Centers

Authors: Noah Shumba, Opelo Tshekiso, Pengfei Li, Giulia Fanti, Shaolei Ren
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2412.03716
Pdf URL: https://arxiv.org/pdf/2412.03716
Copy Paste: [[2412.03716]] A Water Efficiency Dataset for African Data Centers(https://arxiv.org/abs/2412.03716)
Keywords: large language model
Abstract: AI computing and data centers consume a large amount of freshwater, both directly for cooling and indirectly for electricity generation. While most attention has been paid to developed countries such as the U.S., this paper presents the first-of-its-kind dataset that combines nation-level weather and electricity generation data to estimate water usage efficiency for data centers in 41 African countries across five different climate regions. We also use our dataset to evaluate and estimate the water consumption of inference on two large language models (i.e., Llama-3-70B and GPT-4) in 11 selected African countries. Our findings show that writing a 10-page report using Llama-3-70B could consume about 0.7 liters of water, while the water consumption by GPT-4 for the same task may go up to about 60 liters. For writing a medium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about 0.13 liters and 3 liters of water, respectively. Interestingly, given the same AI model, 8 out of the 11 selected African countries consume less water than the global average, mainly because of lower water intensities for electricity generation. However, water consumption can be substantially higher in some African countries with a steppe climate than the U.S. and global averages, prompting more attention when deploying AI computing in these countries. Our dataset is publicly available on Hugging Face (this https URL).

Title: Electrocardiogram-based diagnosis of liver diseases: an externally validated and explainable machine learning approach

Authors: Juan Miguel Lopez Alcaraz, Wilhelm Haverkamp, Nils Strodthoff
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2412.03717
Pdf URL: https://arxiv.org/pdf/2412.03717
Copy Paste: [[2412.03717]] Electrocardiogram-based diagnosis of liver diseases: an externally validated and explainable machine learning approach(https://arxiv.org/abs/2412.03717)
Keywords: robust, explainability
Abstract: Background: Liver diseases are a major global health concern, often diagnosed using resource-intensive methods. Electrocardiogram (ECG) data, widely accessible and non-invasive, offers potential as a diagnostic tool for liver diseases, leveraging the physiological connections between cardiovascular and hepatic health. Methods: This study applies machine learning models to ECG data for the diagnosis of liver diseases. The pipeline, combining tree-based models with Shapley values for explainability, was trained, internally validated, and externally validated on an independent cohort, demonstrating robust generalizability. Findings: Our results demonstrate the potential of ECG to derive biomarkers to diagnose liver diseases. Shapley values revealed key ECG features contributing to model predictions, highlighting already known connections between cardiovascular biomarkers and hepatic conditions as well as providing new ones. Furthermore, our approach holds promise as a scalable and affordable solution for liver disease detection, particularly in resource-limited settings. Interpretation: This study underscores the feasibility of leveraging ECG features and machine learning to enhance the diagnosis of liver diseases. By providing interpretable insights into cardiovascular-liver interactions, the approach bridges existing gaps in non-invasive diagnostics, offering implications for broader systemic disease monitoring.

Title: VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Authors: Chaoyu Li, Eun Woo Im, Pooyan Fazli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03735
Pdf URL: https://arxiv.org/pdf/2412.03735
Copy Paste: [[2412.03735]] VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding(https://arxiv.org/abs/2412.03735)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, the problem of hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that the visual encoder of MLLMs often struggles to differentiate between video pairs that are visually distinct but semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding tasks. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. VidHalluc consists of 5,002 videos, paired based on semantic similarity and visual differences, focusing on cases where hallucinations are most likely to occur. Through comprehensive testing, our experiments show that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency information from DINOv2 to reweight visual features during inference. Our results demonstrate that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations among all tasks. Both the VidHalluc benchmark and DINO-HEAL code can be accessed via this https URL.

Title: Domain-specific Question Answering with Hybrid Search

Authors: Dewang Sultania, Zhaoyu Lu, Twisha Naik, Franck Dernoncourt, David Seunghyun Yoon, Sanat Sharma, Trung Bui, Ashok Gupta, Tushar Vatsa, Suhas Suresha, Ishita Verma, Vibha Belavadi, Cheng Chen, Michael Friedrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.03736
Pdf URL: https://arxiv.org/pdf/2412.03736
Copy Paste: [[2412.03736]] Domain-specific Question Answering with Hybrid Search(https://arxiv.org/abs/2412.03736)
Keywords: robust
Abstract: Domain specific question answering is an evolving field that requires specialized solutions to address unique challenges. In this paper, we show that a hybrid approach combining a fine-tuned dense retriever with keyword based sparse search methods significantly enhances performance. Our system leverages a linear combination of relevance signals, including cosine similarity from dense retrieval, BM25 scores, and URL host matching, each with tunable boost parameters. Experimental results indicate that this hybrid method outperforms our single-retriever system, achieving improved accuracy while maintaining robust contextual grounding. These findings suggest that integrating multiple retrieval methodologies with weighted scoring effectively addresses the complexities of domain specific question answering in enterprise settings.
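
The linear combination of relevance signals described above can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the `Candidate` fields, the weight names and default values, and the saturating BM25 normalization `bm25 / (bm25 + bm25_scale)` are all assumptions made here for the example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Candidate:
    url: str
    cosine: float  # dense-retriever cosine similarity (assumed precomputed)
    bm25: float    # raw, non-negative BM25 score (assumed precomputed)


def hybrid_score(c, preferred_hosts, w_dense=1.0, w_bm25=0.5, w_host=0.2,
                 bm25_scale=10.0):
    """Weighted sum of three relevance signals with tunable boosts.

    The BM25 squashing into [0, 1) is an assumption of this sketch so that
    the three signals live on comparable scales.
    """
    bm25_norm = c.bm25 / (c.bm25 + bm25_scale)
    host_match = 1.0 if urlparse(c.url).netloc in preferred_hosts else 0.0
    return w_dense * c.cosine + w_bm25 * bm25_norm + w_host * host_match


def rank(candidates, preferred_hosts, **weights):
    """Sort candidates by descending hybrid score."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c, preferred_hosts, **weights),
        reverse=True,
    )
```

With these hypothetical defaults, a document with a slightly lower cosine similarity can still rank first if it also has a strong keyword match and comes from a preferred host, which is the behaviour the boost parameters are meant to control.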

Title: Multi-view Image Diffusion via Coordinate Noise and Fourier Attention

Authors: Justin Theiss, Norman Müller, Daeil Kim, Aayush Prakash
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03756
Pdf URL: https://arxiv.org/pdf/2412.03756
Copy Paste: [[2412.03756]] Multi-view Image Diffusion via Coordinate Noise and Fourier Attention(https://arxiv.org/abs/2412.03756)
Keywords: diffusion
Abstract: Recently, text-to-image generation with diffusion models has made significant advancements in both fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism, as well as a novel noise initialization technique and a cross-attention loss. This Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.

Title: Advancing Auto-Regressive Continuation for Video Frames

Authors: Ruibo Ming, Jingwei Wu, Zhewei Huang, Zhuoxuan Ju, Jianming HU, Lihui Peng, Shuchang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03758
Pdf URL: https://arxiv.org/pdf/2412.03758
Copy Paste: [[2412.03758]] Advancing Auto-Regressive Continuation for Video Frames(https://arxiv.org/abs/2412.03758)
Keywords: large language model
Abstract: Recent advances in auto-regressive large language models (LLMs) have shown their potential in generating high-quality text, inspiring researchers to apply them to image and video generation. This paper explores the application of LLMs to video continuation, a task essential for building world models and predicting future frames. In this paper, we tackle challenges including preventing degeneration in long-term frame generation and enhancing the quality of generated images. We design a scheme named ARCON, which involves training our model to alternately generate semantic tokens and RGB tokens, enabling the LLM to explicitly learn and predict the high-level structural information of the video. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance the visual quality of the generated videos. Quantitative and qualitative experiments in autonomous driving scenarios demonstrate our model can consistently generate long videos.

Title: Language Model Meets Prototypes: Towards Interpretable Text Classification Models through Prototypical Networks

Authors: Ximing Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03761
Pdf URL: https://arxiv.org/pdf/2412.03761
Copy Paste: [[2412.03761]] Language Model Meets Prototypes: Towards Interpretable Text Classification Models through Prototypical Networks(https://arxiv.org/abs/2412.03761)
Keywords: interpretability, transformer
Abstract: Pretrained transformer-based Language Models (LMs) are well-known for their ability to achieve significant improvement on NLP tasks, but their black-box nature, which leads to a lack of interpretability, has been a major concern. My dissertation focuses on developing intrinsically interpretable models when using LMs as encoders while maintaining their superior performance via prototypical networks. I initiated my research by investigating enhancements in performance for interpretable models of sarcasm detection. My proposed approach focuses on capturing sentiment incongruity to enhance accuracy while offering instance-based explanations for the classification decisions. Later, I developed a novel white-box multi-head graph attention-based prototype network designed to explain the decisions of text classification models without sacrificing the accuracy of the original black-box LMs. In addition, I am working on extending the attention-based prototype network with contrastive learning to redesign an interpretable graph neural network, aiming to enhance both the interpretability and performance of the model in document classification.

Title: End to End Collaborative Synthetic Data Generation

Authors: Sikha Pentyala, Geetha Sitaraman, Trae Claar, Martine De Cock
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03766
Pdf URL: https://arxiv.org/pdf/2412.03766
Copy Paste: [[2412.03766]] End to End Collaborative Synthetic Data Generation(https://arxiv.org/abs/2412.03766)
Keywords: secure, privacy, federate
Abstract: The success of AI is based on the availability of data to train models. While in some cases a single data custodian may have sufficient data to enable AI, often multiple custodians need to collaborate to reach a cumulative size required for meaningful AI research. The latter is, for example, often the case for rare diseases, with each clinical site having data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing of synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.

Title: Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning

Authors: Yiran Wang, Chenshu Liu, Yunfan Li, Sanae Amani, Bolei Zhou, Lin F. Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.03767
Pdf URL: https://arxiv.org/pdf/2412.03767
Copy Paste: [[2412.03767]] Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning(https://arxiv.org/abs/2412.03767)
Keywords: robust
Abstract: The exploration & exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods achieved great success in tackling hard-exploration problems. However, they necessitate extensive hyperparameter tuning on different environments, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem via analysis of the agent behavior, concluding the fundamental difficulty of choosing a proper hyperparameter. We then identify the difficulty and the instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter robust exploration (Hyper), which extensively mitigates the problem by effectively regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that Hyper is provably efficient under function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.

Title: Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Authors: Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03773
Pdf URL: https://arxiv.org/pdf/2412.03773
Copy Paste: [[2412.03773]] Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration(https://arxiv.org/abs/2412.03773)
Keywords: interpretability, transformer
Abstract: The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps -- like MLP layers -- is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the "pizza" algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at this https URL.
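
To make the quadrature reading concrete, the toy sketch below sums one rectangle per "neuron" and compares the result with a closed form. The identity used, $\int_{-\pi}^{\pi} \mathrm{ReLU}(\cos(\phi - a))\cos\phi \, d\phi = \tfrac{\pi}{2}\cos a$, is chosen here only for illustration and is not claimed to be the identity analysed in the paper; the evenly spaced phases and function names are likewise assumptions.

```python
import math


def relu(x):
    return max(0.0, x)


def rectangle_quadrature(a, n_neurons=64):
    """Midpoint-rule estimate of the integral over [-pi, pi].

    Each hypothetical neuron i has a phase phi_i and contributes the area of
    one rectangle: height ReLU(cos(phi_i - a)) * cos(phi_i), width 2*pi/n.
    """
    width = 2.0 * math.pi / n_neurons
    total = 0.0
    for i in range(n_neurons):
        phi = -math.pi + (i + 0.5) * width  # midpoint of the i-th cell
        total += relu(math.cos(phi - a)) * math.cos(phi) * width
    return total


def closed_form(a):
    """Exact value of the integral: (pi / 2) * cos(a)."""
    return 0.5 * math.pi * math.cos(a)
```

Increasing `n_neurons` shrinks the gap between the rectangle sum and the closed form, which mirrors the idea that a wider layer behaves like a finer quadrature grid.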

Title: Coordinate In and Value Out: Training Flow Transformers in Ambient Space

Authors: Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03791
Pdf URL: https://arxiv.org/pdf/2412.03791
Copy Paste: [[2412.03791]] Coordinate In and Value Out: Training Flow Transformers in Ambient Space(https://arxiv.org/abs/2412.03791)
Keywords: transformer, generative
Abstract: Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on unstructured data like 3D point clouds. These models are commonly trained in two stages: first, a data compressor (i.e., a variational auto-encoder) is trained, and in a subsequent training stage a flow matching generative model is trained in the low-dimensional latent space of the data compressor. This two stage paradigm adds complexity to the overall training recipe and sets obstacles for unifying models across data domains, as specific data compressors are used for different data modalities. To this end, we introduce Ambient Space Flow Transformers (ASFT), a domain-agnostic approach to learn flow matching transformers in ambient space, sidestepping the requirement of training compressors and simplifying the training process. We introduce a conditionally independent point-wise training objective that enables ASFT to make predictions continuously in coordinate space. Our empirical results demonstrate that using general purpose transformer blocks, ASFT effectively handles different data modalities such as images and 3D point clouds, achieving strong performance in both domains and outperforming comparable approaches. ASFT is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

Title: Agent AI with LangGraph: A Modular Framework for Enhancing Machine Translation Using Large Language Models

Authors: Jialin Wang, Zhihua Duan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03801
Pdf URL: https://arxiv.org/pdf/2412.03801
Copy Paste: [[2412.03801]] Agent AI with LangGraph: A Modular Framework for Enhancing Machine Translation Using Large Language Models(https://arxiv.org/abs/2412.03801)
Keywords: large language model
Abstract: This paper explores the transformative role of Agent AI and LangGraph in advancing the automation and effectiveness of machine translation (MT). Agents are modular components designed to perform specific tasks, such as translating between particular languages, with specializations like TranslateEnAgent, TranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese translations, respectively. These agents leverage the powerful semantic capabilities of large language models (LLMs), such as GPT-4o, to ensure accurate, contextually relevant translations while maintaining modularity, scalability, and context retention. LangGraph, a graph-based framework built on LangChain, simplifies the creation and management of these agents and their workflows. It supports dynamic state management, enabling agents to maintain dialogue context and automates complex workflows by linking agents and facilitating their collaboration. With flexibility, open-source community support, and seamless integration with LLMs, LangGraph empowers agents to deliver high-quality translations. Together, Agent AI and LangGraph create a cohesive system where LangGraph orchestrates agent interactions, ensuring that user inputs are analyzed, routed, and processed efficiently. Experimental results demonstrate the potential of this system to enhance multilingual translation accuracy and scalability. By highlighting modular design and automated workflows, this paper sets the stage for further innovations in intelligent machine translation services.

Title: EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

Authors: Quang Nguyen, Truong Vu, Trong-Tung Nguyen, Yuxin Wen, Preston K Robinette, Taylor T Johnson, Tom Goldstein, Anh Tran, Khoi Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03809
Pdf URL: https://arxiv.org/pdf/2412.03809
Copy Paste: [[2412.03809]] EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM(https://arxiv.org/abs/2412.03809)
Keywords: diffusion, large language model
Abstract: Image editing technologies are tools used to transform, adjust, remove, or otherwise alter images. Recent research has significantly improved the capabilities of image editing tools, enabling the creation of photorealistic and semantically informed forged regions that are nearly indistinguishable from authentic imagery, presenting new challenges in digital forensics and media credibility. While current image forensic techniques are adept at localizing forged regions produced by traditional image manipulation methods, they struggle to localize regions created by diffusion-based techniques. To bridge this gap, we present a novel framework that integrates a multimodal Large Language Model (LLM) for enhanced reasoning capabilities to localize tampered regions in images produced by diffusion model-based editing methods. By leveraging the contextual and semantic strengths of LLMs, our framework achieves promising results on the MagicBrush, AutoSplice, and PerfBrush (a novel diffusion-based dataset) datasets, outperforming previous approaches in mIoU and F1-score metrics. Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits. Here, where traditional methods typically falter and achieve markedly low scores, our approach demonstrates promising performance.

Title: I$^2$OL-Net: Intra-Inter Objectness Learning Network for Point-Supervised X-Ray Prohibited Item Detection

Authors: Sanjoeng Wong, Yan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03811
Pdf URL: https://arxiv.org/pdf/2412.03811
Copy Paste: [[2412.03811]] I$^2$OL-Net: Intra-Inter Objectness Learning Network for Point-Supervised X-Ray Prohibited Item Detection(https://arxiv.org/abs/2412.03811)
Keywords: security
Abstract: Automatic detection of prohibited items in X-ray images plays a crucial role in public security. However, existing methods rely heavily on labor-intensive box annotations. To address this, we investigate X-ray prohibited item detection under labor-efficient point supervision and develop an intra-inter objectness learning network (I$^2$OL-Net). I$^2$OL-Net consists of two key modules: an intra-modality objectness learning (intra-OL) module and an inter-modality objectness learning (inter-OL) module. The intra-OL module designs a local focus Gaussian masking block and a global random Gaussian masking block to collaboratively learn the objectness in X-ray images. Meanwhile, the inter-OL module introduces the wavelet decomposition-based adversarial learning block and the objectness block, effectively reducing the modality discrepancy and transferring the objectness knowledge learned from natural images with box annotations to X-ray images. Based on the above, I$^2$OL-Net greatly alleviates the problem of part domination caused by severe intra-class variations in X-ray images. Experimental results on four X-ray datasets show that I$^2$OL-Net can achieve superior performance with a significant reduction of annotation cost, thus enhancing its accessibility and practicality.

Title: Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting

Authors: Guangben Lu, Yuzhen Du, Zhimin Sun, Ran Yi, Yifan Qi, Yizhe Tang, Tianyi Wang, Lizhuang Ma, Fangyuan Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03812
Pdf URL: https://arxiv.org/pdf/2412.03812
Copy Paste: [[2412.03812]] Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting(https://arxiv.org/abs/2412.03812)
Keywords: extraction, diffusion, transformer
Abstract: Foreground-conditioned inpainting aims to seamlessly fill the background region of an image by utilizing the provided foreground subject and a text description. While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. Firstly, we design a Self-Consistent Adapter that integrates the foreground subject features into the layout-related self-attention layer, which helps to alleviate conflicts between the text and subject features by ensuring that the model can effectively consider the foreground subject's characteristics while processing the overall image layout. Secondly, we design a Decoupled Image Feature Extraction method that employs distinct architectures to extract semantic and shape features separately, significantly improving subject feature extraction and ensuring high-quality preservation of the subject's shape. Thirdly, to ensure precise utilization of the extracted features and to focus attention on the subject region, we introduce a Shared Positional Embedding Anchor, greatly improving the model's understanding of subject features and boosting training efficiency. Extensive experiments demonstrate that our method achieves superior performance and efficiency in foreground-conditioned inpainting.

Title: Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration

Authors: Yuzhen Du, Teng Hu, Jiangning Zhang, Ran Yi, Chengming Xu, Xiaobin Hu, Kai Wu, Donghao Luo, Yabiao Wang, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03814
Pdf URL: https://arxiv.org/pdf/2412.03814
Copy Paste: [[2412.03814]] Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration(https://arxiv.org/abs/2412.03814)
Keywords: transformer
Abstract: Image Restoration aims to restore degraded images, with deep learning, especially CNNs and Transformers, enhancing performance. However, there's a lack of a unified training benchmark for IR. We identified a bias in image complexity between training and testing datasets, affecting restoration quality. To address this, we created ReSyn, a large-scale IR dataset with balanced complexity, including real and synthetic images. We also established a unified training standard for IR models. Our RWKV-IR model integrates linear complexity RWKV into transformers for global and local receptive fields. It replaces Q-Shift with Depth-wise Convolution for local dependencies and combines Bi-directional attention for global-local awareness. The Cross-Bi-WKV module balances horizontal and vertical attention. Experiments show RWKV-IR's effectiveness in image restoration.

Title: Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

Authors: Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03822
Pdf URL: https://arxiv.org/pdf/2412.03822
Copy Paste: [[2412.03822]] Beyond the Binary: Capturing Diverse Preferences With Reward Regularization(https://arxiv.org/abs/2412.03822)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly deployed via public-facing interfaces to interact with millions of users, each with diverse preferences. Despite this, preference tuning of LLMs predominantly relies on reward models trained using binary judgments where annotators select the preferred choice out of pairs of model outputs. In this work, we argue that this reliance on binary choices does not capture the broader, aggregate preferences of the target user in real-world tasks. We propose a taxonomy that identifies two dimensions of subjectivity where different users disagree on the preferred output, namely the Plurality of Responses to Prompts, where prompts allow for multiple correct answers, and the Indistinguishability of Responses, where candidate outputs are paraphrases of each other. We show that reward models correlate weakly with user preferences in these cases. As a first step to address this issue, we introduce a simple yet effective method that augments existing binary preference datasets with synthetic preference judgments to estimate potential user disagreement. Incorporating these via a margin term as a form of regularization during model training yields predictions that better align with the aggregate user preferences.
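
The margin term mentioned in the abstract above can be illustrated with a pairwise (Bradley-Terry style) reward-model loss. This is only a minimal sketch under stated assumptions: the abstract does not give the loss or say how the margin is derived from the synthetic judgments, so the function name `pairwise_loss` and the `margin` parameter are hypothetical and are not the authors' implementation.

```python
import math

def pairwise_loss(r_chosen, r_rejected, margin=0.0):
    """Hypothetical pairwise loss: -log(sigmoid(r_chosen - r_rejected - margin)).

    `margin` is an assumed per-pair value; the abstract does not specify
    how it is computed from the estimated user disagreement.
    """
    z = r_chosen - r_rejected - margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)

# Same reward gap, with and without a required margin.
loss_no_margin = pairwise_loss(1.0, 0.0)
loss_with_margin = pairwise_loss(1.0, 0.0, margin=0.5)
```

For a fixed reward gap, a larger margin gives a larger loss, so the margin controls how strongly the model is pushed to separate the two responses. One plausible use, again an assumption, is a smaller margin for pairs where annotators are expected to disagree.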

Title: Residual Hyperbolic Graph Convolution Networks

Authors: Yangkai Xue, Jindou Dai, Zhipeng Lu, Yuwei Wu, Yunde Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.03825
Pdf URL: https://arxiv.org/pdf/2412.03825
Copy Paste: [[2412.03825]] Residual Hyperbolic Graph Convolution Networks(https://arxiv.org/abs/2412.03825)
Keywords: extraction
Abstract: Hyperbolic graph convolutional networks (HGCNs) have demonstrated representational capabilities of modeling hierarchical-structured graphs. However, as in general GCNs, over-smoothing may occur as the number of model layers increases, limiting the representation capabilities of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome the over-smoothing problem, and also theoretically prove the effectiveness of the hyperbolic residual function. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from being indistinguishable. (2) Product manifolds in R-HGCNs have been set up with different origin points in different components to facilitate the extraction of feature information from a wider range of perspectives, which enhances the representing capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise into hyperbolic representations, such that perturbations can be added to alleviate the over-fitting problem without deconstructing the hyperbolic geometry. Experiment results demonstrate the effectiveness of R-HGCNs under various graph convolution layers and different structures of product manifolds.

Title: A large language model-type architecture for high-dimensional molecular potential energy surfaces

Authors: Xiao Zhu, Srinivasan S. Iyengar
Subjects: cs.LG, physics.atm-clus, physics.chem-ph, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2412.03831
Pdf URL: https://arxiv.org/pdf/2412.03831
Copy Paste: [[2412.03831]] A large language model-type architecture for high-dimensional molecular potential energy surfaces(https://arxiv.org/abs/2412.03831)
Keywords: generative, large language model
Abstract: Computing high-dimensional potential surfaces for molecular and materials systems is considered to be a great challenge in computational chemistry, with potential impact in a range of areas including fundamental prediction of reaction rates. In this paper we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces, etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 dimensions. Essentially, a family of neural networks that pertain to the graph-based subsystems gets the job done for this 51-dimensional system. We then ask if this same family of lower-dimensional neural networks can be transformed to provide accurate predictions for a 186-dimensional potential surface. We find that our algorithm does provide reasonably accurate results, with sub-kcal/mol accuracy, for this higher-dimensional potential surface problem.

Title: LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model

Authors: Yuan Xue, Qi Zhang, Chuanmin Jia, Shiqi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03841
Pdf URL: https://arxiv.org/pdf/2412.03841
Copy Paste: [[2412.03841]] LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model(https://arxiv.org/abs/2412.03841)
Keywords: robust, segmentation
Abstract: Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high-level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, like image restoration models, can help improve such quality, and thereby their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and LL tasks, the proposed LL-ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of downstream LL task models, achieving mutual adaptation for image codecs and LL task models. Furthermore, we integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. Therefore, one LL-ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, which includes extensive objective experiments by using both full and no-reference image quality assessments. Experimental results show that LL-ICM can achieve 22.65% BD-rate reductions over the state-of-the-art methods.

Title: CCxTrust: Confidential Computing Platform Based on TEE and TPM Collaborative Trust

Authors: Ketong Shang, Jiangnan Lin, Yu Qin, Muyan Shen, Hongzhan Ma, Wei Feng, Dengguo Feng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.03842
Pdf URL: https://arxiv.org/pdf/2412.03842
Copy Paste: [[2412.03842]] CCxTrust: Confidential Computing Platform Based on TEE and TPM Collaborative Trust(https://arxiv.org/abs/2412.03842)
Keywords: secure, security, privacy, protect
Abstract: Confidential Computing has emerged to address data security challenges in cloud-centric deployments by protecting data in use through hardware-level isolation. However, reliance on a single hardware root of trust (RoT) limits user confidence in cloud platforms, especially for high-performance AI services, where end-to-end protection of sensitive models and data is critical. Furthermore, the lack of interoperability and a unified trust model in multi-cloud environments prevents the establishment of a cross-platform, cross-cloud chain of trust, creating a significant trust gap for users with high privacy requirements. To address the challenges mentioned above, this paper proposes CCxTrust (Confidential Computing with Trust), a confidential computing platform leveraging collaborative roots of trust from TEE and TPM. CCxTrust combines the black-box RoT embedded in the CPU-TEE with the flexible white-box RoT of TPM to establish a collaborative trust framework. The platform implements independent Roots of Trust for Measurement (RTM) for TEE and TPM, and a collaborative Root of Trust for Report (RTR) for composite attestation. The Root of Trust for Storage (RTS) is solely supported by TPM. We also present the design and implementation of a confidential TPM supporting multiple modes for secure use within confidential virtual machines. Additionally, we propose a composite attestation protocol integrating TEE and TPM to enhance security and attestation efficiency, which is proven secure under the PCL protocol security model. We implemented a prototype of CCxTrust on a confidential computing server with AMD SEV-SNP and TPM chips, requiring minimal modifications to the TPM and guest Linux kernel. The composite attestation efficiency improved by 24% without significant overhead, while Confidential TPM performance showed a 16.47% reduction compared to standard TPM.

Title: HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting

Authors: Jingyu Lin, Jiaqi Gu, Lubin Fan, Bojian Wu, Yujing Lou, Renjie Chen, Ligang Liu, Jieping Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03844
Pdf URL: https://arxiv.org/pdf/2412.03844
Copy Paste: [[2412.03844]] HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting(https://arxiv.org/abs/2412.03844)
Keywords: robust
Abstract: Generating high-quality novel view renderings of 3D Gaussian Splatting (3DGS) in scenes featuring transient objects is challenging. We propose a novel hybrid representation, termed HybridGS, using 2D Gaussians for transient objects per image and maintaining traditional 3D Gaussians for the whole static scenes. Note that 3DGS itself is better suited for modeling static scenes that assume multi-view consistency, whereas transient objects appear only occasionally and do not adhere to this assumption; we therefore model them as planar objects from a single view, represented with 2D Gaussians. Our novel representation decomposes the scene from the perspective of fundamental viewpoint consistency, making it more reasonable. Additionally, we present a novel multi-view regulated supervision method for 3DGS that leverages information from co-visible regions, further enhancing the distinctions between the transients and statics. Then, we propose a straightforward yet effective multi-stage training strategy to ensure robust training and high-quality view synthesis across various settings. Experiments on benchmark datasets show our state-of-the-art performance of novel view synthesis in both indoor and outdoor scenes, even in the presence of distracting elements.

Title: Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration

Authors: Shiwen Ni, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.03847
Pdf URL: https://arxiv.org/pdf/2412.03847
Copy Paste: [[2412.03847]] Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration(https://arxiv.org/abs/2412.03847)
Keywords: security
Abstract: Intelligent dialogue systems are increasingly used in modern education and psychological counseling, but most existing systems are limited to a single domain, cannot deal with both educational and psychological issues, and often lack accuracy and professionalism when dealing with complex issues. To address these problems, this paper proposes an intelligent dialogue system that combines educational and psychological counseling functions. The system consists of multiple AI agents, including a security detection agent, an intent identification agent, an educational LLM agent, and a psychological LLM agent, which work in concert to ensure the provision of accurate educational knowledge Q&A and psychological support services. Specifically, the system recognizes user-input intentions through an intention classification model and invokes a retrieval-enhanced educational large model and a psychological large model fine-tuned with psychological data in order to provide professional educational advice and psychological support.

Title: Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer

Authors: Jayaprakash Sundararaj, Akhil Vyas, Benjamin Gonzalez-Maldonado
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.03853
Pdf URL: https://arxiv.org/pdf/2412.03853
Copy Paste: [[2412.03853]] Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer(https://arxiv.org/abs/2412.03853)
Keywords: transformer
Abstract: Converting mathematical expressions into LaTeX is challenging. In this paper, we explore using newer transformer-based architectures for addressing the problem of converting handwritten/digital mathematical expression images into equivalent LaTeX code. We use the current state-of-the-art CNN encoder and RNN decoder as a baseline for our experiments. We also investigate improvements to the CNN-RNN architecture by replacing the CNN encoder with the ResNet50 model. Our experiments show that transformer architectures achieve higher overall accuracy and BLEU scores along with lower Levenshtein scores compared to the baseline CNN/RNN architecture, with room to achieve even better results with appropriate fine-tuning of model parameters.
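
The Levenshtein score cited in the abstract above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, so lower is better. Below is a minimal sketch of the standard dynamic-programming computation. Whether the paper scores at the character level or the token level, and whether it normalizes by length, is not stated in the abstract; character level and no normalization are assumed here.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert/delete/substitute, cost 1 each)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Illustrative strings only (not data from the paper).
predicted = r"\frac{a}{b}"
reference = r"\frac{a}{c}"
distance = levenshtein(predicted, reference)
```

Because the function only indexes and compares elements, it also accepts lists of tokens, which would give a token-level distance instead.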

Title: CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Authors: Hui Zhang, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03859
Pdf URL: https://arxiv.org/pdf/2412.03859
Copy Paste: [[2412.03859]] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation(https://arxiv.org/abs/2412.03859)
Keywords: diffusion, transformer, large language model
Abstract: Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at this https URL.

Title: GP-FL: Model-Based Hessian Estimation for Second-Order Over-the-Air Federated Learning

Authors: Shayan Mohajer Hamidi, Ali Bereyhi, Saba Asaad, H. Vincent Poor
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.03867
Pdf URL: https://arxiv.org/pdf/2412.03867
Copy Paste: [[2412.03867]] GP-FL: Model-Based Hessian Estimation for Second-Order Over-the-Air Federated Learning(https://arxiv.org/abs/2412.03867)
Keywords: federate
Abstract: Second-order methods are widely adopted to improve the convergence rate of learning algorithms. In federated learning (FL), these methods require the clients to share their local Hessian matrices with the parameter server (PS), which comes at a prohibitive communication cost. A classical solution to this issue is to approximate the global Hessian matrix from the first-order information. Unlike in idealized networks, this solution does not perform effectively in over-the-air FL settings, where the PS receives noisy versions of the local gradients. This paper introduces a novel second-order FL framework tailored for wireless channels. The pivotal innovation lies in the PS's capability to directly estimate the global Hessian matrix from the received noisy local gradients via a non-parametric method: the PS models the unknown Hessian matrix as a Gaussian process, and then uses the temporal relation between the gradients and Hessian along with the channel model to find a stochastic estimator for the global Hessian matrix. We refer to this method as Gaussian process-based Hessian modeling for wireless FL (GP-FL) and show that it exhibits a linear-quadratic convergence rate. Numerical experiments on various datasets demonstrate that GP-FL outperforms all classical baseline first and second order FL approaches.

Title: CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Authors: Chu Myaet Thwal, Ye Lin Tun, Minh N. H. Nguyen, Eui-Nam Huh, Choong Seon Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03871
Pdf URL: https://arxiv.org/pdf/2412.03871
Copy Paste: [[2412.03871]] CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance(https://arxiv.org/abs/2412.03871)
Keywords: robust
Abstract: Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, it achieves a 5.5% gain on zero-shot ImageNet1K, along with gains of 10.7% (I2T) and 5.7% (T2I) on Flickr30K, compared to the original CLIP when using a ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING showcases strong transferability under the linear evaluation protocol across several downstream tasks.

Title: Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Authors: Jiangweizhi Peng, Zhiwei Tang, Gaowen Liu, Charles Fleming, Mingyi Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03876
Pdf URL: https://arxiv.org/pdf/2412.03876
Copy Paste: [[2412.03876]] Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization(https://arxiv.org/abs/2412.03876)
Keywords: attack, robust, diffusion
Abstract: Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.

Title: AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer

Authors: Davor Lauc, Attapol Rutherford, Weerin Wongwarawipatr
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03877
Pdf URL: https://arxiv.org/pdf/2412.03877
Copy Paste: [[2412.03877]] AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer(https://arxiv.org/abs/2412.03877)
Keywords: transformer
Abstract: This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.

Title: DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism

Authors: Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03878
Pdf URL: https://arxiv.org/pdf/2412.03878
Copy Paste: [[2412.03878]] DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism(https://arxiv.org/abs/2412.03878)
Keywords: diffusion, generative
Abstract: The proliferation of several streaming services in recent years has now made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to the local audience, the support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for a given media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g. skin tone, gender). Our approach is also useful for signer anonymization.

Title: Uniform Discretized Integrated Gradients: An effective attribution based method for explaining large language models

Authors: Swarnava Sinha Roy, Ayan Kundu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03886
Pdf URL: https://arxiv.org/pdf/2412.03886
Copy Paste: [[2412.03886]] Uniform Discretized Integrated Gradients: An effective attribution based method for explaining large language models(https://arxiv.org/abs/2412.03886)
Keywords: large language model
Abstract: Integrated Gradients is a well-known technique for explaining deep learning models. It calculates feature importance scores using a gradient-based approach, computing gradients of the model output with respect to input features and accumulating them along a linear path. While this works well for continuous feature spaces, it may not be the optimal way to deal with discrete spaces like word embeddings. For interpreting LLMs (Large Language Models), there exists a need for a non-linear path where intermediate points, whose gradients are to be computed, lie close to actual words in the embedding space. In this paper, we propose a method called Uniform Discretized Integrated Gradients (UDIG) based on a new interpolation strategy where we choose a favorable nonlinear path for computing attribution scores suitable for predictive language models. We evaluate our method on two types of NLP tasks, Sentiment Classification and Question Answering, against three metrics, viz. Log odds, Comprehensiveness, and Sufficiency. For sentiment classification, we have used the SST2, IMDb, and Rotten Tomatoes datasets for benchmarking, and for Question Answering, we have used the fine-tuned BERT model on the SQuAD dataset. Our approach outperforms the existing methods in almost all the metrics.
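
The baseline technique described in the abstract above, accumulating gradients along a linear path, can be sketched as follows. This illustrates standard Integrated Gradients only, not UDIG, whose nonlinear interpolation strategy the abstract does not detail. The toy `model` and its analytic gradient `model_grad` are assumptions chosen so the example runs without an autodiff library.

```python
import math

def model(x):
    # Toy differentiable scalar "model" (an assumption for illustration).
    return sum(math.tanh(v) + 0.5 * v * v for v in x)

def model_grad(x):
    # Analytic gradient of `model` with respect to each input feature.
    return [1.0 - math.tanh(v) ** 2 + v for v in x]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight line baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point)):
            total[i] += g
    # Scale the average gradient by the input difference, feature by feature.
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, total)]

x = [1.0, -2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
# Completeness property: attributions sum to model(x) - model(baseline), up to discretization error.
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

With word embeddings, the intermediate `point` values on a straight line generally do not correspond to any real word, which is the limitation the abstract cites as motivation for a nonlinear path.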

Title: Machine Learning-based Android Intrusion Detection System

Authors: Madiha Tahreem, Ifrah Andleeb, Bilal Zahid Hussain, Arsalan Hameed
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03894
Pdf URL: https://arxiv.org/pdf/2412.03894
Copy Paste: [[2412.03894]] Machine Learning-based Android Intrusion Detection System(https://arxiv.org/abs/2412.03894)
Keywords: security, attack
Abstract: The Android operating system is installed on most smart devices. Intrusions into such operating systems are rising at a tremendous rate. With the introduction of such malicious data streams, smart devices are being subjected to various attacks such as Phishing, Spyware, SMS Fraud, Bots, Banking-Trojans, and many others. This paper applies machine learning classification algorithms to the security of Android APK files. Each APK data stream was marked as either malicious or non-malicious on the basis of different parameters. The machine learning classification techniques are then used to classify whether a newly installed application's signature falls within the malicious or non-malicious domain. If it falls within the malicious category, appropriate action can be taken, and the Android operating system can be shielded against illegal activities.

Title: A Noise is Worth Diffusion Guidance

Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03895
Pdf URL: https://arxiv.org/pdf/2412.03895
Copy Paste: [[2412.03895]] A Noise is Worth Diffusion Guidance(https://arxiv.org/abs/2412.03895)
Keywords: diffusion
Abstract: Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to 'guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose \ours, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: this https URL.

Title: Can Targeted Clean-Label Poisoning Attacks Generalize?

Authors: Zhizhen Chen, Subrat Kishore Dutta, Zhengyu Zhao, Chenhao Lin, Chao Shen, Xiao Zhang
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03908
Pdf URL: https://arxiv.org/pdf/2412.03908
Copy Paste: [[2412.03908]] Can Targeted Clean-Label Poisoning Attacks Generalize?(https://arxiv.org/abs/2412.03908)
Keywords: attack
Abstract: Targeted poisoning attacks aim to compromise the model's prediction on specific target samples. In a common clean-label setting, they are achieved by slightly perturbing a subset of training samples given access to those specific targets. Despite continuous efforts, it remains unexplored whether such attacks can generalize to unknown variations of those targets. In this paper, we take the first step to systematically study this generalization problem. Observing that the widely adopted, cosine similarity-based attack exhibits limited generalizability, we propose a well-generalizable attack that leverages both the direction and magnitude of model gradients. In particular, we explore diverse target variations, such as an object with varied viewpoints and an animal species with distinct appearances. Extensive experiments across various generalization scenarios demonstrate that our method consistently achieves the best attack effectiveness. For example, our method outperforms the cosine similarity-based attack by 20.95% in attack success rate with similar overall accuracy, averaged over four models on two image benchmark datasets. The code is available at this https URL

Title: Quantized and Interpretable Learning Scheme for Deep Neural Networks in Classification Task

Authors: Alireza Maleki, Mahsa Lavaei, Mohsen Bagheritabar, Salar Beigzad, Zahra Abadi
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.03915
Pdf URL: https://arxiv.org/pdf/2412.03915
Copy Paste: [[2412.03915]] Quantized and Interpretable Learning Scheme for Deep Neural Networks in Classification Task(https://arxiv.org/abs/2412.03915)
Keywords: interpretability
Abstract: Deep learning techniques have proven highly effective in image classification, but their deployment in resource-constrained environments remains challenging due to high computational demands. Furthermore, their interpretability is of high importance, which demands even more resources. In this work, we introduce an approach that combines saliency-guided training with quantization techniques to create an interpretable and resource-efficient model without compromising accuracy. We utilize Parameterized Clipping Activation (PACT) to perform quantization-aware training, specifically targeting activations and weights to optimize precision while minimizing resource usage. Concurrently, saliency-guided training is employed to enhance interpretability by iteratively masking features with low gradient values, leading to more focused and meaningful saliency maps. This training procedure helps in mitigating noisy gradients and yields models that provide clearer, more interpretable insights into their decision-making processes. To evaluate the impact of our approach, we conduct experiments using well-known Convolutional Neural Network (CNN) architectures on two popular benchmark datasets, MNIST and CIFAR-10. We compare the saliency maps generated by standard and quantized models to assess the influence of quantization on both interpretability and classification accuracy. Our results demonstrate that the combined use of saliency-guided training and PACT-based quantization not only maintains classification performance but also produces models that are significantly more efficient and interpretable, making them suitable for deployment in resource-limited settings.

Title: A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios

Authors: Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03920
Pdf URL: https://arxiv.org/pdf/2412.03920
Copy Paste: [[2412.03920]] A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios(https://arxiv.org/abs/2412.03920)
Keywords: large language model
Abstract: Game-theoretic scenarios have become pivotal in evaluating the social intelligence of Large Language Model (LLM)-based social agents. While numerous studies have explored these agents in such settings, there is a lack of a comprehensive survey summarizing the current progress. To address this gap, we systematically review existing research on LLM-based social agents within game-theoretic scenarios. Our survey organizes the findings into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework encompasses diverse game scenarios, ranging from choice-focusing to communication-focusing games. The social agent part explores agents' preferences, beliefs, and reasoning abilities. The evaluation protocol covers both game-agnostic and game-specific metrics for assessing agent performance. By reflecting on the current research and identifying future research directions, this survey provides insights to advance the development and evaluation of social agents in game-theoretic scenarios.

Title: Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications

Authors: Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew, Hui Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03924
Pdf URL: https://arxiv.org/pdf/2412.03924
Copy Paste: [[2412.03924]] Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications(https://arxiv.org/abs/2412.03924)
Keywords: secure, privacy, federate, generative
Abstract: With the rapid advancement of artificial intelligence and deep learning, medical image analysis has become a critical tool in modern healthcare, significantly improving diagnostic accuracy and efficiency. However, AI-based methods also raise serious privacy concerns, as medical images often contain highly sensitive patient information. This review offers a comprehensive overview of privacy-preserving techniques in medical image analysis, including encryption, differential privacy, homomorphic encryption, federated learning, and generative adversarial networks. We explore the application of these techniques across various medical image analysis tasks, such as diagnosis, pathology, and telemedicine. Notably, we organize the review based on specific challenges and their corresponding solutions in different medical image analysis applications, so that technical applications are directly aligned with practical issues, addressing gaps in the current research landscape. Additionally, we discuss emerging trends, such as zero-knowledge proofs and secure multi-party computation, offering insights for future research. This review serves as a valuable resource for researchers and practitioners and can help advance privacy preservation in medical image analysis.

Title: MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction

Authors: Mithun Parab, Pranay Lendave, Jiyoung Kim, Thi Quynh Dan Nguyen, Palash Ingle
Subjects: cs.CV, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03928
Pdf URL: https://arxiv.org/pdf/2412.03928
Copy Paste: [[2412.03928]] MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction(https://arxiv.org/abs/2412.03928)
Keywords: segmentation
Abstract: In image-assisted minimally invasive surgeries (MIS), understanding surgical scenes is vital for real-time feedback to surgeons, skill evaluation, and improving outcomes through collaborative human-robot procedures. Within this context, the challenge lies in accurately detecting, segmenting, and estimating the depth of surgical scenes depicted in high-resolution images, while simultaneously reconstructing the scene in 3D and providing segmentation of surgical instruments along with detection labels for each instrument. To address this challenge, a novel Multi-Task Learning (MTL) network is proposed for performing these tasks concurrently. A key aspect of this approach involves overcoming the optimization hurdles associated with handling multiple tasks concurrently by integrating an Adversarial Weight Update into the MTL framework. The proposed MTL model achieves 3D reconstruction through the integration of segmentation, depth estimation, and object detection, thereby enhancing the understanding of surgical scenes, which marks a significant advancement compared to existing studies that lack 3D capabilities. Comprehensive experiments on the EndoVis2018 benchmark dataset underscore the adeptness of the model in efficiently addressing all three tasks, demonstrating the efficacy of the proposed techniques.

Title: InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Authors: Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, Jiahui Huang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.03934
Pdf URL: https://arxiv.org/pdf/2412.03934
Copy Paste: [[2412.03934]] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models(https://arxiv.org/abs/2412.03934)
Keywords: generative
Abstract: We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.

Title: AIpparel: A Large Multimodal Generative Model for Digital Garments

Authors: Kiyohiro Nakayama, Jan Ackermann, Timur Levent Kesdogan, Yang Zheng, Maria Korosteleva, Olga Sorkine-Hornung, Leonidas J. Guibas, Guandao Yang, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03937
Pdf URL: https://arxiv.org/pdf/2412.03937
Copy Paste: [[2412.03937]] AIpparel: A Large Multimodal Generative Model for Digital Garments(https://arxiv.org/abs/2412.03937)
Keywords: protect, generative
Abstract: Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and enables novel multimodal garment generation applications such as interactive garment editing. The project website is at this http URL.

Title: Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization

Authors: Tianyu Chen, Zhendong Wang, Mingyuan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03941
Pdf URL: https://arxiv.org/pdf/2412.03941
Copy Paste: [[2412.03941]] Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization(https://arxiv.org/abs/2412.03941)
Keywords: diffusion
Abstract: Diffusion models have recently demonstrated notable success in solving inverse problems. However, current diffusion model-based solutions typically require a large number of function evaluations (NFEs) to generate high-quality images conditioned on measurements, as they incorporate only limited information at each step. To accelerate the diffusion-based inverse problem-solving process, we introduce Measurements Optimization (MO), a more efficient plug-and-play module for integrating measurement information at each step of the inverse problem-solving process. This method is comprehensively evaluated across eight diverse linear and nonlinear tasks on the FFHQ and ImageNet datasets. By using MO, we establish state-of-the-art (SOTA) performance across multiple tasks, with key advantages: (1) it operates with no more than 100 NFEs, with phase retrieval on ImageNet being the sole exception; (2) it achieves SOTA or near-SOTA results even at low NFE counts; and (3) it can be seamlessly integrated into existing diffusion model-based solutions for inverse problems, such as DPS (Chung et al., 2022) and Red-diff (Mardani et al., 2023). For example, DPS-MO attains a peak signal-to-noise ratio (PSNR) of 28.71 dB on the FFHQ 256 dataset for high dynamic range imaging, setting a new SOTA benchmark with only 100 NFEs, whereas current methods require between 1000 and 4000 NFEs for comparable performance.

Title: WACANA: A Concolic Analyzer for Detecting On-chain Data Vulnerabilities in WASM Smart Contracts

Authors: Wansen Wang, Caichang Tu, Zhaoyi Meng, Wenchao Huang, Yan Xiong
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2412.03946
Pdf URL: https://arxiv.org/pdf/2412.03946
Copy Paste: [[2412.03946]] WACANA: A Concolic Analyzer for Detecting On-chain Data Vulnerabilities in WASM Smart Contracts(https://arxiv.org/abs/2412.03946)
Keywords: security
Abstract: WebAssembly (WASM) has emerged as a crucial technology in smart contract development for several blockchain platforms. Unfortunately, since their introduction, WASM smart contracts have been subject to several security incidents caused by contract vulnerabilities, resulting in substantial economic losses. However, existing tools for detecting WASM contract vulnerabilities have accuracy limitations, one of the main reasons being the coarse-grained emulation of the on-chain data APIs. In this paper, we introduce WACANA, an analyzer for WASM contracts that accurately detects vulnerabilities through fine-grained emulation of on-chain data APIs. WACANA precisely simulates both the structure of on-chain data tables and their corresponding API functions, and integrates concrete and symbolic execution within a coverage-guided loop to balance accuracy and efficiency. Evaluations on a vulnerability dataset of 133 contracts show WACANA outperforming state-of-the-art tools in accuracy. Further validation on 5,602 real-world contracts confirms WACANA's practical effectiveness.

Title: BEFL: Balancing Energy Consumption in Federated Learning for Mobile Edge IoT

Authors: Zehao Ju, Tongquan Wei, Fuke Shen
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.03950
Pdf URL: https://arxiv.org/pdf/2412.03950
Copy Paste: [[2412.03950]] BEFL: Balancing Energy Consumption in Federated Learning for Mobile Edge IoT(https://arxiv.org/abs/2412.03950)
Keywords: privacy, federate
Abstract: Federated Learning (FL) is a privacy-preserving distributed learning paradigm designed to build a highly accurate global model. In Mobile Edge IoT (MEIoT), the training and communication processes can significantly deplete the limited battery resources of devices. Existing research primarily focuses on reducing overall energy consumption, but this may inadvertently create energy consumption imbalances, leading to the premature dropout of energy-sensitive devices. To address these challenges, we propose BEFL, a joint optimization framework aimed at balancing three objectives: enhancing global model accuracy, minimizing total energy consumption, and reducing energy usage disparities among devices. First, taking into account the communication constraints of MEIoT and the heterogeneity of devices, we employed the Sequential Least Squares Programming (SLSQP) algorithm for the rational allocation of communication resources. Based on this, we introduce a heuristic client selection algorithm that combines cluster partitioning with utility-driven approaches to alleviate both the total energy consumption of all devices and the discrepancies in energy this http URL, we utilize the proposed heuristic client selection algorithm as a template for offline imitation learning during pre-training, while adopting a ranking-based reinforcement learning approach online to further boost training efficiency. Our experiments reveal that BEFL improves global model accuracy by 1.6%, reduces energy consumption variance by 72.7%, and lowers total energy consumption by 28.2% compared to existing methods. The relevant code can be found at this https URL.

Title: A Framework For Image Synthesis Using Supervised Contrastive Learning

Authors: Yibin Liu, Jianyu Zhang, Li Zhang, Shijian Li, Gang Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03957
Pdf URL: https://arxiv.org/pdf/2412.03957
Copy Paste: [[2412.03957]] A Framework For Image Synthesis Using Supervised Contrastive Learning(https://arxiv.org/abs/2412.03957)
Keywords: generative
Abstract: Text-to-image (T2I) generation aims at producing realistic images corresponding to text descriptions. Generative Adversarial Networks (GANs) have proven to be successful in this task. Typical T2I GANs are two-phase methods that first pretrain an inter-modal representation from aligned image-text pairs and then use a GAN to train an image generator on that basis. However, such a representation ignores the inner-modal semantic correspondence, e.g., images with the same label. The semantic label a priori describes the inherent distribution pattern with underlying cross-image relationships, which supplements the text description for understanding the full characteristics of an image. In this paper, we propose a framework leveraging both inter- and inner-modal correspondence by label-guided supervised contrastive learning. We extend the T2I GANs to two parameter-sharing contrast branches in both pretraining and generation phases. This integration effectively clusters the semantically similar image-text pair representations, thereby fostering the generation of higher-quality images. We demonstrate our framework on four novel T2I GANs on both the single-object dataset CUB and the multi-object dataset COCO, achieving significant improvements in the Inception Score (IS) and Frechet Inception Distance (FID) metrics of image generation evaluation. Notably, on the more complex multi-object COCO, our framework improves FID by 30.1%, 27.3%, 16.2% and 17.1% for AttnGAN, DM-GAN, SSA-GAN and GALIP, respectively. We also validate our superiority by comparing with other label-guided T2I GANs. The results affirm the effectiveness and competitiveness of our approach in advancing the state-of-the-art GAN for T2I generation.

Title: Local Curvature Smoothing with Stein's Identity for Efficient Score Matching

Authors: Genki Osada, Makoto Shing, Takashi Nishide
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.03962
Pdf URL: https://arxiv.org/pdf/2412.03962
Copy Paste: [[2412.03962]] Local Curvature Smoothing with Stein's Identity for Efficient Score Matching(https://arxiv.org/abs/2412.03962)
Keywords: diffusion
Abstract: The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Jacobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score. We propose a novel score matching variant, local curvature smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by applying Stein's identity, enabling regularization effectiveness and efficient computation. We show that LCSS surpasses existing methods in sample generation performance and matches the performance of denoising score matching, widely adopted by most SDMs, in evaluations such as FID, Inception score, and bits per dimension. Furthermore, we show that LCSS enables realistic image generation even at a high resolution of $1024 \times 1024$.

Title: Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation

Authors: Hao Zhu, Yan Zhu, Jiayu Xiao, Tianxiang Xiao, Yike Ma, Yucheng Zhang, Feng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03968
Pdf URL: https://arxiv.org/pdf/2412.03968
Copy Paste: [[2412.03968]] Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation(https://arxiv.org/abs/2412.03968)
Keywords: segmentation
Abstract: Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks is exceptionally complex and time-consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address the above difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relative regions. In addition, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive the clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the bright promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.

Title: HyperDefect-YOLO: Enhance YOLO with HyperGraph Computation for Industrial Defect Detection

Authors: Zuo Zuo, Jiahao Dong, Yue Gao, Zongze Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03969
Pdf URL: https://arxiv.org/pdf/2412.03969
Copy Paste: [[2412.03969]] HyperDefect-YOLO: Enhance YOLO with HyperGraph Computation for Industrial Defect Detection(https://arxiv.org/abs/2412.03969)
Keywords: extraction
Abstract: In the manufacturing industry, defect detection is an essential but challenging task aiming to detect defects generated in the process of production. Though traditional YOLO models present good performance in defect detection, they still have limitations in capturing high-order feature interrelationships, which hinders defect detection in complex scenarios and across scales. To this end, we introduce hypergraph computation into the YOLO framework, dubbed HyperDefect-YOLO (HD-YOLO), to improve representative ability and semantic exploitation. HD-YOLO consists of a Defect Aware Module (DAM) and a Mixed Graph Network (MGNet) in the backbone, which specialize in the perception and extraction of defect features. To effectively aggregate multi-scale features, we propose the HyperGraph Aggregation Network (HGANet), which combines hypergraph and attention mechanisms. Cross-Scale Fusion (CSF) is proposed to adaptively fuse and handle features instead of simple concatenation and convolution. Finally, we propose a Semantic Aware Module (SAM) in the neck to enhance semantic exploitation for accurately localizing defects of different sizes against a disturbed background. HD-YOLO undergoes rigorous evaluation on the public HRIPCB and NEU-DET datasets with significant improvements compared to state-of-the-art methods. We also evaluate HD-YOLO on the self-built MINILED dataset collected in real industrial scenarios to demonstrate the effectiveness of the proposed method. The source codes are at this https URL.

Title: Digital Twin for Evaluating Detective Countermeasures in Smart Grid Cybersecurity

Authors: Omer Sen, Nathalie Bleser, Andreas Ulbig
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.03973
Pdf URL: https://arxiv.org/pdf/2412.03973
Copy Paste: [[2412.03973]] Digital Twin for Evaluating Detective Countermeasures in Smart Grid Cybersecurity(https://arxiv.org/abs/2412.03973)
Keywords: security, attack, robust
Abstract: As the integration of digital technologies and communication systems continues within distribution grids, new avenues emerge to tackle energy transition challenges. Nevertheless, this deeper technological immersion amplifies the necessity for resilience against threats, encompassing both systemic outages and targeted cyberattacks. To ensure the robustness and safeguarding of vital infrastructure, a thorough examination of potential smart grid vulnerabilities and subsequent countermeasure development is essential. This study delves into the potential of digital twins, replicating a smart grid's cyber-physical laboratory environment, thereby enabling focused cybersecurity assessments. Merging the nuances of communication network emulation and power network simulation, we introduce a flexible, comprehensive digital twin model equipped for hardware-in-the-loop evaluations. Through this innovative framework, we not only verify and refine security countermeasures but also underscore their role in maintaining grid stability and trustworthiness.

Title: AI-based Attacker Models for Enhancing Multi-Stage Cyberattack Simulations in Smart Grids Using Co-Simulation Environments

Authors: Omer Sen, Christoph Pohl, Immanuel Hacker, Markus Stroot, Andreas Ulbig
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.03979
Pdf URL: https://arxiv.org/pdf/2412.03979
Copy Paste: [[2412.03979]] AI-based Attacker Models for Enhancing Multi-Stage Cyberattack Simulations in Smart Grids Using Co-Simulation Environments(https://arxiv.org/abs/2412.03979)
Keywords: security, attack, large language model
Abstract: The transition to smart grids has increased the vulnerability of electrical power systems to advanced cyber threats. To safeguard these systems, comprehensive security measures, including preventive, detective, and reactive strategies, are necessary. As part of the critical infrastructure, securing these systems is a major research focus, particularly against cyberattacks. Many methods are developed to detect anomalies and intrusions and assess the damage potential of attacks. However, these methods require large amounts of data, which are often limited or private due to security concerns. We propose a co-simulation framework that employs an autonomous agent to execute modular cyberattacks within a configurable environment, enabling reproducible and adaptable data generation. The impact of virtual attacks is compared to those in a physical lab targeting real smart grids. We also investigate the use of large language models for automating attack generation, though current models on consumer hardware are unreliable. Our approach offers a flexible, versatile source for data generation, aiding in faster prototyping and reducing development resources and time.

Title: Exploring Fully Convolutional Networks for the Segmentation of Hyperspectral Imaging Applied to Advanced Driver Assistance Systems

Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe, M. Victoria Martínez, Inés del Campo
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03982
Pdf URL: https://arxiv.org/pdf/2412.03982
Copy Paste: [[2412.03982]] Exploring Fully Convolutional Networks for the Segmentation of Hyperspectral Imaging Applied to Advanced Driver Assistance Systems(https://arxiv.org/abs/2412.03982)
Keywords: segmentation
Abstract: Advanced Driver Assistance Systems (ADAS) are designed with the main purpose of increasing the safety and comfort of vehicle occupants. Most current computer vision-based ADAS perform detection and tracking tasks quite successfully under regular conditions, but are not completely reliable, particularly under adverse weather and changing lighting conditions, or in complex situations with many overlapping objects. In this work we explore the use of hyperspectral imaging (HSI) in ADAS on the assumption that the distinct near infrared (NIR) spectral reflectances of different materials can help to better separate the objects in a driving scene. In particular, this paper describes some experimental results of the application of fully convolutional networks (FCN) to the image segmentation of HSI for ADAS applications. More specifically, our aim is to investigate to what extent the spatial features codified by convolutional filters can be helpful to improve the performance of HSI segmentation systems. With that aim, we use the HSI-Drive v1.1 dataset, which provides a set of labelled images recorded in real driving conditions with a small-size snapshot NIR-HSI camera. Finally, we analyze the implementability of such an HSI segmentation system by prototyping the developed FCN model together with the necessary hyperspectral cube preprocessing stage and characterizing its performance on an MPSoC.

Title: MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM

Authors: Changcheng Li, Xiangyu Wang, Qiuju Chen, Xiren Zhou, Huanhuan Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03987
Pdf URL: https://arxiv.org/pdf/2412.03987
Copy Paste: [[2412.03987]] MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM(https://arxiv.org/abs/2412.03987)
Keywords: large language model
Abstract: Large language models (LLMs) have shown limitations in tasks requiring complex logical reasoning and multi-step problem-solving. To address these challenges, researchers have employed carefully designed prompts and flowcharts, simulating human cognitive processes to enhance LLM performance, such as the Chain of Thought approach. In this paper, we introduce MTMT (Multi-thinking Modes Tree), a novel method that interacts with LLMs to construct a thought tree, simulating various advanced cognitive processes, including but not limited to association, counterfactual thinking, task decomposition, and comparison. By breaking down the original complex task into simpler sub-questions, MTMT facilitates easier problem-solving for LLMs, enabling more effective utilization of the latent knowledge within LLMs. We evaluate the performance of MTMT under different parameter configurations, using GPT-4o mini as the base model. Our results demonstrate that integrating multiple modes of thinking significantly enhances the ability of LLMs to handle complex tasks.

Title: LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural Networks

Authors: Yongjie Xu, Guangke Chen, Fu Song, Yuqi Chen
Subjects: cs.CR, cs.AI, cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03993
Pdf URL: https://arxiv.org/pdf/2412.03993
Copy Paste: [[2412.03993]] LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural Networks(https://arxiv.org/abs/2412.03993)
Keywords: defense, attack, steal
Abstract: Backdoor attacks embed hidden associations between triggers and targets in deep neural networks (DNNs), causing them to predict the target when a trigger is present while maintaining normal behavior otherwise. Physical backdoor attacks, which use physical objects as triggers, are feasible but lack remote control, temporal stealthiness, flexibility, and mobility. To overcome these limitations, in this work, we propose a new type of backdoor triggers utilizing lasers that feature long-distance transmission and instant-imaging properties. Based on the laser-based backdoor triggers, we present a physical backdoor attack, called LaserGuider, which possesses remote control ability and achieves high temporal stealthiness, flexibility, and mobility. We also introduce a systematic approach to optimize laser parameters for improving attack effectiveness. Our evaluation on traffic sign recognition DNNs, critical in autonomous vehicles, demonstrates that LaserGuider with three different laser-based triggers achieves over 90% attack success rate with negligible impact on normal inputs. Additionally, we release LaserMark, the first dataset of real world traffic signs stamped with physical laser spots, to support further research in backdoor attacks and defenses.

Title: IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

Authors: Sejong Yang, Seoung Wug Oh, Yang Zhou, Seon Joo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04000
Pdf URL: https://arxiv.org/pdf/2412.04000
Copy Paste: [[2412.04000]] IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation(https://arxiv.org/abs/2412.04000)
Keywords: diffusion, generative
Abstract: We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found on this https URL.

Title: Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Authors: Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04003
Pdf URL: https://arxiv.org/pdf/2412.04003
Copy Paste: [[2412.04003]] Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement(https://arxiv.org/abs/2412.04003)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.

Title: (Blind) Users Really Do Heed Aural Telephone Scam Warnings

Authors: Filipo Sharevski, Jennifer Vander Loop, Bill Evans, Alexander Ponticello
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.04014
Pdf URL: https://arxiv.org/pdf/2412.04014
Copy Paste: [[2412.04014]] (Blind) Users Really Do Heed Aural Telephone Scam Warnings(https://arxiv.org/abs/2412.04014)
Keywords: security, privacy, protect, robust
Abstract: This paper reports on a study exploring how two groups of individuals, legally blind (n=36) and sighted ones (n=36), react to aural telephone scam warnings in naturalistic settings. As spoofing a CallerID is trivial, communicating the context of an incoming call instead offers a better possibility to warn a receiver about a potential scam. Usually, such warnings are visual in nature and fail to cater to users with visual disabilities. To address this exclusion, we developed an aural variant of telephone scam warnings and tested them in three conditions: baseline (no warning), short warning, and contextual warning that preceded the scam's content. We tested the two most common scam scenarios: fraud (interest rate reduction) and identity theft (social security number) by cold-calling participants and recording their action, and debriefing and obtaining consent afterward. Only two participants "pressed one" as the scam demanded, both from the legally blind group that heard the contextual warning for the social security scenario. Upon close inspection, we learned that one of them did so because of accessibility issues with their screen reader and the other did so intentionally because the warning convinced them to waste the scammer's time, so they don't scam vulnerable people. Both the legally blind and the sighted participants found the contextual warnings as powerful usable security cues that, together with STIR/SHAKEN indicators like "Scam Likely", would provide robust protection against any type of scam. We also discussed the potential privacy implications of the contextual warnings and collected recommendations for usably accessible implementation.

Title: PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

Authors: Kangan Qian, Xinyu Jiao, Yining Shi, Yunlong Wang, Ziang Luo, Zheng Fu, Kun Jiang, Diange Yang
Subjects: cs.CV, cs.PF, cs.RO
Abstract URL: https://arxiv.org/abs/2412.04020
Pdf URL: https://arxiv.org/pdf/2412.04020
Copy Paste: [[2412.04020]] PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors(https://arxiv.org/abs/2412.04020)
Keywords: robust, generative
Abstract: Reliable perception of spatial and motion information is crucial for safe autonomous navigation. Traditional approaches typically fall into two categories: object-centric and class-agnostic methods. While object-centric methods often struggle with missed detections, leading to inaccuracies in motion prediction, many class-agnostic methods focus heavily on encoder design, often overlooking important priors like rigidity and temporal consistency, leading to suboptimal performance, particularly with sparse LiDAR data in distant regions. To address these issues, we propose $\textbf{PriorMotion}$, a generative framework that extracts rasterized and vectorized scene representations to model spatio-temporal priors. Our model comprises a BEV encoder, a Raster-Vector prior Encoder, and a Spatio-Temporal prior Generator, improving both spatial and temporal consistency in motion prediction. Additionally, we introduce a standardized evaluation protocol for class-agnostic motion prediction. Experiments on the nuScenes dataset show that PriorMotion achieves state-of-the-art performance, with further validation on advanced FMCW LiDAR confirming its robustness.

Title: M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction

Authors: Jiang Liu, Bobo Li, Xinran Yang, Na Yang, Hao Fei, Mingyao Zhang, Fei Li, Donghong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04026
Pdf URL: https://arxiv.org/pdf/2412.04026
Copy Paste: [[2412.04026]] M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction(https://arxiv.org/abs/2412.04026)
Keywords: extraction
Abstract: Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, in order to promote the development of multimodal IE, we constructed a multimodal multilingual multitask dataset, named M$^{3}$D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely-used languages, namely English and Chinese; (3) It includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete. Thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in English and Chinese datasets, respectively, which set a reasonable standard for subsequent research. In addition, we conducted more analytical experiments to verify the effectiveness of our proposed module. We believe that our work can promote the development of the field of multimodal IE.

Title: Mask of truth: model sensitivity to unexpected regions of medical images

Authors: Théo Sourget, Michelle Hestbek-Møller, Amelia Jiménez-Sánchez, Jack Junchi Xu, Veronika Cheplygina
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04030
Pdf URL: https://arxiv.org/pdf/2412.04030
Copy Paste: [[2412.04030]] Mask of truth: model sensitivity to unexpected regions of medical images(https://arxiv.org/abs/2412.04030)
Keywords: explainability
Abstract: The development of larger models for medical image analysis has led to increased performance. However, it also affected our ability to explain and validate model decisions. Models can use non-relevant parts of images, also called spurious correlations or shortcuts, to obtain high performance on benchmark datasets but fail in real-world scenarios. In this work, we challenge the capacity of convolutional neural networks (CNN) to classify chest X-rays and eye fundus images while masking out clinically relevant parts of the image. We show that all models trained on the PadChest dataset, irrespective of the masking strategy, are able to obtain an Area Under the Curve (AUC) above random. Moreover, the models trained on full images obtain good performance on images without the region of interest (ROI), even superior to the one obtained on images only containing the ROI. We also reveal a possible spurious correlation in the Chaksu dataset while the performances are more aligned with the expectation of an unbiased model. We go beyond the performance analysis with the usage of the explainability method SHAP and the analysis of embeddings. We asked a radiology resident to interpret chest X-rays under different masking to complement our findings with clinical knowledge. Our code is available at this https URL and this https URL

Title: Dimension Reduction via Random Projection for Privacy in Multi-Agent Systems

Authors: Puspanjali Ghoshal, Ashok Singh Sairam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.04031
Pdf URL: https://arxiv.org/pdf/2412.04031
Copy Paste: [[2412.04031]] Dimension Reduction via Random Projection for Privacy in Multi-Agent Systems(https://arxiv.org/abs/2412.04031)
Keywords: secure, privacy, attack
Abstract: The agents in a Multi-Agent System (MAS) make observations about the system and send that information to a fusion center. The fusion center aggregates the information and draws conclusions about the system parameters with as much accuracy as possible. However, for the purposes of better efficiency of the system at large, the agents need to append some private parameters to the observed data. In this scenario, the data sent to the fusion center is faced with privacy risks. The data communicated to the fusion center must be secured against data privacy breaches and inference attacks in a decentralized manner. However, this in turn leads to a loss of utility of the data being sent to the fusion center. We quantify the utility and privacy of the system using cosine similarity. We formulate our MAS problem in terms of deducing a concept for which compression-based methods exist in the literature. Next, we propose a novel sanitization mechanism for our MAS using one such compression-based method while addressing the utility-privacy tradeoff problem.
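Illustrative sketch (not from the paper): the abstract names random projection in the title and cosine similarity as the utility and privacy measure, but does not say how the two are combined. The snippet below is only a generic illustration of both ingredients, assuming a Gaussian projection matrix scaled by 1/sqrt(k); the function names, dimensions, and seeds are hypothetical choices made here.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def random_projection_matrix(k, d, seed=0):
    """k x d matrix with i.i.d. N(0, 1/k) entries (an assumed, common choice)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vector):
    """Map a d-dimensional vector to k dimensions via matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical data: an observation x and a slightly perturbed copy y.
d, k = 64, 16
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(d)]
y = [xi + 0.1 * data_rng.gauss(0.0, 1.0) for xi in x]

R = random_projection_matrix(k, d, seed=42)
before = cosine_similarity(x, y)
after = cosine_similarity(project(R, x), project(R, y))
print(f"cosine similarity before projection: {before:.3f}")
print(f"cosine similarity after projection:  {after:.3f}")
```

Comparing the two printed values gives a rough feel for how much angular structure survives the reduction from d to k dimensions; how the paper turns such a comparison into utility and privacy scores is not stated in the abstract.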

Title: Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Authors: Yunhua Pei, Jin Zheng, John Cartlidge
Subjects: cs.LG, cs.NE, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.04034
Pdf URL: https://arxiv.org/pdf/2412.04034
Copy Paste: [[2412.04034]] Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations(https://arxiv.org/abs/2412.04034)
Keywords: robust
Abstract: Temporal Graph Learning (TGL) is crucial for capturing the evolving nature of stock markets. Traditional methods often ignore the interplay between dynamic temporal changes and static relational structures between stocks. To address this issue, we propose the Dynamic Graph Representation with Contrastive Learning (DGRCL) framework, which integrates dynamic and static graph relations to improve the accuracy of stock trend prediction. Our framework introduces two key components: the Embedding Enhancement (EE) module and the Contrastive Constrained Training (CCT) module. The EE module focuses on dynamically capturing the temporal evolution of stock data, while the CCT module enforces static constraints based on stock relations, refined within contrastive learning. This dual-relation approach allows for a more comprehensive understanding of stock market dynamics. Our experiments on two major U.S. stock market datasets, NASDAQ and NYSE, demonstrate that DGRCL significantly outperforms state-of-the-art TGL baselines. Ablation studies indicate the importance of both modules. Overall, DGRCL not only enhances prediction ability but also provides a robust framework for integrating temporal and relational data in dynamic graphs. Code and data are available for public access.

Title: AI4EF: Artificial Intelligence for Energy Efficiency in the Building Sector

Authors: Alexandros Menelaos Tzortzis, Georgios Kormpakis, Sotiris Pelekis, Ariadni Michalitsi-Psarrou, Evangelos Karakolis, Christos Ntanos, Dimitris Askounis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04045
Pdf URL: https://arxiv.org/pdf/2412.04045
Copy Paste: [[2412.04045]] AI4EF: Artificial Intelligence for Energy Efficiency in the Building Sector(https://arxiv.org/abs/2412.04045)
Keywords: security
Abstract: AI4EF, Artificial Intelligence for Energy Efficiency, is an advanced, user-centric tool designed to support decision-making in building energy retrofitting and efficiency optimization. Leveraging machine learning (ML) and data-driven insights, AI4EF enables stakeholders such as public sector representatives, energy consultants, and building owners to model, analyze, and predict energy consumption, retrofit costs, and environmental impacts of building upgrades. Featuring a modular framework, AI4EF includes customizable building retrofitting, photovoltaic installation assessment, and predictive modeling tools that allow users to input building parameters and receive tailored recommendations for achieving energy savings and carbon reduction goals. Additionally, the platform incorporates a Training Playground for data scientists to refine ML models used by said framework. Finally, AI4EF provides access to the Enershare Data Space to facilitate seamless data sharing and access within the ecosystem. Its compatibility with open-source identity management, Keycloak, enhances security and accessibility, making it adaptable for various regulatory and organizational contexts. This paper presents an architectural overview of AI4EF, its application in energy efficiency scenarios, and its potential for advancing sustainable energy practices through artificial intelligence (AI).

Title: Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting MPs

Authors: Mugdha Pandya, Mali Jin, Kalina Bontcheva, Diana Maynard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04046
Pdf URL: https://arxiv.org/pdf/2412.04046
Copy Paste: [[2412.04046]] Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting MPs(https://arxiv.org/abs/2412.04046)
Keywords: attack, large language model
Abstract: Numerous politicians use social media platforms, particularly X, to engage with their constituents. This interaction allows constituents to pose questions and offer feedback but also exposes politicians to a barrage of hostile responses, especially given the anonymity afforded by social media. They are typically targeted in relation to their governmental role, but the comments also tend to attack their personal identity. This can discredit politicians and reduce public trust in the government. It can also incite anger and disrespect, leading to offline harm and violence. While numerous models exist for detecting hostility in general, they lack the specificity required for political contexts. Furthermore, addressing hostility towards politicians demands tailored approaches due to the distinct language and issues inherent to each country (e.g., Brexit for the UK). To bridge this gap, we construct a dataset of 3,320 English tweets spanning a two-year period manually annotated for hostility towards UK MPs. Our dataset also captures the targeted identity characteristics (race, gender, religion, none) in hostile tweets. We perform linguistic and topical analyses to delve into the unique content of the UK political data. Finally, we evaluate the performance of pre-trained language models and large language models on binary hostility detection and multi-class targeted identity type classification tasks. Our study offers valuable data and insights for future research on the prevalence and nature of politics-related hostility specific to the UK.

Title: How to design a Public Key Infrastructure for a Central Bank Digital Currency

Authors: Makan Rafiee, Lars Hupel
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2412.04051
Pdf URL: https://arxiv.org/pdf/2412.04051
Copy Paste: [[2412.04051]] How to design a Public Key Infrastructure for a Central Bank Digital Currency(https://arxiv.org/abs/2412.04051)
Keywords: robust
Abstract: Central Bank Digital Currency (CBDC) is a new form of money, issued by a country's or region's central bank, that can be used for a variety of payment scenarios. Depending on its concrete implementation, there are many participants in a production CBDC ecosystem, including the central bank, commercial banks, merchants, individuals, and wallet providers. There is a need for robust and scalable Public Key Infrastructure (PKI) for CBDC to ensure the continued trust of all entities in the system. This paper discusses the criteria that should flow into the design of a PKI and proposes a certificate hierarchy, together with a rollover concept ensuring continuous operation of the system. We further consider several peculiarities, such as the circulation of offline-capable hardware wallets.

Title: TransAdapter: Vision Transformer for Feature-Centric Unsupervised Domain Adaptation

Authors: A. Enes Doruk, Erhan Oztop, Hasan F. Ates
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04073
Pdf URL: https://arxiv.org/pdf/2412.04073
Copy Paste: [[2412.04073]] TransAdapter: Vision Transformer for Feature-Centric Unsupervised Domain Adaptation(https://arxiv.org/abs/2412.04073)
Keywords: transformer
Abstract: Unsupervised Domain Adaptation (UDA) aims to utilize labeled data from a source domain to solve tasks in an unlabeled target domain, often hindered by significant domain gaps. Traditional CNN-based methods struggle to fully capture complex domain relationships, motivating the shift to vision transformers like the Swin Transformer, which excel in modeling both local and global dependencies. In this work, we propose a novel UDA approach leveraging the Swin Transformer with three key modules. A Graph Domain Discriminator enhances domain alignment by capturing inter-pixel correlations through graph convolutions and entropy-based attention differentiation. An Adaptive Double Attention module combines Windows and Shifted Windows attention with dynamic reweighting to align long-range and local features effectively. Finally, a Cross-Feature Transform modifies Swin Transformer blocks to improve generalization across domains. Extensive benchmarks confirm the state-of-the-art performance of our versatile method, which requires no task-specific alignment modules, establishing its adaptability to diverse applications.

Title: SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning

Authors: Seokju Yun, Seunghye Chae, Dongheon Lee, Youngmin Ro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04077
Pdf URL: https://arxiv.org/pdf/2412.04077
Copy Paste: [[2412.04077]] SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning(https://arxiv.org/abs/2412.04077)
Keywords: robust, segmentation
Abstract: Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter-Efficient Fine-Tuning (PEFT) of foundation models has shown promising results in the context of the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving generalizable components of the pre-trained model and learning task-specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Low-Rank Adaptation (SoRA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoRA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Furthermore, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoRA attains state-of-the-art results on multiple benchmarks that span both domain generalized semantic segmentation and domain generalized object detection. In addition, our methods introduce no additional inference overhead or regularization loss, maintain compatibility with any backbone or head, and are designed to be versatile, allowing easy integration into a wide range of tasks.

Title: Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning

Authors: Shicheng Zhou, Jingju Liu, Yuliang Lu, Jiahai Yang, Yue Zhang, Jie Chen
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.04078
Pdf URL: https://arxiv.org/pdf/2412.04078
Copy Paste: [[2412.04078]] Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning(https://arxiv.org/abs/2412.04078)
Keywords: large language model
Abstract: With increasing numbers of vulnerabilities exposed on the internet, autonomous penetration testing (pentesting) has emerged as a research area, while reinforcement learning (RL) is a natural fit for studying autonomous pentesting. Previous research in RL-based autonomous pentesting mainly focused on enhancing agents' learning efficacy within abstract simulated training environments. They overlooked the applicability and generalization requirements of deploying agents' policies in real-world environments that differ substantially from their training settings. In contrast, for the first time, we shift focus to the pentesting agents' ability to generalize across unseen real environments. For this purpose, we propose a Generalizable Autonomous Pentesting framework (namely GAP) for training agents capable of drawing inferences from one to another -- a key requirement for the broad application of autonomous pentesting and a hallmark of human intelligence. GAP introduces a Real-to-Sim-to-Real pipeline with two key methods: domain randomization and meta-RL. Specifically, we are among the first to apply domain randomization in autonomous pentesting and propose a large language model-powered domain randomization method for synthetic environment generation. We further apply meta-RL to improve the agents' generalization ability in unseen environments by leveraging the synthetic environments. The combination of these two methods can effectively bridge the generalization gap and improve policy adaptation performance. Experiments are conducted on various vulnerable virtual machines, with results showing that GAP can (a) enable policy learning in unknown real environments, (b) achieve zero-shot policy transfer in similar environments, and (c) realize rapid policy adaptation in dissimilar environments.

Title: Federated Learning in Mobile Networks: A Comprehensive Case Study on Traffic Forecasting

Authors: Nikolaos Pavlidis, Vasileios Perifanis, Selim F. Yilmaz, Francesc Wilhelmi, Marco Miozzo, Pavlos S. Efraimidis, Remous-Aris Koutsiamanis, Pavol Mulinka, Paolo Dini
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04081
Pdf URL: https://arxiv.org/pdf/2412.04081
Copy Paste: [[2412.04081]] Federated Learning in Mobile Networks: A Comprehensive Case Study on Traffic Forecasting(https://arxiv.org/abs/2412.04081)
Keywords: privacy, robust, federate
Abstract: The increasing demand for efficient resource allocation in mobile networks has catalyzed the exploration of innovative solutions that could enhance the task of real-time cellular traffic prediction. Under these circumstances, federated learning (FL) stands out as a distributed and privacy-preserving solution to foster collaboration among different sites, thus enabling responsive near-the-edge solutions. In this paper, we comprehensively study the potential benefits of FL in telecommunications through a case study on federated traffic forecasting using real-world data from base stations (BSs) in Barcelona (Spain). Our study encompasses relevant aspects within the federated experience, including model aggregation techniques, outlier management, the impact of individual clients, personalized learning, and the integration of exogenous sources of data. The performed evaluation is based on both prediction accuracy and sustainability, thus showcasing the environmental impact of employed FL algorithms in various settings. The findings from our study highlight FL as a promising and robust solution for mobile traffic prediction, emphasizing its twin merits as a privacy-conscious and environmentally sustainable approach, while also demonstrating its capability to overcome data heterogeneity and ensure high-quality predictions, marking a significant stride towards its integration in mobile traffic management systems.

Title: LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents

Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04090
Pdf URL: https://arxiv.org/pdf/2412.04090
Copy Paste: [[2412.04090]] LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents(https://arxiv.org/abs/2412.04090)
Keywords: large language model
Abstract: We present the first loss agent, dubbed LossAgent, for low-level image processing tasks, e.g., image super-resolution and restoration, intending to achieve any customized optimization objectives of low-level image processing in different practical applications. Notably, not all optimization objectives, such as complex hand-crafted perceptual metrics, text description, and intricate human feedback, can be instantiated with existing low-level losses, e.g., MSE loss, which presents a crucial challenge in optimizing image processing networks in an end-to-end manner. To eliminate this, our LossAgent introduces the powerful large language model (LLM) as the loss agent, where the rich textual understanding of prior knowledge empowers the loss agent with the potential to understand complex optimization objectives, trajectory, and state feedback from external environments in the optimization process of the low-level image processing networks. In particular, we establish the loss repository by incorporating existing loss functions that support the end-to-end optimization for low-level image processing. Then, we design the optimization-oriented prompt engineering for the loss agent to actively and intelligently decide the compositional weights for each loss in the repository at each optimization interaction, thereby achieving the required optimization trajectory for any customized optimization objectives. Extensive experiments on three typical low-level image processing tasks and multiple optimization objectives have shown the effectiveness and applicability of our proposed LossAgent. Code and pre-trained models will be available at this https URL.
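Illustrative sketch (not the authors' code): the abstract describes a repository of existing losses whose compositional weights are decided at each optimization interaction. The snippet below shows only the weighted-composition step. In the paper the weights come from an LLM; here they are supplied by hand, and the repository contents and all names (LOSS_REPOSITORY, composite_loss) are hypothetical.

```python
def mse_loss(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred, target):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical loss repository: name -> differentiable-style loss function.
LOSS_REPOSITORY = {"mse": mse_loss, "l1": l1_loss}


def composite_loss(pred, target, weights):
    """Weighted sum of repository losses; losses absent from `weights` are skipped."""
    unknown = set(weights) - set(LOSS_REPOSITORY)
    if unknown:
        raise KeyError(f"losses not in repository: {sorted(unknown)}")
    return sum(w * LOSS_REPOSITORY[name](pred, target) for name, w in weights.items())


# Weights that an agent might propose for one interaction (hand-picked here).
weights = {"mse": 0.5, "l1": 2.0}
pred, target = [1.0, 2.0], [0.0, 4.0]
print(composite_loss(pred, target, weights))  # 0.5 * 2.5 + 2.0 * 1.5 = 4.25
```

Keeping the weights as a plain mapping means a new set can be swapped in at every interaction without touching the individual loss functions.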

Title: MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Authors: Haoning Wu, Ziheng Zhao, Ya Zhang, Weidi Xie, Yanfeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04106
Pdf URL: https://arxiv.org/pdf/2412.04106
Copy Paste: [[2412.04106]] MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities(https://arxiv.org/abs/2412.04106)
Keywords: diffusion, generative, segmentation
Abstract: Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: (i) we collect and curate a large-scale radiology image-text dataset, MedGen-1M, comprising modality labels, attributes, region, and organ information, along with a subset of organ mask annotations, to support research in controllable medical image generation; (ii) we propose a diffusion-based data engine, termed MRGen, which enables generation conditioned on text prompts and masks, synthesizing MR images for diverse modalities lacking mask annotations, to train segmentation models on unannotated modalities; (iii) we conduct extensive experiments across various modalities, illustrating that our data engine can effectively synthesize training samples and extend MRI segmentation towards unannotated modalities.

Title: MVUDA: Unsupervised Domain Adaptation for Multi-view Pedestrian Detection

Authors: Erik Brorsson, Lennart Svensson, Kristofer Bengtsson, Knut Åkesson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04117
Pdf URL: https://arxiv.org/pdf/2412.04117
Copy Paste: [[2412.04117]] MVUDA: Unsupervised Domain Adaptation for Multi-view Pedestrian Detection(https://arxiv.org/abs/2412.04117)
Keywords: robust
Abstract: We address multi-view pedestrian detection in a setting where labeled data is collected using a multi-camera setup different from the one used for testing. While recent multi-view pedestrian detectors perform well on the camera rig used for training, their performance declines when applied to a different setup. To facilitate seamless deployment across varied camera rigs, we propose an unsupervised domain adaptation (UDA) method that adapts the model to new rigs without requiring additional labeled data. Specifically, we leverage the mean teacher self-training framework with a novel pseudo-labeling technique tailored to multi-view pedestrian detection. This method achieves state-of-the-art performance on multiple benchmarks, including MultiviewX→Wildtrack. Unlike previous methods, our approach eliminates the need for external labeled monocular datasets, thereby reducing reliance on labeled data. Extensive evaluations demonstrate the effectiveness of our method and validate key design choices. By enabling robust adaptation across camera setups, our work enhances the practicality of multi-view pedestrian detectors and establishes a strong UDA baseline for future research.
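For readers unfamiliar with the mean teacher framework mentioned above, its core step is an exponential moving average (EMA) of the student's weights. The sketch below shows only that generic update on flat lists of floats; the function name and the `decay` value are assumptions, and the paper's pseudo-labeling technique is not reproduced.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Return new teacher weights as an EMA of the student weights.

    decay close to 1.0 makes the teacher change slowly, which is what lets it
    act as a stable source of pseudo-labels for the student.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Illustrative loop: the student would be trained on pseudo-labels produced
# by the teacher (omitted here), then the teacher is refreshed.
teacher_weights = [1.0, 0.0]
student_weights = [0.0, 1.0]
for _ in range(3):
    teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)
```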

Title: DeepFEA: Deep Learning for Prediction of Transient Finite Element Analysis Solutions

Authors: Georgios Triantafyllou, Panagiotis G. Kalozoumis, George Dimas, Dimitris K. Iakovidis
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2412.04121
Pdf URL: https://arxiv.org/pdf/2412.04121
Copy Paste: [[2412.04121]] DeepFEA: Deep Learning for Prediction of Transient Finite Element Analysis Solutions(https://arxiv.org/abs/2412.04121)
Keywords: robust
Abstract: Finite Element Analysis (FEA) is a powerful but computationally intensive method for simulating physical phenomena. Recent advancements in machine learning have led to surrogate models capable of accelerating FEA. Yet there are still limitations in developing surrogates of transient FEA models that can simultaneously predict the solutions for both nodes and elements with applicability on both the 2D and 3D domains. Motivated by this research gap, this study proposes DeepFEA, a deep learning-based framework that leverages a multilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching into two parallel convolutional neural networks to predict the solutions for both nodes and elements of FEA models. The proposed network is optimized using a novel adaptive learning algorithm, called Node-Element Loss Optimization (NELO). NELO minimizes the error occurring at both branches of the network enabling the prediction of solutions for transient FEA simulations. The experimental evaluation of DeepFEA is performed on three datasets in the context of structural mechanics, generated to serve as publicly available reference datasets. The results show that DeepFEA can achieve less than 3% normalized mean and root mean squared error for 2D and 3D simulation scenarios, and inference times that are two orders of magnitude faster than FEA. In contrast, relevant state-of-the-art methods face challenges with multi-dimensional output and dynamic input prediction. Furthermore, DeepFEA's robustness was demonstrated in a real-life biomedical scenario, confirming its suitability for accurate and efficient predictions of FEA simulations.

Title: Deep priors for satellite image restoration with accurate uncertainties

Authors: Biquard Maud, Marie Chabert, Florence Genin, Christophe Latry, Thomas Oberlin
Subjects: cs.CV, eess.IV, physics.optics
Abstract URL: https://arxiv.org/abs/2412.04130
Pdf URL: https://arxiv.org/pdf/2412.04130
Copy Paste: [[2412.04130]] Deep priors for satellite image restoration with accurate uncertainties(https://arxiv.org/abs/2412.04130)
Keywords: robust
Abstract: Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. Their restoration, classically including denoising, deblurring, and sometimes super-resolution, is required before their exploitation. Moreover, quantifying the uncertainty related to this restoration could be valuable by lowering the risk of hallucination and avoiding propagating these biases in downstream applications. Deep learning methods are now state-of-the-art for satellite image restoration. However, they require training a specific network for each sensor and they do not provide the associated uncertainties. This paper proposes a generic method involving a single network to restore images from several sensors and a scalable way to derive the uncertainties. We focus on deep regularization (DR) methods, which learn a deep prior on target images before plugging it into a model-based optimization scheme. First, we introduce VBLE-xz, which solves the inverse problem in the latent space of a variational compressive autoencoder, estimating the uncertainty jointly in the latent and in the image spaces. It enables scalable posterior sampling with relevant and calibrated uncertainties. Second, we propose the denoiser-based method SatDPIR, adapted from DPIR, which efficiently computes accurate point estimates. We conduct a comprehensive set of experiments on very high resolution simulated and real Pleiades images, asserting both the performance and robustness of the proposed methods. VBLE-xz and SatDPIR achieve state-of-the-art results compared to direct inversion methods. In particular, VBLE-xz is a scalable method to get realistic posterior samples and accurate uncertainties, while SatDPIR represents a compelling alternative to direct inversion methods when uncertainty quantification is not required.

Title: Compositional Generative Multiphysics and Multi-component Simulation

Authors: Tao Zhang, Zhenhai Liu, Feipeng Qi, Yongjun Jiao, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04134
Pdf URL: https://arxiv.org/pdf/2412.04134
Copy Paste: [[2412.04134]] Compositional Generative Multiphysics and Multi-component Simulation(https://arxiv.org/abs/2412.04134)
Keywords: diffusion, generative
Abstract: Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields like nuclear and aerospace engineering. Previous studies often rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers (each responsible for evolving a specific physical process) into a coupled program, which introduces significant development challenges. Furthermore, no universal algorithm exists for multi-component simulations, which adds to the complexity. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions modeling the conditional probability of one physical process/component conditioned on other processes/components. In inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, achieved by composing the learned energy functions in a structured way. We test our method in three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupling solution using decoupled data, while the surrogate model fails in the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared to the surrogate model.

Title: Text Change Detection in Multilingual Documents Using Image Comparison

Authors: Doyoung Park, Naresh Reddy Yarram, Sunjin Kim, Minkyu Kim, Seongho Cho, Taehee Lee
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04137
Pdf URL: https://arxiv.org/pdf/2412.04137
Copy Paste: [[2412.04137]] Text Change Detection in Multilingual Documents Using Image Comparison(https://arxiv.org/abs/2412.04137)
Keywords: segmentation
Abstract: Document comparison typically relies on optical character recognition (OCR) as its core technology. However, OCR requires the selection of appropriate language models for each document and the performance of multilingual or hybrid models remains limited. To overcome these challenges, we propose text change detection (TCD) using an image comparison model tailored for multilingual documents. Unlike OCR-based approaches, our method employs word-level text image-to-image comparison to detect changes. Our model generates bidirectional change segmentation maps between the source and target documents. To enhance performance without requiring explicit text alignment or scaling preprocessing, we employ correlations among multi-scale attention features. We also construct a benchmark dataset comprising actual printed and scanned word pairs in various languages to evaluate our model. We validate our approach using our benchmark dataset and the public benchmarks Distorted Document Images and the LRDE Document Binarization Dataset. We compare our model against state-of-the-art semantic segmentation and change detection models, as well as conventional OCR-based models.

Title: Understanding Memorization in Generative Models via Sharpness in Probability Landscapes

Authors: Dongjae Jeon, Dueun Kim, Albert No
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04140
Pdf URL: https://arxiv.org/pdf/2412.04140
Copy Paste: [[2412.04140]] Understanding Memorization in Generative Models via Sharpness in Probability Landscapes(https://arxiv.org/abs/2412.04140)
Keywords: secure, diffusion, generative
Abstract: In this paper, we introduce a geometric framework to analyze memorization in diffusion models using the eigenvalues of the Hessian of the log probability density. We propose that memorization arises from isolated points in the learned probability distribution, characterized by sharpness in the probability landscape, as indicated by large negative eigenvalues of the Hessian. Through experiments on various datasets, we demonstrate that these eigenvalues effectively detect and quantify memorization. Our approach provides a clear understanding of memorization in diffusion models and lays the groundwork for developing strategies to ensure secure and reliable generative models.
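The link between sharpness and large negative Hessian eigenvalues can be seen in a one-dimensional toy case. The sketch below assumes a Gaussian density, for which the second derivative of the log density is -1/std^2, and estimates it by finite differences. The function names and step size are illustrative assumptions, not the paper's estimator for diffusion models.

```python
import math


def log_gaussian(x: float, mean: float, std: float) -> float:
    """Log density of a 1-D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def log_density_curvature(std: float, x: float = 0.0, h: float = 1e-3) -> float:
    """Central finite-difference estimate of d^2/dx^2 log p(x) for a zero-mean Gaussian.

    In 1-D this is the single 'Hessian eigenvalue'; analytically it is -1 / std**2.
    """
    f = lambda v: log_gaussian(v, 0.0, std)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# A narrow (sharply peaked) density has a far more negative curvature than a broad one.
sharp = log_density_curvature(std=0.1)
broad = log_density_curvature(std=1.0)
```

In this toy setting a density concentrated tightly around one point, as one might expect around a memorized sample, shows up as a curvature of much larger magnitude.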

Title: Reducing Tool Hallucination via Reliability Alignment

Authors: Hongshen Xu, Su Zhu, Zihan Wang, Hang Zheng, Da Ma, Ruisheng Cao, Shuai Fan, Lu Chen, Kai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04141
Pdf URL: https://arxiv.org/pdf/2412.04141
Copy Paste: [[2412.04141]] Reducing Tool Hallucination via Reliability Alignment(https://arxiv.org/abs/2412.04141)
Keywords: large language model
Abstract: Large Language Models (LLMs) have extended their capabilities beyond language generation to interact with external systems through tool calling, offering powerful potential for real-world applications. However, the phenomenon of tool hallucinations, which occur when models improperly select or misuse tools, presents critical challenges that can lead to flawed task execution and increased operational costs. This paper investigates the concept of reliable tool calling and highlights the necessity of addressing tool hallucinations. We systematically categorize tool hallucinations into two main types: tool selection hallucination and tool usage hallucination. To mitigate these issues, we propose a reliability-focused alignment framework that enhances the model's ability to accurately assess tool relevance and usage. By proposing a suite of evaluation metrics and evaluating on StableToolBench, we further demonstrate the effectiveness of our framework in mitigating tool hallucination and improving the overall system reliability of LLM tool calling.

Title: AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Authors: Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04146
Pdf URL: https://arxiv.org/pdf/2412.04146
Copy Paste: [[2412.04146]] AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models(https://arxiv.org/abs/2412.04146)
Keywords: diffusion
Abstract: Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire, and struggle to preserve the garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-crafted design, AnyDressing can serve as a plug-in module to easily integrate with any community control extensions for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.

Title: Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Authors: Haitian Zhang, Xiangyuan Wang, Chang Xu, Xinya Wang, Fang Xu, Huai Yu, Lei Yu, Wen Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04149
Pdf URL: https://arxiv.org/pdf/2412.04149
Copy Paste: [[2412.04149]] Frequency-Adaptive Low-Latency Object Detection Using Events and Frames(https://arxiv.org/abs/2412.04149)
Keywords: robust
Abstract: Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches: low-latency Events vs. high-latency RGB frames; temporally sparse labels in training vs. continuous flow in inference, significantly hinder the high-frequency fusion-based object detection. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB Mismatch. We further propose a training strategy, Time Shift, which enforces the module to align the prediction from temporally shifted Event-RGB pairs and their original representation, that is, consistent with Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream toward high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs demonstrate better generalization from low training frequency to higher inference frequencies compared to using Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that our FAOD achieves SOTA performance. Specifically, in the PKU-DAVIS-SOD Dataset, FAOD achieves 9.8 points improvement in terms of the mAP in fully paired Event-RGB data with only a quarter of the parameters compared to SODFormer, and even maintains robust performance (only a 3 points drop in mAP) under 80× Event-RGB frequency mismatch.

Title: On the Lack of Robustness of Binary Function Similarity Systems

Authors: Gianluca Capozzi, Tong Tang, Jie Wan, Ziqi Yang, Daniele Cono D'Elia, Giuseppe Antonio Di Luna, Lorenzo Cavallaro, Leonardo Querzoni
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04163
Pdf URL: https://arxiv.org/pdf/2412.04163
Copy Paste: [[2412.04163]] On the Lack of Robustness of Binary Function Similarity Systems(https://arxiv.org/abs/2412.04163)
Keywords: security, attack, robust
Abstract: Binary function similarity, which often relies on learning-based algorithms to identify what functions in a pool are most similar to a given query function, is a sought-after topic in different communities, including machine learning, software engineering, and security. Its importance stems from the impact it has in facilitating several crucial tasks, from reverse engineering and malware analysis to automated vulnerability detection. Whereas recent work has cast light on performance on this long-studied problem, the research landscape remains largely lackluster in understanding the resiliency of the state-of-the-art machine learning models against adversarial attacks. As security requires reasoning about adversaries, in this work we assess the robustness of such models through a simple yet effective black-box greedy attack, which modifies the topology and the content of the control flow of the attacked functions. We demonstrate that this attack is successful in compromising all the models, achieving average attack success rates of 57.06% and 95.81% depending on the problem settings (targeted and untargeted attacks). Our findings are insightful: top performance on clean data does not necessarily relate to top robustness properties, which explicitly highlights performance-robustness trade-offs one should consider when deploying such models, calling for further research.

Title: Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure

Authors: Florens Rohde, Victor Christen, Martin Franke, Erhard Rahm
Subjects: cs.CR, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04178
Pdf URL: https://arxiv.org/pdf/2412.04178
Copy Paste: [[2412.04178]] Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure(https://arxiv.org/abs/2412.04178)
Keywords: privacy
Abstract: Privacy-Preserving Record Linkage (PPRL) is an essential component in data integration tasks of sensitive information. The linkage quality determines the usability of combined datasets and (machine learning) applications based on them. We present a novel privacy-preserving protocol that integrates clerical review in PPRL using a multi-layer active learning process. Uncertain match candidates are reviewed on several layers by human and non-human oracles to reduce the amount of disclosed information per record and in total. Predictions are propagated back to update previous layers, resulting in an improved linkage performance for non-reviewed candidates as well. The data owners remain in control of the amount of information they share for each record. Therefore, our approach follows need-to-know and data sovereignty principles. The experimental evaluation on real-world datasets shows considerable linkage quality improvements with limited labeling effort and privacy risks.

Title: SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

Authors: Runsheng Bai, Qiang Liu, Bo Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04180
Pdf URL: https://arxiv.org/pdf/2412.04180
Copy Paste: [[2412.04180]] SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization(https://arxiv.org/abs/2412.04180)
Keywords: large language model
Abstract: Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called SKIM: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A greedy algorithm to solve approximately optimal bit allocation across weight channels, and 2. A trainable scaling vector for non-differentiable K-means clustering. These techniques substantially improve performance and can be adapted to any given bit. Notably, in terms of model perplexity, our method narrows the gap between 3-bit quantized LLaMA models and their full precision counterparts by 16.3% on average.
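As background for the K-means clustering this abstract builds on, the sketch below quantizes a list of weights to `2**bits` shared centroids with plain Lloyd iterations. It is a generic illustration under stated assumptions (evenly spaced initial centroids, a fixed iteration count, hypothetical function name); it does not implement SKIM's greedy bit allocation or trainable scaling vector.

```python
from typing import List, Tuple


def kmeans_quantize(weights: List[float], bits: int, iters: int = 20) -> Tuple[List[float], List[int]]:
    """Cluster weights into 2**bits centroids; return (centroids, per-weight codes)."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Deterministic initialization: centroids evenly spaced over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign() -> List[int]:
        # Each weight takes the index of its nearest centroid.
        return [min(range(k), key=lambda j: abs(w - centroids[j])) for w in weights]

    for _ in range(iters):
        codes = assign()
        for j in range(k):
            members = [w for w, c in zip(weights, codes) if c == j]
            if members:  # leave empty clusters where they are
                centroids[j] = sum(members) / len(members)
    return centroids, assign()


def dequantize(centroids: List[float], codes: List[int]) -> List[float]:
    """Reconstruct approximate weights from the codebook and codes."""
    return [centroids[c] for c in codes]
```

Only the small codebook and the `bits`-wide codes need to be stored, which is where the memory saving of this style of quantization comes from.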

Title: Linear Discriminant Analysis in Credit Scoring: A Transparent Hybrid Model Approach

Authors: Md Shihab Reza, Monirul Islam Mahmud, Ifti Azad Abeer, Nova Ahmed
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04183
Pdf URL: https://arxiv.org/pdf/2412.04183
Copy Paste: [[2412.04183]] Linear Discriminant Analysis in Credit Scoring: A Transparent Hybrid Model Approach(https://arxiv.org/abs/2412.04183)
Keywords: fair, interpretability, explainability
Abstract: The development of computing has made credit scoring approaches possible, with various machine learning (ML) and deep learning (DL) techniques becoming more and more valuable. While complex models yield more accurate predictions, their interpretability is often weakened, which is a concern for credit scoring that places importance on decision fairness. As features of the dataset are a crucial factor for the credit scoring system, we implement Linear Discriminant Analysis (LDA) as a feature reduction technique, which reduces the burden of the model's complexity. We compared 6 different machine learning models, 1 deep learning model, and a hybrid model with and without using LDA. From the result, we have found our hybrid model, XG-DNN, outperformed other models with the highest accuracy of 99.45% and a 99% F1 score with LDA. Lastly, to interpret model decisions, we have applied 2 different explainable AI techniques named LIME (local) and Morris Sensitivity Analysis (global). Through this research, we showed how feature reduction techniques can be used without affecting the performance and explainability of the model, which can be very useful in resource-constrained settings to optimize the computational workload.
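To make the feature-reduction role of LDA concrete, the following sketch computes the two-class Fisher discriminant direction for two-dimensional points and projects each point onto it, reducing two features to one. The function names and toy data are assumptions for illustration; the paper's dataset and preprocessing are not reproduced, and the code assumes a non-singular within-class scatter matrix.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _mean(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def fisher_lda_direction(class0: Sequence[Point], class1: Sequence[Point]) -> List[float]:
    """Return w = S_w^{-1} (m1 - m0) for two classes of 2-D points."""
    m0, m1 = _mean(class0), _mean(class1)
    # Within-class scatter matrix S_w, summed over both classes.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for points, m in ((class0, m0), (class1, m1)):
        for p in points:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]  # assumed non-zero
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(point: Point, w: Sequence[float]) -> float:
    """Reduce a 2-D point to a single discriminant score."""
    return point[0] * w[0] + point[1] * w[1]
```

A downstream classifier would then be trained on the projected scores rather than on the original feature vector.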

Title: Instructional Video Generation

Authors: Yayuan Li, Zhi Cao, Jason J. Corso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04189
Pdf URL: https://arxiv.org/pdf/2412.04189
Copy Paste: [[2412.04189]] Instructional Video Generation(https://arxiv.org/abs/2412.04189)
Keywords: diffusion
Abstract: Despite the recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is the class of egocentric instructional videos in which the intricate motion of the hand coupled with a mostly stable and non-distracting environment is necessary to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method incorporates two distinct innovations. First, we propose an automatic method to generate the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss to guide the diffusion model to focus on smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D, demonstrating significant improvements over state-of-the-art methods in terms of instructional clarity, especially of the hand motion in the target region, across diverse environments and this http URL results can be found on the project webpage: this https URL

Title: AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic

Authors: Nathaniel R. Robinson, Shahd Abdelmoneim, Kelly Marchisio, Sebastian Ruder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04193
Pdf URL: https://arxiv.org/pdf/2412.04193
Copy Paste: [[2412.04193]] AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic(https://arxiv.org/abs/2412.04193)
Keywords: large language model
Abstract: Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits language modeling applications, yet the research community lacks operationalized LLM performance measurements in DA. We present a method that comprehensively evaluates LLM fidelity, understanding, quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA varieties across these four dimensions and provide best practice recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, but does not suggest deterioration in quality when they do. Further analysis suggests that current post-training can degrade DA capabilities, that few-shot examples can overcome this and other LLM deficiencies, and that otherwise no measurable features of input text correlate well with LLM DA performance.

Title: PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models

Authors: Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, Andrea Nascetti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04204
Pdf URL: https://arxiv.org/pdf/2412.04204
Copy Paste: [[2412.04204]] PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models(https://arxiv.org/abs/2412.04204)
Keywords: robust
Abstract: Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks, that are often too easy or too narrow, limiting the usefulness of the evaluations to assess the real-world applicability of GFMs. Additionally, there is a distinct lack of diversity in current evaluation protocols, which fail to account for the multiplicity of image resolutions, sensor types, and temporalities, which further complicates the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, questioning the global applicability of GFMs. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and widely applicable benchmark for GFMs. We evaluate the most popular GFMs openly available on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g. UNet and vanilla ViT), and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs, under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at this https URL.

Title: A Context-aware Framework for Translation-mediated Conversations

Authors: José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04205
Pdf URL: https://arxiv.org/pdf/2412.04205
Copy Paste: [[2412.04205]] A Context-aware Framework for Translation-mediated Conversations(https://arxiv.org/abs/2412.04205)
Keywords: large language model
Abstract: Effective communication is fundamental to any interaction, yet challenges arise when participants do not share a common language. Automatic translation systems offer a powerful solution to bridge language barriers in such scenarios, but they introduce errors that can lead to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings. During training, we leverage context-augmented parallel data, which allows the model to generate translations sensitive to conversational history. During inference, we perform quality-aware decoding with context-aware metrics to select the optimal translation from a pool of candidates. We validate both components of our framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, our framework consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
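The inference step described above, selecting the best translation from a pool of candidates with a context-aware metric, amounts to reranking. The sketch below shows that selection pattern only. `toy_context_score` is a deliberately crude, hypothetical stand-in (token overlap with the conversation context) and is not one of the quality metrics used in the paper.

```python
from typing import Callable, Sequence


def toy_context_score(candidate: str, context: str) -> float:
    """Hypothetical stand-in metric: count candidate tokens that also appear in the context."""
    context_tokens = set(context.lower().split())
    return float(sum(1 for tok in candidate.lower().split() if tok in context_tokens))


def select_translation(candidates: Sequence[str],
                       context: str,
                       score_fn: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest context-aware score (first one wins ties)."""
    return max(candidates, key=lambda c: score_fn(c, context))


pool = ["the bank of the river", "the bank account"]
history = "my account was charged twice"
best = select_translation(pool, history, toy_context_score)
```

Swapping `score_fn` for a learned, context-aware quality estimator leaves the selection logic unchanged.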

Title: Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

Authors: Chenyang Zhu, Bin Xiao, Lin Shi, Shoukun Xu, Xu Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04220
Pdf URL: https://arxiv.org/pdf/2412.04220
Copy Paste: [[2412.04220]] Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts(https://arxiv.org/abs/2412.04220)
Keywords: extraction, segmentation
Abstract: The recent Segment Anything Model (SAM) represents a significant breakthrough in scaling segmentation models, delivering strong performance across various downstream applications in the RGB modality. However, directly applying SAM to emerging visual modalities, such as depth and event data, results in suboptimal performance in multi-modal segmentation tasks. In this paper, we make the first attempt to adapt SAM for multi-modal semantic segmentation by proposing a Mixture of Low-Rank Adaptation Experts (MoE-LoRA) tailored for different input visual modalities. By training only the MoE-LoRA layers while keeping SAM's weights frozen, SAM's strong generalization and segmentation capabilities can be preserved for downstream tasks. Specifically, to address cross-modal inconsistencies, we propose a novel MoE routing strategy that adaptively generates weighted features across modalities, enhancing multi-modal feature integration. Additionally, we incorporate multi-scale feature extraction and fusion by adapting SAM's segmentation head and introducing an auxiliary segmentation head to effectively combine multi-scale features for improved segmentation performance. Extensive experiments were conducted on three multi-modal benchmarks: DELIVER, MUSES, and MCubeS. The results consistently demonstrate that the proposed method significantly outperforms state-of-the-art approaches across diverse scenarios. Notably, under the particularly challenging condition of missing modalities, our approach exhibits a substantial performance gain, achieving an improvement of 32.15% compared to existing methods.

Title: DistB-VNET: Distributed Cluster-based Blockchain Vehicular Ad-Hoc Networks through SDN-NFV for Smart City

Authors: Anichur Rahman, MD. Zunead Abedin Eidmum, Dipanjali Kundu, Mahir Hossain, MD Tanjum An Tashrif, Md Ahsan Karim, Md. Jahidul Islam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.04222
Pdf URL: https://arxiv.org/pdf/2412.04222
Copy Paste: [[2412.04222]] DistB-VNET: Distributed Cluster-based Blockchain Vehicular Ad-Hoc Networks through SDN-NFV for Smart City(https://arxiv.org/abs/2412.04222)
Keywords: security, privacy
Abstract: In the developing topic of smart cities, Vehicular Ad-Hoc Networks (VANETs) are crucial for providing successful interaction between vehicles and infrastructure. This research proposes a distributed Blockchain-based Vehicular Ad-hoc Network (DistB-VNET) architecture that includes binary malicious traffic classification, Software Defined Networking (SDN), and Network Function Virtualization (NFV) to ensure safe, scalable, and reliable vehicular networks in smart cities. The suggested framework uses a decentralized blockchain for safe data management and SDN-NFV for dynamic network management and resource efficiency, while a novel isolation forest algorithm works as an IDS (Intrusion Detection System). Further, "DistB-VNET" offers a dual-layer blockchain system, where a distributed blockchain provides safe communication between vehicles, while a centralized blockchain in the cloud is in charge of data verification and storage. This improves security, scalability, and adaptability, ensuring better traffic management, data security, and privacy in VANETs. Furthermore, the unsupervised isolation forest model achieves a high accuracy of 99.23% for detecting malicious traffic. Additionally, our results reveal that the method greatly improves network performance, offering decreased latency, increased security, and reduced congestion, making it an effective alternative for existing smart city infrastructures.

Title: DEIM: DETR with Improved Matching for Fast Convergence

Authors: Shihua Huang, Zhichao Lu, Xiaodong Cun, Yongjun Yu, Xiao Zhou, Xi Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04234
Pdf URL: https://arxiv.org/pdf/2412.04234
Copy Paste: [[2412.04234]] DEIM: DETR with Improved Matching for Fast Convergence(https://arxiv.org/abs/2412.04234)
Keywords: transformer
Abstract: We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at this https URL.

Title: Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots

Authors: Maria Paola Priola
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04235
Pdf URL: https://arxiv.org/pdf/2412.04235
Copy Paste: [[2412.04235]] Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots(https://arxiv.org/abs/2412.04235)
Keywords: large language model
Abstract: I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral, benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.

Title: VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Authors: Jiahao Zhang, Ryota Yoshihashi, Shunsuke Kitada, Atsuki Osanai, Yuta Nakashima
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04237
Pdf URL: https://arxiv.org/pdf/2412.04237
Copy Paste: [[2412.04237]] VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction(https://arxiv.org/abs/2412.04237)
Keywords: generative, large language model
Abstract: Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLMs), which show prominent multi-modal understanding capabilities. How, then, can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate that our method, combined with Gemini and without any additional training, achieves state-of-the-art (SOTA) layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.

Title: LMDM:Latent Molecular Diffusion Model For 3D Molecule Generation

Authors: Xiang Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04242
Pdf URL: https://arxiv.org/pdf/2412.04242
Copy Paste: [[2412.04242]] LMDM:Latent Molecular Diffusion Model For 3D Molecule Generation(https://arxiv.org/abs/2412.04242)
Keywords: diffusion
Abstract: In this work, we propose a latent molecular diffusion model that can make the generated 3D molecules rich in diversity and maintain rich geometric features. The model captures the information of the forces and local constraints between atoms so that the generated molecules can maintain Euclidean transformation and high level of effectiveness and diversity. We also use the lower-rank manifold advantage of the latent variables of the latent model to fuse the information of the forces between atoms to better maintain the geometric equivariant properties of the molecules. Because there is no need to perform information fusion encoding in stages like traditional encoders and decoders, this reduces the amount of calculation in the back-propagation process. The model keeps the forces and local constraints of particle bonds in the latent variable space, reducing the impact of underfitting on the surface of the network on the large position drift of the particle geometry, so that our model can converge earlier. We introduce a distribution control variable in each backward step to strengthen exploration and improve the diversity of generation. In the experiment, the quality of the samples we generated and the convergence speed of the model have been significantly improved.

Title: Quantifying the Limits of Segment Anything Model: Analyzing Challenges in Segmenting Tree-Like and Low-Contrast Structures

Authors: Yixin Zhang, Nicholas Konz, Kevin Kramer, Maciej A. Mazurowski
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.04243
Pdf URL: https://arxiv.org/pdf/2412.04243
Copy Paste: [[2412.04243]] Quantifying the Limits of Segment Anything Model: Analyzing Challenges in Segmenting Tree-Like and Low-Contrast Structures(https://arxiv.org/abs/2412.04243)
Keywords: segmentation
Abstract: Segment Anything Model (SAM) has shown impressive performance in interactive and zero-shot segmentation across diverse domains, suggesting that it has learned a general concept of "objects" from its large-scale training. However, we observed that SAM struggles with certain types of objects, particularly those featuring dense, tree-like structures and low textural contrast from their surroundings. These failure modes are critical for understanding its limitations in real-world use. In order to systematically examine this issue, we propose metrics to quantify two key object characteristics: tree-likeness and textural separability. Through extensive controlled synthetic experiments and testing on real datasets, we demonstrate that SAM's performance is noticeably correlated with these factors. We link these behaviors under the concept of "textural confusion", where SAM misinterprets local structure as global texture, leading to over-segmentation, or struggles to differentiate objects from similarly textured backgrounds. These findings offer the first quantitative framework to model SAM's challenges, providing valuable insights into its limitations and guiding future improvements for vision foundation models.

Title: Intriguing Properties of Robust Classification

Authors: Bernd Prach, Christoph H. Lampert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04245
Pdf URL: https://arxiv.org/pdf/2412.04245
Copy Paste: [[2412.04245]] Intriguing Properties of Robust Classification(https://arxiv.org/abs/2412.04245)
Keywords: robust
Abstract: Despite extensive research since the community learned about adversarial examples 10 years ago, we still do not know how to train high-accuracy classifiers that are guaranteed to be robust to small perturbations of their inputs. Previous works often argued that this might be because no classifier exists that is robust and accurate at the same time. However, in computer vision this assumption does not match reality, where humans are usually accurate and robust on most tasks of interest. We offer an alternative explanation and show that in certain settings robust generalization is only possible with unrealistically large amounts of data. More precisely, we find a setting where a robust classifier exists, it is easy to learn an accurate classifier, yet it requires an exponential amount of data to learn a robust classifier. Based on this theoretical result, we explore how well robust classifiers generalize on datasets such as CIFAR-10. We come to the conclusion that on these datasets, the limitation of current robust models also lies in the generalization, and that they require a lot of data to do well on the test set. We also show that the problem is not in the expressiveness or generalization capabilities of current architectures, and that there are low magnitude features in the data which are useful for non-robust generalization but are not available for robust classifiers.

Title: 3D Part Segmentation via Geometric Aggregation of 2D Visual Features

Authors: Marco Garosi, Riccardo Tedoldi, Davide Boscaini, Massimiliano Mancini, Nicu Sebe, Fabio Poiesi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04247
Pdf URL: https://arxiv.org/pdf/2412.04247
Copy Paste: [[2412.04247]] 3D Part Segmentation via Geometric Aggregation of 2D Visual Features(https://arxiv.org/abs/2412.04247)
Keywords: segmentation
Abstract: Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. However, naively applying VLMs in this context introduces several drawbacks, such as the need for meticulous prompt engineering, and fails to leverage the 3D geometric structure of objects. To address these limitations, we propose COPS, a COmprehensive model for Parts Segmentation that blends the semantics extracted from visual concepts and 3D geometry to effectively identify object parts. COPS renders a point cloud from multiple viewpoints, extracts 2D features, projects them back to 3D, and uses a novel geometric-aware feature aggregation procedure to ensure spatial and semantic consistency. Finally, it clusters points into parts and labels them. We demonstrate that COPS is efficient, scalable, and achieves zero-shot state-of-the-art performance across five datasets, covering synthetic and real-world data, texture-less and coloured objects, as well as rigid and non-rigid shapes. The code is available at this https URL.

Title: CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations

Authors: Subash Neupane, Himanshu Tripathi, Shaswata Mitra, Sean Bozorgzad, Sudip Mittal, Shahram Rahimi, Amin Amirlatifi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04254
Pdf URL: https://arxiv.org/pdf/2412.04254
Copy Paste: [[2412.04254]] CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations(https://arxiv.org/abs/2412.04254)
Keywords: robust
Abstract: This paper presents ClinicSum, a novel framework designed to automatically generate clinical summaries from patient-doctor conversations. It utilizes a two-module architecture: a retrieval-based filtering module that extracts Subjective, Objective, Assessment, and Plan (SOAP) information from conversation transcripts, and an inference module powered by fine-tuned Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to generate abstracted clinical summaries. To fine-tune the PLM, we created a training dataset consisting of 1,473 conversation-summary pairs by consolidating two publicly available datasets, FigShare and MTS-Dialog, with ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's effectiveness is evaluated through both automatic metrics (e.g., ROUGE, BERTScore) and expert human assessments. Results show that ClinicSum outperforms state-of-the-art PLMs, demonstrating superior precision, recall, and F-1 scores in automatic evaluations and receiving high preference from SMEs in human assessment, making it a robust solution for automated clinical summarization.

Title: SCADE: Scalable Command-line Anomaly Detection Engine

Authors: Vaishali Vinay, Anjali Mangal
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04259
Pdf URL: https://arxiv.org/pdf/2412.04259
Copy Paste: [[2412.04259]] SCADE: Scalable Command-line Anomaly Detection Engine(https://arxiv.org/abs/2412.04259)
Keywords: security, attack, robust, steal
Abstract: As command-line interfaces remain an integral part of high-computation environments, the risk of exploitation through stealthy, complex command-line abuse continues to grow. Conventional security solutions often struggle with these command-line-based anomalies due to their context-specific nature and lack of labeled data, especially in detecting rare, malicious patterns amidst legitimate, high-volume activity. This gap has left organizations vulnerable to sophisticated threats like Living-off-the-Land (LOL) attacks, where standard detection tools frequently miss or misclassify anomalous command-line behavior. We introduce Scalable Command-Line Anomaly Detection Engine (SCADE), which addresses these challenges by introducing a dual-layered detection framework that combines a global statistical analysis with local context-specific anomaly detection, using a novel ensemble of statistical models such as BM25 and Log Entropy, adapted for command-line data. The framework also features a dynamic thresholding mechanism for adaptive anomaly detection, ensuring high precision and recall even in environments with extremely high Signal-to-Noise Ratios (SNRs). Initial experimental results demonstrate the effectiveness of the framework, achieving above 98% SNR in identifying unusual command-line behavior while minimizing false positives. In this paper, we present SCADE's core architecture, including its metadata-enriched approach to anomaly detection and the design choices behind its scalability for enterprise-level deployment. We argue that SCADE represents a significant advancement in command-line anomaly detection, offering a robust, adaptive framework for security analysts and researchers seeking to enhance detection accuracy in high-computation environments.
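
For readers unfamiliar with BM25, one of the statistical models named above, the following is a minimal sketch of the standard Okapi BM25 scoring function applied to tokenized command lines. The abstract does not say how SCADE adapts BM25, so this shows only the textbook formula; the function name `bm25_scores`, the whitespace-style tokenization, and the defaults `k1=1.5`, `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    corpus: Sequence[Sequence[str]],
    query: Sequence[str],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every tokenized command line in `corpus` against `query` with Okapi BM25."""
    if not corpus:
        return []
    term_counts = [Counter(doc) for doc in corpus]
    lengths = [len(doc) for doc in corpus]
    n_docs = len(corpus)
    avg_len = sum(lengths) / n_docs
    # Document frequency: number of command lines that contain each token.
    doc_freq = Counter(tok for counts in term_counts for tok in counts)

    scores = []
    for counts, length in zip(term_counts, lengths):
        score = 0.0
        for tok in query:
            tf = counts.get(tok, 0)
            if tf == 0:
                continue
            # Rare tokens get a larger inverse document frequency weight.
            idf = math.log((n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5) + 1.0)
            norm = tf + k1 * (1.0 - b + b * length / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Usage: rank logged commands against tokens of interest (illustrative data).
commands = [
    ["ls", "-la"],
    ["ls", "/tmp"],
    ["certutil", "-urlcache", "-f"],
]
scores = bm25_scores(commands, ["certutil", "-urlcache"])
```

Because the inverse document frequency term grows as a token becomes rarer in the corpus, rare tokens dominate the score, which is one plausible reason a BM25-style weighting suits command-line data.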

Title: Enhancing Whole Slide Image Classification through Supervised Contrastive Domain Adaptation

Authors: Ilán Carretero, Pablo Meseguer, Rocío del Amor, Valery Naranjo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04260
Pdf URL: https://arxiv.org/pdf/2412.04260
Copy Paste: [[2412.04260]] Enhancing Whole Slide Image Classification through Supervised Contrastive Domain Adaptation(https://arxiv.org/abs/2412.04260)
Keywords: robust, extraction
Abstract: Domain shift in the field of histopathological imaging is a common phenomenon due to the intra- and inter-hospital variability of staining and digitization protocols. The implementation of robust models, capable of creating generalized domains, represents a need to be solved. In this work, a new domain adaptation method to deal with the variability between histopathological images from multiple centers is presented. In particular, our method adds a training constraint to the supervised contrastive learning approach to achieve domain adaptation and improve inter-class separability. Experiments performed on domain adaptation and classification of whole-slide images of six skin cancer subtypes from two centers demonstrate the method's usefulness. The results reflect superior performance compared to not using domain adaptation after feature extraction or staining normalization.

Title: SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Authors: Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04262
Pdf URL: https://arxiv.org/pdf/2412.04262
Copy Paste: [[2412.04262]] SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction(https://arxiv.org/abs/2412.04262)
Keywords: extraction, generative, large language model
Abstract: Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.

Title: Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

Authors: Zaid Alyafeai, Michael Pieler, Hannah Teufel, Jonathan Tow, Marco Bellagente, Duy Phung, Nikhil Pinnaparaju, Reshinth Adithyan, Paulo Rocha, Maksym Zhuravinskyi, Carlos Riquelme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04277
Pdf URL: https://arxiv.org/pdf/2412.04277
Copy Paste: [[2412.04277]] Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic(https://arxiv.org/abs/2412.04277)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown impressive results in multiple domains of natural language processing (NLP) but are mainly focused on the English language. Recently, more LLMs have incorporated a larger proportion of multilingual text to represent low-resource languages. In Arabic NLP, several Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the past two years. However, most Arabic LLMs have more than 7 billion parameters, which increases their hardware requirements and inference latency, when compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable LM 1.6B chat model achieves impressive results on several benchmarks beating multiple models with up to 8x the parameters. In addition, we show the benefit of mixing in synthetic instruction tuning data by augmenting our fine-tuning data with a large synthetic dialogue dataset.

Title: Learnable Infinite Taylor Gaussian for Dynamic View Rendering

Authors: Bingbing Hu, Yanyan Li, Rui Xie, Bo Xu, Haoye Dong, Junfeng Yao, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04282
Pdf URL: https://arxiv.org/pdf/2412.04282
Copy Paste: [[2412.04282]] Learnable Infinite Taylor Gaussian for Dynamic View Rendering(https://arxiv.org/abs/2412.04282)
Keywords: robust, interpretability
Abstract: Capturing the temporal evolution of Gaussian properties such as position, rotation, and scale is a challenging task due to the vast number of time-varying parameters and the limited photometric data available, which generally results in convergence issues, making it difficult to find an optimal solution. While feeding all inputs into an end-to-end neural network can effectively model complex temporal dynamics, this approach lacks explicit supervision and struggles to generate high-quality transformation fields. On the other hand, using time-conditioned polynomial functions to model Gaussian trajectories and orientations provides a more explicit and interpretable solution, but requires significant handcrafted effort and lacks generalizability across diverse scenes. To overcome these limitations, this paper introduces a novel approach based on a learnable infinite Taylor Formula to model the temporal evolution of Gaussians. This method offers both the flexibility of an implicit network-based approach and the interpretability of explicit polynomial functions, allowing for more robust and generalizable modeling of Gaussian dynamics across various dynamic scenes. Extensive experiments on dynamic novel view rendering tasks are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in this domain. More information is available on our project page(this https URL).
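
As a point of reference for the polynomial modeling discussed above, here is a minimal sketch of evaluating a truncated Taylor expansion of a Gaussian's position over time, p(t) = sum_k c_k (t - t0)^k / k!. It does not reproduce the paper's learnable infinite formulation, which the abstract does not detail; the function name `taylor_position` and the fixed coefficient list are assumptions for illustration.

```python
from math import factorial
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def taylor_position(coeffs: Sequence[Vec3], t: float, t0: float = 0.0) -> Vec3:
    """Evaluate a truncated Taylor polynomial for a 3D position at time t.

    coeffs[k] is the k-th time derivative of the position at t0
    (position, velocity, acceleration, ...).
    """
    dt = t - t0
    x, y, z = (
        sum(c[axis] * dt**k / factorial(k) for k, c in enumerate(coeffs))
        for axis in range(3)
    )
    return (x, y, z)


# Usage: start at x=1 with velocity 2 and acceleration 4 along the x axis.
coeffs = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
pos = taylor_position(coeffs, t=0.5)  # 1 + 2*0.5 + 4*0.25/2 = 2.5 along x
```

In a learned setting the coefficients would be trainable parameters per Gaussian rather than fixed values, and rotation and scale could be handled the same way.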

Title: Evolutionary Pre-Prompt Optimization for Mathematical Reasoning

Authors: Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.04291
Pdf URL: https://arxiv.org/pdf/2412.04291
Copy Paste: [[2412.04291]] Evolutionary Pre-Prompt Optimization for Mathematical Reasoning(https://arxiv.org/abs/2412.04291)
Keywords: large language model
Abstract: Recent advancements have highlighted that large language models (LLMs), when given a small set of task-specific examples, demonstrate remarkable proficiency, a capability that extends to complex reasoning tasks. In particular, the combination of few-shot learning with the chain-of-thought (CoT) approach has been pivotal in steering models towards more logically consistent conclusions. This paper explores the optimization of example selection for designing effective CoT pre-prompts and shows that the choice of the optimization algorithm, typically in favor of comparison-based methods such as evolutionary computation, significantly enhances efficacy and feasibility. Specifically, thanks to a limited exploitative and overfitted optimization, Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the naive few-shot approach exceeding 10 absolute points in exact match scores on benchmark datasets such as GSM8k and MathQA. These gains are consistent across various contexts and are further amplified when integrated with self-consistency (SC).
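
The idea of selecting few-shot examples with a comparison-based evolutionary method can be sketched with a generic (1+1) evolutionary search over subsets of an example pool. This is not the EPPO algorithm itself, whose details the abstract does not give; the function name `evolve_pre_prompt`, the swap mutation, and the toy fitness are assumptions. In practice the fitness would be something like exact-match accuracy of the model when prompted with the selected examples.

```python
import random
from typing import Callable, List, Sequence, Tuple


def evolve_pre_prompt(
    pool_size: int,
    k: int,
    fitness: Callable[[Sequence[int]], float],
    generations: int = 500,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """(1+1) evolutionary search for k example indices that maximize `fitness`."""
    rng = random.Random(seed)
    parent = rng.sample(range(pool_size), k)
    parent_fit = fitness(parent)
    for _ in range(generations):
        unused = [i for i in range(pool_size) if i not in parent]
        if not unused:
            break  # every example is already selected; nothing to swap in
        # Mutation: replace one selected example with an unselected one.
        child = list(parent)
        child[rng.randrange(k)] = rng.choice(unused)
        child_fit = fitness(child)
        # Comparison-based selection: only the ordering of fitness values matters.
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, parent_fit


# Usage with a toy fitness: each pool example has a hidden, made-up usefulness.
usefulness = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
best, best_fit = evolve_pre_prompt(
    pool_size=len(usefulness),
    k=2,
    fitness=lambda idx: sum(usefulness[i] for i in idx),
)
```

Because acceptance depends only on whether the child is at least as good as the parent, the search needs no gradients and tolerates a noisy or non-differentiable score such as exact match.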

Title: SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Authors: Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, Guangliang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04292
Pdf URL: https://arxiv.org/pdf/2412.04292
Copy Paste: [[2412.04292]] SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model(https://arxiv.org/abs/2412.04292)
Keywords: generative
Abstract: The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released.

Title: SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

Authors: Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04301
Pdf URL: https://arxiv.org/pdf/2412.04301
Copy Paste: [[2412.04301]] SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion(https://arxiv.org/abs/2412.04301)
Keywords: diffusion
Abstract: Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing, which is at least 50 times faster than previous multi-step methods while maintaining competitive performance in editing results. Our project page is at: this https URL

Title: Towards Zero-shot 3D Anomaly Localization

Authors: Yizhou Wang, Kuan-Chuan Peng, Yun Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04304
Pdf URL: https://arxiv.org/pdf/2412.04304
Copy Paste: [[2412.04304]] Towards Zero-shot 3D Anomaly Localization(https://arxiv.org/abs/2412.04304)
Keywords: privacy
Abstract: 3D anomaly detection and localization is of great significance for industrial inspection. Prior 3D anomaly detection and localization methods focus on the setting that the testing data share the same category as the training data which is normal. However, in real-world applications, the normal training data for the target 3D objects can be unavailable due to issues like data privacy or export control regulation. To tackle these challenges, we identify a new task -- zero-shot 3D anomaly detection and localization, where the training and testing classes do not overlap. To this end, we design 3DzAL, a novel patch-level contrastive learning framework based on pseudo anomalies generated using the inductive bias from task-irrelevant 3D xyz data to learn more representative feature representations. Furthermore, we train a normalcy classifier network to classify the normal patches and pseudo anomalies and utilize the classification result jointly with feature distance to design anomaly scores. Instead of directly using the patch point clouds, we introduce adversarial perturbations to the input patch xyz data before feeding into the 3D normalcy classifier for the classification-based anomaly score. We show that 3DzAL outperforms the state-of-the-art anomaly detection and localization performance.

Title: ALMA: Alignment with Minimal Annotation

Authors: Michihiro Yasunaga, Leonid Shamis, Chunting Zhou, Andrew Cohen, Jason Weston, Luke Zettlemoyer, Marjan Ghazvininejad
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04305
Pdf URL: https://arxiv.org/pdf/2412.04305
Copy Paste: [[2412.04305]] ALMA: Alignment with Minimal Annotation(https://arxiv.org/abs/2412.04305)
Keywords: large language model
Abstract: Recent approaches to large language model (LLM) alignment typically require millions of human annotations or rely on external aligned models for synthetic data generation. This paper introduces ALMA: Alignment with Minimal Annotation, demonstrating that effective alignment can be achieved using only 9,000 labeled examples -- less than 1% of conventional approaches. ALMA generates large amounts of high-quality synthetic alignment data through new techniques: diverse prompt synthesis via few-shot learning, diverse response generation with multiple model checkpoints, and judge (reward model) enhancement through score aggregation and self-distillation. Using only a pretrained Llama3 base model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves performance close to Llama3-Instruct across diverse alignment benchmarks (e.g., 0.1% difference on AlpacaEval 2.0 score). These results are achieved with a multi-round, self-bootstrapped data synthesis and training recipe that continues to improve for 10 rounds, surpassing the typical 3-round ceiling of previous methods. These results suggest that base models already possess sufficient knowledge for effective alignment, and that synthetic data generation methods can expose it.

Title: FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Authors: Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04317
Pdf URL: https://arxiv.org/pdf/2412.04317
Copy Paste: [[2412.04317]] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression(https://arxiv.org/abs/2412.04317)
Keywords: large language model
Abstract: Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens still used limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a number of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that compared with these advanced tiny MLLMs, our FlashSloth can greatly reduce the number of visual tokens, training memory and computation complexity while retaining high performance on various VL tasks.

Title: The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04318
Pdf URL: https://arxiv.org/pdf/2412.04318
Copy Paste: [[2412.04318]] The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation(https://arxiv.org/abs/2412.04318)
Keywords: generative, large language model
Abstract: This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
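Note (illustration, not from the paper): the abstract's closing remark about low-entropy predictions can be made concrete with a small sketch. The two distributions below are invented for illustration, and `entropy_bits` and `greedy_token` are hypothetical helper names; nothing here reproduces the authors' models or measurements.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def greedy_token(probs):
    """Greedy decoding: return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Invented example distributions over a 4-token vocabulary (assumption).
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain prediction
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly all mass on one token

print(entropy_bits(uniform))
print(entropy_bits(peaked))
print(greedy_token(peaked))
```

A distribution that places nearly all probability on one token has entropy close to zero, which is the regime the abstract describes for hyperfitted models.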

Title: GRAM: Generalization in Deep RL with a Robust Adaptation Module

Authors: James Queeney, Xiaoyi Cai, Mouhacine Benosman, Jonathan P. How
Subjects: cs.LG, cs.AI, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2412.04323
Pdf URL: https://arxiv.org/pdf/2412.04323
Copy Paste: [[2412.04323]] GRAM: Generalization in Deep RL with a Robust Adaptation Module(https://arxiv.org/abs/2412.04323)
Keywords: robust
Abstract: The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate on a variety of realistic simulated locomotion tasks with a quadruped robot.

Title: Understanding Student Sentiment on Mental Health Support in Colleges Using Large Language Models

Authors: Palak Sood, Chengyang He, Divyanshu Gupta, Yue Ning, Ping Wang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.04326
Pdf URL: https://arxiv.org/pdf/2412.04326
Copy Paste: [[2412.04326]] Understanding Student Sentiment on Mental Health Support in Colleges Using Large Language Models(https://arxiv.org/abs/2412.04326)
Keywords: large language model
Abstract: Mental health support in colleges is vital in educating students by offering counseling services and organizing supportive events. However, evaluating its effectiveness faces challenges like data collection difficulties and lack of standardized metrics, limiting research scope. Student feedback is crucial for evaluation but often relies on qualitative analysis without systematic investigation using advanced machine learning methods. This paper uses public Student Voice Survey data to analyze student sentiments on mental health support with large language models (LLMs). We created a sentiment analysis dataset, SMILE-College, with human-machine collaboration. The investigation of both traditional machine learning methods and state-of-the-art LLMs showed the best performance of GPT-3.5 and BERT on this new dataset. The analysis highlights challenges in accurately predicting response sentiments and offers practical insights on how LLMs can enhance mental health-related research and improve college mental health services. This data-driven approach will facilitate efficient and informed mental health support evaluation, management, and decision-making.

Title: Liquid: Language Models are Scalable Multi-modal Generators

Authors: Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04332
Pdf URL: https://arxiv.org/pdf/2412.04332
Copy Paste: [[2412.04332]] Liquid: Language Models are Scalable Multi-modal Generators(https://arxiv.org/abs/2412.04332)
Keywords: large language model
Abstract: We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as the model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We show that existing LLMs can serve as strong foundations for Liquid, saving 100x in training costs while outperforming Chameleon in multimodal capabilities and maintaining language performance comparable to mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. This work demonstrates that LLMs such as LLAMA3.2 and GEMMA2 are powerful multimodal generators, offering a scalable solution for enhancing both vision-language understanding and generation. The code and models will be released.

Title: Reflective Teacher: Semi-Supervised Multimodal 3D Object Detection in Bird's-Eye-View via Uncertainty Measure

Authors: Saheli Hazra, Sudip Das, Rohit Choudhary, Arindam Das, Ganesh Sistu, Ciaran Eising, Ujjwal Bhattacharya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04337
Pdf URL: https://arxiv.org/pdf/2412.04337
Copy Paste: [[2412.04337]] Reflective Teacher: Semi-Supervised Multimodal 3D Object Detection in Bird's-Eye-View via Uncertainty Measure(https://arxiv.org/abs/2412.04337)
Keywords: extraction
Abstract: Applying pseudo labeling techniques has been found to be advantageous in semi-supervised 3D object detection (SSOD) in Bird's-Eye-View (BEV) for autonomous driving, particularly where labeled data is limited. In the literature, Exponential Moving Average (EMA) has been used to adjust the weights of the teacher network from those of the student network. However, this induces catastrophic forgetting in the teacher network. In this work, we address this issue by introducing a novel concept of Reflective Teacher where the student is trained by both labeled and pseudo labeled data while its knowledge is progressively passed to the teacher through a regularizer to ensure retention of previous knowledge. Additionally, we propose Geometry Aware BEV Fusion (GA-BEVFusion) for efficient alignment of multi-modal BEV features, thus reducing the disparity between the modalities - camera and LiDAR. This helps to map the precise geometric information embedded among LiDAR points reliably with the spatial priors for extraction of semantic information from camera images. Our experiments on the nuScenes and Waymo datasets demonstrate: 1) improved performance over state-of-the-art methods in both fully supervised and semi-supervised settings; 2) Reflective Teacher achieves equivalent performance with only 25% and 22% of labeled data for nuScenes and Waymo datasets respectively, in contrast to other fully supervised methods that utilize the full labeled dataset.
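Note (illustration, not from the paper): the EMA teacher update that the abstract describes as the conventional baseline can be sketched as follows. This is the standard mean-teacher style rule, not the Reflective Teacher regularizer, whose form the abstract does not give. The function name `ema_update`, the flat weight lists, and the decay value are assumptions made for the example.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an exponential moving average.

    teacher, student: equal-length lists of floats (flattened weights).
    decay: fraction of the old teacher value retained at each step.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Toy weights (assumption): the teacher drifts slowly toward the student.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

Because each step overwrites part of the teacher with the current student, information from earlier students decays geometrically, which is one way to read the forgetting issue the abstract raises.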

Title: Retrieval-Augmented Machine Translation with Unstructured Knowledge

Authors: Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04342
Pdf URL: https://arxiv.org/pdf/2412.04342
Copy Paste: [[2412.04342]] Retrieval-Augmented Machine Translation with Unstructured Knowledge(https://arxiv.org/abs/2412.04342)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance models' MT ability. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. In addition, documents in different languages are provided to supply knowledge for these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.

Title: RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

Authors: Zhouyingcheng Liao, Mingyuan Zhang, Wenjia Wang, Lei Yang, Taku Komura
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.04343
Pdf URL: https://arxiv.org/pdf/2412.04343
Copy Paste: [[2412.04343]] RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse(https://arxiv.org/abs/2412.04343)
Keywords: diffusion
Abstract: While motion generation has made substantial progress, its practical application remains constrained by dataset diversity and scale, limiting its ability to handle out-of-distribution scenarios. To address this, we propose a simple and effective baseline, RMD, which enhances the generalization of motion generation through retrieval-augmented techniques. Unlike previous retrieval-based methods, RMD requires no additional training and offers three key advantages: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with an LLM facilitating splitting and recombination; and (3) a pre-trained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. Without any training, RMD achieves state-of-the-art performance, with notable advantages on out-of-distribution data.

Title: Distributionally Robust Performative Prediction

Authors: Songkai Xue, Yuekai Sun
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.04346
Pdf URL: https://arxiv.org/pdf/2412.04346
Copy Paste: [[2412.04346]] Distributionally Robust Performative Prediction(https://arxiv.org/abs/2412.04346)
Keywords: robust
Abstract: Performative prediction aims to model scenarios where predictive outcomes subsequently influence the very systems they target. The pursuit of a performative optimum (PO) -- minimizing performative risk -- is generally reliant on modeling of the distribution map, which characterizes how a deployed ML model alters the data distribution. Unfortunately, inevitable misspecification of the distribution map can lead to a poor approximation of the true PO. To address this issue, we introduce a novel framework of distributionally robust performative prediction and study a new solution concept termed the distributionally robust performative optimum (DRPO). We show provable guarantees for DRPO as a robust approximation to the true PO when the nominal distribution map is different from the actual one. Moreover, distributionally robust performative prediction can be reformulated as an augmented performative prediction problem, enabling efficient optimization. The experimental results demonstrate that DRPO offers potential advantages over the traditional PO approach when the distribution map is misspecified at either the micro- or macro-level.

Title: VMGuard: Reputation-Based Incentive Mechanism for Poisoning Attack Detection in Vehicular Metaverse

Authors: Ismail Lotfi, Marwa Qaraqe, Ali Ghrayeb, Dusit Niyato
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.04349
Pdf URL: https://arxiv.org/pdf/2412.04349
Copy Paste: [[2412.04349]] VMGuard: Reputation-Based Incentive Mechanism for Poisoning Attack Detection in Vehicular Metaverse(https://arxiv.org/abs/2412.04349)
Keywords: security, protect, attack
Abstract: The vehicular Metaverse represents an emerging paradigm that merges vehicular communications with virtual environments, integrating real-world data to enhance in-vehicle services. However, this integration faces critical security challenges, particularly in the data collection layer where malicious sensing IoT (SIoT) devices can compromise service quality through data poisoning attacks. The security aspects of the Metaverse services should be well addressed both when creating the digital twins of the physical systems and when delivering the virtual service to the vehicular Metaverse users (VMUs). This paper introduces vehicular Metaverse guard (VMGuard), a novel four-layer security framework that protects vehicular Metaverse systems from data poisoning attacks. Specifically, when the virtual service providers (VSPs) collect data about the physical environment through SIoT devices in the field, the delivered content might be tampered with. Malicious SIoT devices with moral hazard might have private incentives to provide poisoned data to the VSP to degrade the quality of service (QoS) and quality of experience (QoE) of the VMUs. The proposed framework implements a reputation-based incentive mechanism that leverages user feedback and subjective logic modeling to assess the trustworthiness of participating SIoT devices. More precisely, the framework entails the use of reputation scores assigned to participating SIoT devices based on their historical engagements with the VSPs. Ultimately, we validate our proposed model using comprehensive simulations. Our key findings indicate that our mechanism effectively prevents the initiation of poisoning attacks by malicious SIoT devices. Additionally, our system ensures that reliable SIoT devices, previously misclassified, are not barred from participating in future rounds of the market.
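Note (illustration, not from the paper): the abstract mentions reputation scores derived from user feedback and subjective logic but does not give the formulas. The sketch below uses the textbook binomial-opinion mapping from subjective logic (belief, disbelief, and uncertainty computed from counts of positive and negative feedback). The function names, the prior weight of 2, and the base rate of 0.5 are assumptions for the example, not VMGuard's actual parameters.

```python
def opinion(positive, negative, prior_weight=2.0):
    """Map feedback counts to a (belief, disbelief, uncertainty) opinion.

    positive, negative: counts of good and bad interactions with a device.
    prior_weight: weight of the non-informative prior (assumed to be 2).
    """
    total = positive + negative + prior_weight
    return positive / total, negative / total, prior_weight / total

def reputation(positive, negative, base_rate=0.5, prior_weight=2.0):
    """Expected trustworthiness: belief plus the base-rate share of uncertainty."""
    belief, _, uncertainty = opinion(positive, negative, prior_weight)
    return belief + base_rate * uncertainty

print(reputation(0, 0))   # no history: falls back to the base rate
print(reputation(8, 0))   # consistently good feedback
print(reputation(0, 8))   # consistently bad feedback
```

With no history the score equals the base rate, and uncertainty shrinks as feedback accumulates, so a device that was judged harshly early on can recover its score through later positive interactions.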

Title: ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Authors: Dayoung Gong, Suha Kwak, Minsu Cho
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04353
Pdf URL: https://arxiv.org/pdf/2412.04353
Copy Paste: [[2412.04353]] ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation(https://arxiv.org/abs/2412.04353)
Keywords: diffusion, segmentation
Abstract: Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea behind the unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner; the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both tasks with a single unified model through joint learning.
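Note (illustration, not from the paper): the anticipative masking idea in the abstract, where a late part of the video is hidden and replaced by tokens, can be sketched on a toy sequence. The helper names, the `observed_ratio` parameter, and the string placeholder standing in for a learnable token are all assumptions; the abstract does not specify how the masked span is chosen.

```python
def anticipative_mask(num_frames, observed_ratio):
    """Return a visibility flag per frame: True for the observed prefix,
    False for the late frames that are hidden during training."""
    num_visible = int(num_frames * observed_ratio)
    return [i < num_visible for i in range(num_frames)]

def apply_mask(frames, visible, mask_token="<MASK>"):
    """Replace hidden frames with a placeholder token.

    In a real model the placeholder would be a learnable embedding;
    a string is used here only to keep the sketch self-contained.
    """
    return [f if v else mask_token for f, v in zip(frames, visible)]

frames = ["f0", "f1", "f2", "f3"]
visible = anticipative_mask(len(frames), observed_ratio=0.5)
print(apply_mask(frames, visible))
```

The visible prefix corresponds to the segmentation input and the masked suffix to the future that the model is asked to anticipate.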

Title: Machine Theory of Mind for Autonomous Cyber-Defence

Authors: Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2412.04367
Pdf URL: https://arxiv.org/pdf/2412.04367
Copy Paste: [[2412.04367]] Machine Theory of Mind for Autonomous Cyber-Defence(https://arxiv.org/abs/2412.04367)
Keywords: security, attack, robust
Abstract: Intelligent autonomous agents hold much potential for the domain of cyber-security. However, due to many state-of-the-art approaches relying on uninterpretable black-box models, there is growing demand for methods that offer stakeholders clear and actionable insights into their latent beliefs and motivations. To address this, we evaluate Theory of Mind (ToM) approaches for Autonomous Cyber Operations. Upon learning a robust prior, ToM models can predict an agent's goals, behaviours, and contextual beliefs given only a handful of past behaviour observations. In this paper, we introduce a novel Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence, Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets and attack trajectories of adversarial cyber agents over arbitrary computer network topologies. To evaluate the latter, we propose a novel extension of the Wasserstein distance for measuring the similarity of graph-based probability distributions. Whereas the standard Wasserstein distance lacks a fixed reference scale, we introduce a graph-theoretic normalization factor that enables a standardized comparison between networks of different sizes. We furnish this metric, which we term the Network Transport Distance (NTD), with a weighting function that emphasizes predictions according to custom node features, allowing network operators to explore arbitrary strategic considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM architecture in an abstract cyber-defence environment, our empirical evaluations show that GIGO-ToM can accurately predict the goals and behaviours of various unseen cyber-attacking agents across a range of network topologies, as well as learn embeddings that can effectively characterize their policies.

Title: A Hitchhiker's Guide to Understanding Performances of Two-Class Classifiers

Authors: Anaïs Halin, Sébastien Piérard, Anthony Cioppa, Marc Van Droogenbroeck
Subjects: cs.CV, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2412.04377
Pdf URL: https://arxiv.org/pdf/2412.04377
Copy Paste: [[2412.04377]] A Hitchhiker's Guide to Understanding Performances of Two-Class Classifiers(https://arxiv.org/abs/2412.04377)
Keywords: segmentation
Abstract: Properly understanding the performances of classifiers is essential in various scenarios. However, the literature often relies only on one or two standard scores to compare classifiers, which fails to capture the nuances of application-specific requirements, potentially leading to suboptimal classifier selection. Recently, a paper on the foundations of the theory of performance-based ranking introduced a tool, called the Tile, that organizes an infinity of ranking scores into a 2D map. Thanks to the Tile, it is now possible to evaluate and compare classifiers efficiently, displaying all possible application-specific preferences instead of having to rely on a pair of scores. In this paper, we provide a first hitchhiker's guide for understanding the performances of two-class classifiers by presenting four scenarios, each showcasing a different user profile: a theoretical analyst, a method designer, a benchmarker, and an application developer. Particularly, we show that we can provide different interpretative flavors that are adapted to the user's needs by mapping different values on the Tile. As an illustration, we leverage the newly introduced Tile tool and the different flavors to rank and analyze the performances of 74 state-of-the-art semantic segmentation models in two-class classification through the eyes of the four user profiles. Through these user profiles, we demonstrate that the Tile effectively captures the behavior of classifiers in a single visualization, while accommodating an infinite number of ranking scores.

Title: Discriminative Fine-tuning of LVLMs

Authors: Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04378
Pdf URL: https://arxiv.org/pdf/2412.04378
Copy Paste: [[2412.04378]] Discriminative Fine-tuning of LVLMs(https://arxiv.org/abs/2412.04378)
Keywords: generative
Abstract: Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

Title: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction

Authors: Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04384
Pdf URL: https://arxiv.org/pdf/2412.04384
Copy Paste: [[2412.04384]] Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction(https://arxiv.org/abs/2412.04384)
Keywords: robust
Abstract: 3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving, which predicts fine-grained geometry and semantics of the surrounding scene. Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes. Although 3D semantic Gaussians serve as an object-centric sparse alternative, most of the Gaussians still describe the empty region with low efficiency. To address this, we propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry. Furthermore, we adopt the exact Gaussian mixture model for semantics calculation to avoid unnecessary overlapping of Gaussians. To effectively initialize Gaussians in non-empty regions, we design a distribution-based initialization module which learns the pixel-aligned occupancy distribution instead of the depth of surfaces. We conduct extensive experiments on the nuScenes and KITTI-360 datasets, and our GaussianFormer-2 achieves state-of-the-art performance with high efficiency. Code: this https URL.
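Note (illustration, not from the paper): one way to read "probabilistic multiplication" is that a point is empty only if every Gaussian leaves it empty, so the overall occupancy is one minus the product of the per-Gaussian emptiness probabilities. The sketch below assumes that reading, and further assumes isotropic Gaussians whose unnormalized density is used directly as an occupancy probability. The function names and example values are hypothetical.

```python
import math

def gaussian_occupancy(point, mean, sigma):
    """Probability that `point` is occupied according to one isotropic
    Gaussian (assumed form: exp of minus half the squared scaled distance)."""
    sq_dist = sum((p - m) ** 2 for p, m in zip(point, mean))
    return math.exp(-0.5 * sq_dist / sigma ** 2)

def superposed_occupancy(point, gaussians):
    """Combine Gaussians by multiplying their emptiness probabilities."""
    prob_empty = 1.0
    for mean, sigma in gaussians:
        prob_empty *= 1.0 - gaussian_occupancy(point, mean, sigma)
    return 1.0 - prob_empty

# Two invented Gaussians given as (mean, sigma) pairs (assumption).
scene = [((0.0, 0.0, 0.0), 1.0), ((2.0, 0.0, 0.0), 1.0)]
print(superposed_occupancy((0.0, 0.0, 0.0), scene))
print(superposed_occupancy((1.0, 0.0, 0.0), scene))
print(superposed_occupancy((100.0, 0.0, 0.0), scene))
```

Unlike a plain sum of contributions, this combination stays within [0, 1] however many Gaussians overlap at a point.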

Title: Federated Automated Feature Engineering

Authors: Tom Overman, Diego Klabjan
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.04404
Pdf URL: https://arxiv.org/pdf/2412.04404
Copy Paste: [[2412.04404]] Federated Automated Feature Engineering(https://arxiv.org/abs/2412.04404)
Keywords: federate
Abstract: Automated feature engineering (AutoFE) is used to automatically create new features from original features to improve predictive performance without needing significant human intervention and expertise. Many algorithms exist for AutoFE, but very few approaches exist for the federated learning (FL) setting where data is gathered across many clients and is not shared between clients or a central server. We introduce AutoFE algorithms for the horizontal, vertical, and hybrid FL settings, which differ in how the data is gathered across clients. To the best of our knowledge, we are the first to develop AutoFE algorithms for the horizontal and hybrid FL cases, and we show that the downstream model performance of federated AutoFE is similar to the case where data is held centrally and AutoFE is performed centrally.

Title: FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning

Authors: Pranab Sahoo, Ashutosh Tripathi, Sriparna Saha, Samrat Mondal
Subjects: cs.LG, cs.AI, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2412.04416
Pdf URL: https://arxiv.org/pdf/2412.04416
Copy Paste: [[2412.04416]] FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning(https://arxiv.org/abs/2412.04416)
Keywords: privacy, robust, federate
Abstract: Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it encounters significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model due to the heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we then introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, meticulously crafted to preserve previously acquired knowledge while maintaining an optimal equilibrium between local optimization and global model coherence. Secondly, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapts to each client's unique learning patterns, effectively addressing the challenges of diverse data across the network. Our comprehensive evaluation, conducted across three diverse real-world datasets, coupled with theoretical convergence guarantees, demonstrates the superior efficacy of our method compared to several established state-of-the-art approaches.

Title: Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Authors: Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04424
Pdf URL: https://arxiv.org/pdf/2412.04424
Copy Paste: [[2412.04424]] Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion(https://arxiv.org/abs/2412.04424)
Keywords: transformer, generative, large language model
Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and easier to adapt to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrate Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. this https URL

Title: Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Authors: Shaunak Halbe, Junjiao Tian, K J Joseph, James Seale Smith, Katherine Stevo, Vineeth N Balasubramanian, Zsolt Kira
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.04429
Pdf URL: https://arxiv.org/pdf/2412.04429
Copy Paste: [[2412.04429]] Grounding Descriptions in Images informs Zero-Shot Visual Recognition(https://arxiv.org/abs/2412.04429)
Keywords: large language model
Abstract: Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at this https URL.

Title: Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Authors: Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04431
Pdf URL: https://arxiv.org/pdf/2412.04431
Copy Paste: [[2412.04431]] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis(https://arxiv.org/abs/2412.04431)
Keywords: diffusion, transformer
Abstract: We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable of generating high-resolution, photorealistic images following language instruction. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer & classifier and a bitwise self-correction mechanism, remarkably improving the generation capacity and details. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.

Title: Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04432
Pdf URL: https://arxiv.org/pdf/2412.04432
Copy Paste: [[2412.04432]] Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation(https://arxiv.org/abs/2412.04432)
Keywords: robust, diffusion, large language model
Abstract: In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.

Title: Towards Real-Time Open-Vocabulary Video Instance Segmentation

Authors: Bin Yan, Martin Sundermeyer, David Joseph Tan, Huchuan Lu, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04434
Pdf URL: https://arxiv.org/pdf/2412.04434
Copy Paste: [[2412.04434]] Towards Real-Time Open-Vocabulary Video Instance Segmentation(https://arxiv.org/abs/2412.04434)
Keywords: segmentation
Abstract: In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real-time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS, and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and, (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY-VIS's potential for real-time applications in dynamic environments such as mobile robotics and augmented reality. Code and model will be released at this https URL.

Title: Learning Artistic Signatures: Symmetry Discovery and Style Transfer

Authors: Emma Finn, T. Anderson Keller, Emmanouil Theodosis, Demba E. Ba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04441
Pdf URL: https://arxiv.org/pdf/2412.04441
Copy Paste: [[2412.04441]] Learning Artistic Signatures: Symmetry Discovery and Style Transfer(https://arxiv.org/abs/2412.04441)
Keywords: robust, diffusion
Abstract: Despite nearly a decade of literature on style transfer, there is no undisputed definition of artistic style. State-of-the-art models produce impressive results but are difficult to interpret since, without a coherent definition of style, the problem of style transfer is inherently ill-posed. Early work framed style-transfer as an optimization problem but treated style as a measure only of texture. This led to artifacts in the outputs of early models where content features from the style image sometimes bled into the output image. Conversely, more recent work with diffusion models offers compelling empirical results but provides little theoretical grounding. To address these issues, we propose an alternative definition of artistic style. We suggest that style should be thought of as a set of global symmetries that dictate the arrangement of local textures. We validate this perspective empirically by learning the symmetries of a large dataset of paintings and showing that symmetries are predictive of the artistic movement to which each painting belongs. Finally, we show that by considering both local and global features, using both Lie generators and traditional measures of texture, we can quantitatively capture the stylistic similarity between artists better than with either set of features alone. This approach not only aligns well with art historians' consensus but also offers a robust framework for distinguishing nuanced stylistic differences, allowing for a more interpretable, theoretically grounded approach to style transfer.

Title: DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ping Luo, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04446
Pdf URL: https://arxiv.org/pdf/2412.04446
Copy Paste: [[2412.04446]] DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models(https://arxiv.org/abs/2412.04446)
Keywords: diffusion
Abstract: Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe utilizes deep tokens with a considerable compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained by leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable using readily available AR architectures, and is capable of generating videos ranging from a few seconds to one minute using only 4 A100 GPUs for training. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality while ensuring efficient training. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter sizes and observe a consistent improvement in performance as the model size increases from 100M to 3B. We believe that DiCoDe's exploration in academia represents a promising initial step toward scalable video modeling with AR language models, paving the way for the development of larger and more powerful video generation models.

Title: MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04448
Pdf URL: https://arxiv.org/pdf/2412.04448
Copy Paste: [[2412.04448]] MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation(https://arxiv.org/abs/2412.04448)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.

Title: p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Authors: Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.04449
Pdf URL: https://arxiv.org/pdf/2412.04449
Copy Paste: [[2412.04449]] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay(https://arxiv.org/abs/2412.04449)
Keywords: transformer, large language model
Abstract: Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. The majority of computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
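
Illustrative note (not from the paper): the abstract says only that PRD lowers the token retention ratio layer by layer along a shifted cosine schedule. The sketch below shows one plausible shape such a schedule could take. The function name `retention_ratios`, the parameters `start`, `end`, and `shift`, and the exact formula are all assumptions made for illustration, not the authors' definition.

```python
import math


def retention_ratios(num_layers, start=1.0, end=0.1, shift=0.0):
    """Hypothetical per-layer vision-token retention ratios.

    Decays from `start` at the first layer to `end` at the last layer
    along a half cosine. `shift` (in [0, 1)) delays the onset of the
    decay. All names and defaults are illustrative assumptions.
    """
    if not 0.0 <= shift < 1.0:
        raise ValueError("shift must be in [0, 1)")
    ratios = []
    for layer in range(num_layers):
        # Normalized depth of this layer in [0, 1].
        depth = layer / max(num_layers - 1, 1)
        # Shift the curve: no decay until depth exceeds `shift`.
        t = min(max(depth - shift, 0.0) / (1.0 - shift), 1.0)
        # Half-cosine interpolation from `start` (t=0) to `end` (t=1).
        ratios.append(end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t)))
    return ratios


if __name__ == "__main__":
    for i, r in enumerate(retention_ratios(8, shift=0.25)):
        print(f"layer {i}: keep {r:.2f} of vision tokens")
```

Under these assumptions, early layers keep nearly all vision tokens and deeper layers keep progressively fewer, which matches the abstract's observation that redundancy is higher in deeper layers.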

Title: Four-Plane Factorized Video Autoencoders

Authors: Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04452
Pdf URL: https://arxiv.org/pdf/2412.04452
Copy Paste: [[2412.04452]] Four-Plane Factorized Video Autoencoders(https://arxiv.org/abs/2412.04452)
Keywords: diffusion, generative
Abstract: Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high resolution data into a compressed lower dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains a rich representation needed for high-fidelity reconstructions despite the heavy compression, while simultaneously enabling LDMs to operate with significant improvements in speed and memory.

Title: HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery

Authors: Yuto Matsubara, Ko Nishino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04456
Pdf URL: https://arxiv.org/pdf/2412.04456
Copy Paste: [[2412.04456]] HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery(https://arxiv.org/abs/2412.04456)
Keywords: robust, transformer
Abstract: We introduce a novel method for human shape and pose recovery that can fully leverage multiple static views. We target fixed-multiview people monitoring, including elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but whose configuration may vary depending on the environment. Our key idea is to formulate it as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images, which is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heat map generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer including its accuracy, robustness to occlusion, and generalizability through an extensive set of experiments. We believe HeatFormer can serve a key role in passive human behavior modeling.

Title: Cubify Anything: Scaling Indoor 3D Object Detection

Authors: Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, Afshin Dehghan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04458
Pdf URL: https://arxiv.org/pdf/2412.04458
Copy Paste: [[2412.04458]] Cubify Anything: Scaling Indoor 3D Object Detection(https://arxiv.org/abs/2412.04458)
Keywords: transformer
Abstract: We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations in scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer-based 3D object detection baseline which, rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods, accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D, supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.

Title: LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

Authors: Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04460
Pdf URL: https://arxiv.org/pdf/2412.04460
Copy Paste: [[2412.04460]] LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors(https://arxiv.org/abs/2412.04460)
Keywords: diffusion, generative
Abstract: Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions, gaining popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an under-explored area. Layered content generation is crucial for creative workflows in fields like graphic design, animation, and digital art, where layer-based approaches are fundamental for flexible editing and composition. In this paper, we propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.

Title: 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

Authors: Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04462
Pdf URL: https://arxiv.org/pdf/2412.04462
Copy Paste: [[2412.04462]] 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion(https://arxiv.org/abs/2412.04462)
Keywords: diffusion, transformer
Abstract: We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. We propose a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).

Title: MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos

Authors: Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, Noah Snavely
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04463
Pdf URL: https://arxiv.org/pdf/2412.04463
Copy Paste: [[2412.04463]] MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos(https://arxiv.org/abs/2412.04463)
Keywords: robust
Abstract: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network-based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of a deep visual SLAM framework: with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times. See interactive results on our project page: this https URL

Title: Turbo3D: Ultra-fast Text-to-3D Generation

Authors: Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04470
Pdf URL: https://arxiv.org/pdf/2412.04470
Copy Paste: [[2412.04470]] Turbo3D: Ultra-fast Text-to-3D Generation(https://arxiv.org/abs/2412.04470)
Keywords: diffusion, transformer
Abstract: We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.

Title: PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Authors: Vinayak Gupta, Yunze Man, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04471
Pdf URL: https://arxiv.org/pdf/2412.04471
Copy Paste: [[2412.04471]] PaintScene4D: Consistent 4D Scene Generation from Text Prompts(https://arxiv.org/abs/2412.04471)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at this https URL

Title: Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

Authors: Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04472
Pdf URL: https://arxiv.org/pdf/2412.04472
Copy Paste: [[2412.04472]] Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail(https://arxiv.org/abs/2412.04472)
Keywords: robust
Abstract: We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.