2024-11-26

Title: Can Open-source LLMs Enhance Data Augmentation for Toxic Detection?: An Experimental Study

Authors: Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15175
Pdf URL: https://arxiv.org/pdf/2411.15175
Copy Paste: [[2411.15175]] Can Open-source LLMs Enhance Data Augmentation for Toxic Detection?: An Experimental Study(https://arxiv.org/abs/2411.15175)
Keywords: robust, explainability
Abstract: High-quality, diverse harmful data is essential to addressing real-time applications in content moderation. Current state-of-the-art approaches to toxic content detection using GPT series models are costly and lack explainability. This paper investigates the use of prompt engineering and fine-tuning techniques on open-source LLMs to enhance harmful data augmentation specifically for toxic content detection. We conduct a two-stage empirical study, with stage 1 evaluating six open-source LLMs across multiple datasets using only prompt engineering and stage 2 focusing on fine-tuning. Our findings indicate that Mistral can excel in generating harmful data with minimal hallucination. While fine-tuning these models improves data quality and diversity, challenges such as data duplication and overfitting persist. Our experimental results highlight scalable, cost-effective strategies for enhancing toxic content detection systems. These findings not only demonstrate the potential of open-source LLMs in creating robust content moderation tools. The application of this method in real industrial scenarios further proves the feasibility and efficiency of the fine-tuned open-source LLMs for data augmentation. We hope our study will aid in understanding the capabilities and limitations of current models in toxic content detection and drive further advancements in this field.

Title: Hybrid Gaussian Process Regression with Temporal Feature Extraction for Partially Interpretable Remaining Useful Life Interval Prediction in Aeroengine Prognostics

Authors: Tian Niu, Zijun Xu, Heng Luo, Ziqing Zhou
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15185
Pdf URL: https://arxiv.org/pdf/2411.15185
Copy Paste: [[2411.15185]] Hybrid Gaussian Process Regression with Temporal Feature Extraction for Partially Interpretable Remaining Useful Life Interval Prediction in Aeroengine Prognostics(https://arxiv.org/abs/2411.15185)
Keywords: robust, extraction, interpretability
Abstract: The estimation of Remaining Useful Life (RUL) plays a pivotal role in intelligent manufacturing systems and Industry 4.0 technologies. While recent advancements have improved RUL prediction, many models still face interpretability and compelling uncertainty modeling challenges. This paper introduces a modified Gaussian Process Regression (GPR) model for RUL interval prediction, tailored for the complexities of manufacturing process development. The modified GPR predicts confidence intervals by learning from historical data and addresses uncertainty modeling in a more structured way. The approach effectively captures intricate time-series patterns and dynamic behaviors inherent in modern manufacturing systems by coupling GPR with deep adaptive learning-enhanced AI process models. Moreover, the model evaluates feature significance to ensure more transparent decision-making, which is crucial for optimizing manufacturing processes. This comprehensive approach supports more accurate RUL predictions and provides transparent, interpretable insights into uncertainty, contributing to robust process development and management.

Title: Gradient-Weighted Feature Back-Projection: A Fast Alternative to Feature Distillation in 3D Gaussian Splatting

Authors: Joji Joseph, Bharadwaj Amrutur, Shalabh Bhatnagar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15193
Pdf URL: https://arxiv.org/pdf/2411.15193
Copy Paste: [[2411.15193]] Gradient-Weighted Feature Back-Projection: A Fast Alternative to Feature Distillation in 3D Gaussian Splatting(https://arxiv.org/abs/2411.15193)
Keywords: segmentation
Abstract: We introduce a training-free method for feature field rendering in Gaussian splatting. Our approach back-projects 2D features into pre-trained 3D Gaussians, using a weighted sum based on each Gaussian's influence in the final rendering. While most training-based feature field rendering methods excel at 2D segmentation but perform poorly at 3D segmentation without post-processing, our method achieves high-quality results in both 2D and 3D segmentation. Experimental results demonstrate that our approach is fast, scalable, and offers performance comparable to training-based methods.

Title: Graph Neural Network-Based Entity Extraction and Relationship Reasoning in Complex Knowledge Graphs

Authors: Junliang Du, Guiran Liu, Jia Gao, Xiaoxuan Liao, Jiacheng Hu, Linxiao Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15195
Pdf URL: https://arxiv.org/pdf/2411.15195
Copy Paste: [[2411.15195]] Graph Neural Network-Based Entity Extraction and Relationship Reasoning in Complex Knowledge Graphs(https://arxiv.org/abs/2411.15195)
Keywords: extraction
Abstract: This study proposed a knowledge graph entity extraction and relationship reasoning algorithm based on a graph neural network, using a graph convolutional network and graph attention network to model the complex structure in the knowledge graph. By building an end-to-end joint model, this paper achieves efficient recognition and reasoning of entities and relationships. In the experiment, this paper compared the model with a variety of deep learning algorithms and verified its superiority through indicators such as AUC, recall rate, precision rate, and F1 value. The experimental results show that the model proposed in this paper performs well in all indicators, especially in complex knowledge graphs, it has stronger generalization ability and stability. This provides strong support for further research on knowledge graphs and also demonstrates the application potential of graph neural networks in entity extraction and relationship reasoning.

Title: Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation

Authors: Yucheng Xing, Xiaodong Liu, Xin Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15199
Pdf URL: https://arxiv.org/pdf/2411.15199
Copy Paste: [[2411.15199]] Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation(https://arxiv.org/abs/2411.15199)
Keywords: diffusion, generative
Abstract: With the development of artificial intelligence, more and more attention has been put onto generative models, which represent the creativity, a very important aspect of intelligence. In recent years, diffusion models have been studied and proven to be more reasonable and effective than previous methods. However, common diffusion frameworks suffer from controllability problems. Although extra conditions have been considered by some work to guide the diffusion process for a specific target generation, it only controls the generation result but not its process. In this work, we propose a new adaptive framework, $\textit{Adaptively Controllable Diffusion (AC-Diff) Model}$, to automatically and fully control the generation process, including not only the type of generation result but also the length and parameters of the generation process. Both inputs and conditions will be first fed into a $\textit{Conditional Time-Step (CTS) Module}$ to determine the number of steps needed for a generation. Then according to the length of the process, the diffusion rate parameters will be estimated through our $\textit{Adaptive Hybrid Noise Schedule (AHNS) Module}$. We further train the network with the corresponding adaptive sampling mechanism to learn how to adjust itself according to the conditions for the overall performance improvement. To enable its practical applications, AC-Diff is expected to largely reduce the average number of generation steps and execution time while maintaining the same performance as done in the literature diffusion models.

Title: Deep Learning-Based Classification of Hyperkinetic Movement Disorders in Children

Authors: Nandika Ramamurthy, Dr Daniel Lumsden, Dr Rachel Sparks
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15200
Pdf URL: https://arxiv.org/pdf/2411.15200
Copy Paste: [[2411.15200]] Deep Learning-Based Classification of Hyperkinetic Movement Disorders in Children(https://arxiv.org/abs/2411.15200)
Keywords: interpretability
Abstract: Hyperkinetic movement disorders (HMDs) in children, including dystonia (abnormal twisting) and chorea (irregular, random movements), pose significant diagnostic challenges due to overlapping clinical features. The prevalence of dystonia ranges from 2 to 50 per million, and chorea from 5 to 10 per 100,000. These conditions are often diagnosed with delays averaging 4.75 to 7.83 years. Traditional diagnostic methods depend on clinical history and expert physical examinations, but specialized tests are ineffective due to the complex pathophysiology of these disorders. This study develops a neural network model to differentiate between dystonia and chorea from video recordings of paediatric patients performing motor tasks. The model integrates a Graph Convolutional Network (GCN) to capture spatial relationships and Long Short-Term Memory (LSTM) networks to account for temporal dynamics. Attention mechanisms were incorporated to improve model interpretability. The model was trained and validated on a dataset of 50 videos (31 chorea-predominant, 19 dystonia-predominant) collected under regulatory approval from Guy's and St Thomas' NHS Foundation Trust. The model achieved 85% accuracy, 81% sensitivity, and 88% specificity at 15 frames per second. Attention maps highlighted the model's ability to correctly identify involuntary movement patterns, with misclassifications often due to occluded body parts or subtle movement variations. This work demonstrates the potential of deep learning to improve the accuracy and efficiency of HMD diagnosis and could contribute to more reliable, interpretable clinical tools.

Title: Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking

Authors: Harsha Vardhan Khurdula, Basem Rizk, Indus Khaitan, Janit Anjaria, Aviral Srivastava, Rajvardhan Khaitan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15201
Pdf URL: https://arxiv.org/pdf/2411.15201
Copy Paste: [[2411.15201]] Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking(https://arxiv.org/abs/2411.15201)
Keywords: robust
Abstract: Current benchmarks for evaluating Vision Language Models (VLMs) often fall short in thoroughly assessing model abilities to understand and process complex visual and textual content. They typically focus on simple tasks that do not require deep reasoning or the integration of multiple data modalities to solve an original problem. To address this gap, we introduce the PARROT-360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles designed to test VLMs on complex visual reasoning tasks. We evaluated leading models: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro, using PARROT-360V to assess their capabilities in combining visual clues with language skills to solve tasks in a manner akin to human problem-solving. Our findings reveal a notable performance gap: state-of-the-art models scored between 28 to 56 percentage on our benchmark, significantly lower than their performance on popular benchmarks. This underscores the limitations of current VLMs in handling complex, multi-step reasoning tasks and highlights the need for more robust evaluation frameworks to advance the field.

Title: Multimodal large language model for wheat breeding: a new exploration of smart breeding

Authors: Guofeng Yang, Yu Li, Yong He, Zhenjiang Zhou, Lingzhen Ye, Hui Fang, Yiqi Luo, Xuping Feng
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.15203
Pdf URL: https://arxiv.org/pdf/2411.15203
Copy Paste: [[2411.15203]] Multimodal large language model for wheat breeding: a new exploration of smart breeding(https://arxiv.org/abs/2411.15203)
Keywords: large language model
Abstract: UAV remote sensing technology has become a key technology in crop breeding, which can achieve high-throughput and non-destructive collection of crop phenotyping data. However, the multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining. Therefore, it is important to develop a smart breeding goal tool to mine cross-domain multimodal data. Based on different pre-trained open-source multimodal large language models (MLLMs) (e.g., Qwen-VL, InternVL, Deepseek-VL), this study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs, thereby constructing multiple multimodal large language models for wheat breeding (WBLMs). The above WBLMs were evaluated using the newly created evaluation benchmark in this study. The results showed that the WBLM constructed using SFT, RAG and RLHF technologies and InternVL2-8B has leading performance. Then, subsequent experiments were conducted using the WBLM. Ablation experiments indicated that the combination of SFT, RAG, and RLHF technologies can improve the overall generation performance, enhance the generated quality, balance the timeliness and adaptability of the generated answer, and reduce hallucinations and biases. The WBLM performed best in wheat yield prediction using cross-domain data (remote sensing, phenotyping, weather, germplasm) simultaneously, with R2 and RMSE of 0.821 and 489.254 kg/ha, respectively. Furthermore, the WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.

Title: Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training

Authors: Ameera Bawazir, Kebin Wu, Wenbin Li
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15207
Pdf URL: https://arxiv.org/pdf/2411.15207
Copy Paste: [[2411.15207]] Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training(https://arxiv.org/abs/2411.15207)
Keywords: privacy
Abstract: Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal data is often costly and challenging due to privacy, sensitivity, and annotation complexity. To mitigate data scarcity while boosting model performance, we introduce \textbf{Uni-Mlip}, a unified self-supervision framework specifically designed to enhance medical vision-language pre-training. Uni-Mlip seamlessly integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data-level and the feature-level. Additionally, Uni-Mlip tailors uni-modal image self-supervision to accommodate the unique characteristics of medical images. Our experiments across datasets of varying scales demonstrate that Uni-Mlip significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).

Title: Quantized symbolic time series approximation

Authors: Erin Carson, Xinye Chen, Cheng Kang
Subjects: cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15209
Pdf URL: https://arxiv.org/pdf/2411.15209
Copy Paste: [[2411.15209]] Quantized symbolic time series approximation(https://arxiv.org/abs/2411.15209)
Keywords: large language model
Abstract: Time series are ubiquitous in numerous science and engineering domains, e.g., signal processing, bioinformatics, and astronomy. Previous work has verified the efficacy of symbolic time series representation in a variety of engineering applications due to its storage efficiency and numerosity reduction. The most recent symbolic aggregate approximation technique, ABBA, has been shown to preserve essential shape information of time series and improve downstream applications, e.g., neural network inference regarding prediction and anomaly detection in time series. Motivated by the emergence of high-performance hardware which enables efficient computation for low bit-width representations, we present a new quantization-based ABBA symbolic approximation technique, QABBA, which exhibits improved storage efficiency while retaining the original speed and accuracy of symbolic reconstruction. We prove an upper bound for the error arising from quantization and discuss how the number of bits should be chosen to balance this with other errors. An application of QABBA with large language models (LLMs) for time series regression is also presented, and its utility is investigated. By representing the symbolic chain of patterns on time series, QABBA not only avoids the training of embedding from scratch, but also achieves a new state-of-the-art on Monash regression dataset. The symbolic approximation to the time series offers a more efficient way to fine-tune LLMs on the time series regression task which contains various application domains. We further present a set of extensive experiments performed across various well-established datasets to demonstrate the advantages of the QABBA method for symbolic approximation.

Title: Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Authors: Yong Xie, Weijie Zheng, Hanxun Huang, Guangnan Ye, Xingjun Ma
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15210
Pdf URL: https://arxiv.org/pdf/2411.15210
Copy Paste: [[2411.15210]] Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks(https://arxiv.org/abs/2411.15210)
Keywords: attack, robust
Abstract: As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.

Title: LightLLM: A Versatile Large Language Model for Predictive Light Sensing

Authors: Jiawei Hu, Hong Jia, Mahbub Hassan, Lina Yao, Brano Kusy, Wen Hu
Subjects: cs.LG, cs.AI, cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2411.15211
Pdf URL: https://arxiv.org/pdf/2411.15211
Copy Paste: [[2411.15211]] LightLLM: A Versatile Large Language Model for Predictive Light Sensing(https://arxiv.org/abs/2411.15211)
Keywords: large language model
Abstract: We propose LightLLM, a model that fine tunes pre-trained large language models (LLMs) for light-based sensing tasks. It integrates a sensor data encoder to extract key features, a contextual prompt to provide environmental information, and a fusion layer to combine these inputs into a unified representation. This combined input is then processed by the pre-trained LLM, which remains frozen while being fine-tuned through the addition of lightweight, trainable components, allowing the model to adapt to new tasks without altering its original parameters. This approach enables flexible adaptation of LLM to specialized light sensing tasks with minimal computational overhead and retraining effort. We have implemented LightLLM for three light sensing tasks: light-based localization, outdoor solar forecasting, and indoor solar estimation. Using real-world experimental datasets, we demonstrate that LightLLM significantly outperforms state-of-the-art methods, achieving 4.4x improvement in localization accuracy and 3.4x improvement in indoor solar estimation when tested in previously unseen environments. We further demonstrate that LightLLM outperforms ChatGPT-4 with direct prompting, highlighting the advantages of LightLLM's specialized architecture for sensor data fusion with textual prompts.

Title: Image Harmonization using Robust Restricted CDF Matching

Authors: Roman Stoklasa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15213
Pdf URL: https://arxiv.org/pdf/2411.15213
Copy Paste: [[2411.15213]] Image Harmonization using Robust Restricted CDF Matching(https://arxiv.org/abs/2411.15213)
Keywords: robust
Abstract: Deployment of machine learning algorithms into real-world practice is still a difficult task. One of the challenges lies in the unpredictable variability of input data, which may differ significantly among individual users, institutions, scanners, etc. The input data variability can be decreased by using suitable data preprocessing with robust data harmonization. In this paper, we present a method of image harmonization using Cumulative Distribution Function (CDF) matching based on curve fitting. This approach does not ruin local variability and individual important features. The transformation of image intensities is non-linear but still ``smooth and elastic", as compared to other known histogram matching algorithms. Non-linear transformation allows for a very good match to the template. At the same time, elasticity constraints help to preserve local variability among individual inputs, which may encode important features for subsequent machine-learning processing. The pre-defined template CDF offers a better and more intuitive control for the input data transformation compared to other methods, especially ML-based ones. Even though we demonstrate our method for MRI images, the method is generic enough to apply to other types of imaging data.

Title: Urban Region Embeddings from Service-Specific Mobile Traffic Data

Authors: Giulio Loddi, Chiara Pugliese, Francesco Lettich, Fabio Pinelli, Chiara Renso
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2411.15214
Pdf URL: https://arxiv.org/pdf/2411.15214
Copy Paste: [[2411.15214]] Urban Region Embeddings from Service-Specific Mobile Traffic Data(https://arxiv.org/abs/2411.15214)
Keywords: transformer
Abstract: With the advent of advanced 4G/5G mobile networks, mobile phone data collected by operators now includes detailed, service-specific traffic information with high spatio-temporal resolution. In this paper, we leverage this type of data to explore its potential for generating high-quality representations of urban regions. To achieve this, we present a methodology for creating urban region embeddings from service-specific mobile traffic data, employing a temporal convolutional network-based autoencoder, transformers, and learnable weighted sum models to capture key urban features. In the extensive experimental evaluation conducted using a real-world dataset, we demonstrate that the embeddings generated by our methodology effectively capture urban characteristics. Specifically, our embeddings are compared against those of a state-of-the-art competitor across two downstream tasks. Additionally, through clustering techniques, we investigate how well the embeddings produced by our methodology capture the temporal dynamics and characteristics of the underlying urban regions. Overall, this work highlights the potential of service-specific mobile traffic data for urban research and emphasizes the importance of making such data accessible to support public innovation.

Title: S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Authors: Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.15215
Pdf URL: https://arxiv.org/pdf/2411.15215
Copy Paste: [[2411.15215]] S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning(https://arxiv.org/abs/2411.15215)
Keywords: large language model
Abstract: Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

Title: Sampling with Adaptive Variance for Multimodal Distributions

Authors: Björn Engquist, Kui Ren, Yunan Yang
Subjects: cs.LG, math.NA, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15220
Pdf URL: https://arxiv.org/pdf/2411.15220
Copy Paste: [[2411.15220]] Sampling with Adaptive Variance for Multimodal Distributions(https://arxiv.org/abs/2411.15220)
Keywords: diffusion
Abstract: We propose and analyze a class of adaptive sampling algorithms for multimodal distributions on a bounded domain, which share a structural resemblance to the classic overdamped Langevin dynamics. We first demonstrate that this class of linear dynamics with adaptive diffusion coefficients and vector fields can be interpreted and analyzed as weighted Wasserstein gradient flows of the Kullback--Leibler (KL) divergence between the current distribution and the target Gibbs distribution, which directly leads to the exponential convergence of both the KL and $\chi^2$ divergences, with rates depending on the weighted Wasserstein metric and the Gibbs potential. We then show that a derivative-free version of the dynamics can be used for sampling without gradient information of the Gibbs potential and that for Gibbs distributions with nonconvex potentials, this approach could achieve significantly faster convergence than the classical overdamped Langevin dynamics. A comparison of the mean transition times between local minima of a nonconvex potential further highlights the better efficiency of the derivative-free dynamics in sampling.

Title: Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Authors: Yoel Zimmermann, Adib Bazgir, Zartashia Afzal, Fariha Agbere, Qianxiang Ai, Nawaf Alampara, Alexander Al-Feghali, Mehrad Ansari, Dmytro Antypov, Amro Aswad, Jiaru Bai, Viktoriia Baibakova, Devi Dutta Biswajeet, Erik Bitzek, Joshua D. Bocarsly, Anna Borisova, Andres M Bran, L. Catherine Brinson, Marcel Moran Calderon, Alessandro Canalicchio, Victor Chen, Yuan Chiang, Defne Circi, Benjamin Charmes, Vikrant Chaudhary, Zizhang Chen, Min-Hsueh Chiu, Judith Clymo, Kedar Dabhadkar, Nathan Daelman, Archit Datar, Matthew L. Evans, Maryam Ghazizade Fard, Giuseppe Fisicaro, Abhijeet Sadashiv Gangan, Janine George, Jose D. Cojal Gonzalez, Michael Götte, Ankur K. Gupta, Hassan Harb, Pengyu Hong, Abdelrahman Ibrahim, Ahmed Ilyas, Alishba Imran, Kevin Ishimwe, Ramsey Issa, Kevin Maik Jablonka, Colin Jones, Tyler R. Josephson, Greg Juhasz, Sarthak Kapoor, Rongda Kang, Ghazal Khalighinejad, Sartaaj Khan, Sascha Klawohn, Suneel Kuman, Alvin Noe Ladines, Sarom Leang, Magdalena Lederbauer, Sheng-Lun Mark Liao, Hao Liu, Xuefeng Liu, Stanley Lo, Sandeep Madireddy, Piyush Ranjan Maharana, Shagun Maheshwari, Soroush Mahjoubi, José A. Márquez, Rob Mills, Trupti Mohanty, Bernadette Mohr, Seyed Mohamad Moosavi, Alexander Moßhammer, Amirhossein D. Naghdi, Aakash Naik, Oleksandr Narykov, Hampus Näsström, Xuan Vu Nguyen, Xinyi Ni, Dana O'Connor, Teslim Olayiwola, Federico Ottomano, Aleyna Beste Ozhan, Sebastian Pagel, Chiku Parida, Jaehee Park, Vraj Patel, Elena Patyukova, Martin Hoffmann Petersen, Luis Pinto, José M. Pizarro, Dieter Plessers, Tapashree Pradhan, Utkarsh Pratiush, Charishma Puli, Andrew Qin, Mahyar Rajabi, Francesco Ricci, Elliot Risch, Martiño Ríos-García
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2411.15221
Pdf URL: https://arxiv.org/pdf/2411.15221
Copy Paste: [[2411.15221]] Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry(https://arxiv.org/abs/2411.15221)
Keywords: extraction, large language model
Abstract: Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.

Title: Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation

Authors: Ke Zhao (1), Huayang Huang (1), Miao Li (1), Yu Wu (1) ((1) Wuhan University)
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2411.15222
Pdf URL: https://arxiv.org/pdf/2411.15222
Copy Paste: [[2411.15222]] Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation(https://arxiv.org/abs/2411.15222)
Keywords: security, attack, robust
Abstract: Language-conditioned robotic learning has significantly enhanced robot adaptability by enabling a single model to execute diverse tasks in response to verbal commands. Despite these advancements, security vulnerabilities within this domain remain largely unexplored. This paper addresses this gap by proposing a novel adversarial prompt attack tailored to language-conditioned robotic models. Our approach involves crafting a universal adversarial prefix that induces the model to perform unintended actions when added to any original prompt. We demonstrate that existing adversarial techniques exhibit limited effectiveness when directly transferred to the robotic domain due to the inherent robustness of discretized robotic action spaces. To overcome this challenge, we propose to optimize adversarial prefixes based on continuous action representations, circumventing the discretization process. Additionally, we identify the beneficial impact of intermediate features on adversarial attacks and leverage the negative gradient of intermediate self-attention features to further enhance attack efficacy. Extensive experiments on VIMA models across 13 robot manipulation tasks validate the superiority of our method over existing approaches and demonstrate its transferability across different model variants.

Title: Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation

Authors: Seokil Ham, Hee-Seon Kim, Sangmin Woo, Changick Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15224
Pdf URL: https://arxiv.org/pdf/2411.15224
Copy Paste: [[2411.15224]] Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation(https://arxiv.org/abs/2411.15224)
Keywords: transformer
Abstract: Despite the growing interest in Mamba architecture as a potential replacement for Transformer architecture, parameter-efficient fine-tuning (PEFT) approaches for Mamba remain largely unexplored. In our study, we introduce two key insights-driven strategies for PEFT in Mamba architecture: (1) While state-space models (SSMs) have been regarded as the cornerstone of Mamba architecture, then expected to play a primary role in transfer learning, our findings reveal that Projectors -- not SSMs -- are the predominant contributors to transfer learning, and (2) Based on our observation that adapting pretrained Projectors to new tasks can be effectively approximated through a near-diagonal linear transformation, we propose a novel PEFT method specialized to Mamba architecture: Projector-targeted Diagonal-centric Linear Transformation (ProDiaL). ProDiaL focuses on optimizing only diagonal-centric linear transformation matrices, without directly fine-tuning the pretrained Projector weights. This targeted approach allows efficient task adaptation, utilizing less than 1% of the total parameters, and exhibits strong performance across both vision and language Mamba models, highlighting its versatility and effectiveness.

Title: IterIS: Iterative Inference-Solving Alignment for LoRA Merging

Authors: Hongxu Chen, Runshi Li, Bowei Zhu, Zhen Wang, Long Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15231
Pdf URL: https://arxiv.org/pdf/2411.15231
Copy Paste: [[2411.15231]] IterIS: Iterative Inference-Solving Alignment for LoRA Merging(https://arxiv.org/abs/2411.15231)
Keywords: privacy, diffusion, large language model
Abstract: Low-rank adaptations (LoRA) are widely used to fine-tune large models across various domains for specific downstream tasks. While task-specific LoRAs are often available, concerns about data privacy and intellectual property can restrict access to training data, limiting the acquisition of a multi-task model through gradient-based training. In response, LoRA merging presents an effective solution by combining multiple LoRAs into a unified adapter while maintaining data privacy. Prior works on LoRA merging primarily frame it as an optimization problem, yet these approaches face several limitations, including the rough assumption about input features utilized in optimization, massive sample requirements, and the unbalanced optimization objective. These limitations can significantly degrade performance. To address these, we propose a novel optimization-based method, named IterIS: 1) We formulate LoRA merging as an advanced optimization problem to mitigate the rough assumption. Additionally, we employ an iterative inference-solving framework in our algorithm. It can progressively refine the optimization objective for improved performance. 2) We introduce an efficient regularization term to reduce the need for massive sample requirements (requiring only 1-5% of the unlabeled samples compared to prior methods). 3) We utilize adaptive weights in the optimization objective to mitigate potential unbalances in LoRA merging process. Our method demonstrates significant improvements over multiple baselines and state-of-the-art methods in composing tasks for text-to-image diffusion, vision-language models, and large language models. Furthermore, our layer-wise algorithm can achieve convergence with minimal steps, ensuring efficiency in both memory and computation.

Title: BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Authors: Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.15232
Pdf URL: https://arxiv.org/pdf/2411.15232
Copy Paste: [[2411.15232]] BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models(https://arxiv.org/abs/2411.15232)
Keywords: large language model
Abstract: Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp) intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available at this https URL.

Title: Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Authors: Jeeyung Kim, Erfan Esmaeili, Qiang Qiu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15236
Pdf URL: https://arxiv.org/pdf/2411.15236
Copy Paste: [[2411.15236]] Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps(https://arxiv.org/abs/2411.15236)
Keywords: diffusion
Abstract: In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt, "a black car and a white clock", the cross-attention maps for "black" and "car" should focus on overlapping regions to depict a black car, while "car" and "clock" should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes the key observations investigating this issue in the existing text-to-image models:(1) the similarity in text embeddings between different tokens -- used as conditioning inputs -- can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already within text attention maps. As a result, such syntactic relationships can be overlooked in cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.

Title: Stain-Invariant Representation for Tissue Classification in Histology Images

Authors: Manahil Raza, Saad Bashir, Talha Qaiser, Nasir Rajpoot
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15237
Pdf URL: https://arxiv.org/pdf/2411.15237
Copy Paste: [[2411.15237]] Stain-Invariant Representation for Tissue Classification in Histology Images(https://arxiv.org/abs/2411.15237)
Keywords: robust
Abstract: The process of digitising histology slides involves multiple factors that can affect a whole slide image's (WSI) final appearance, including the staining protocol, scanner, and tissue type. This variability constitutes a domain shift and results in significant problems when training and testing deep learning (DL) algorithms in multi-cohort settings. As such, developing robust and generalisable DL models in computational pathology (CPath) remains an open challenge. In this regard, we propose a framework that generates stain-augmented versions of the training images using stain matrix perturbation. Thereafter, we employed a stain regularisation loss to enforce consistency between the feature representations of the source and augmented images. Doing so encourages the model to learn stain-invariant and, consequently, domain-invariant feature representations. We evaluate the performance of the proposed model on cross-domain multi-class tissue type classification of colorectal cancer images and have achieved improved performance compared to other state-of-the-art methods.

Title: Faithful Label-free Knowledge Distillation

Authors: Evelyn J. Mannix, Liam Hodgkinson, Howard Bondell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15239
Pdf URL: https://arxiv.org/pdf/2411.15239
Copy Paste: [[2411.15239]] Faithful Label-free Knowledge Distillation(https://arxiv.org/abs/2411.15239)
Keywords: robust
Abstract: Knowledge distillation approaches are model compression techniques, with the goal of training a highly performant student model by using a teacher network that is larger or contains a different inductive bias. These approaches are particularly useful when applied to large computer vision foundation models, which can be compressed into smaller variants that retain desirable properties such as improved robustness. This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM), which improves on previous methods by learning an approximately orthogonal mapping from the latent space of the teacher to the student network. This produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection. It is further shown that knowledge distillation with TinTeM on task specific datasets leads to more accurate models with greater generalisability and OOD detection performance, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets.

Title: Is Attention All You Need For Actigraphy? Foundation Models of Wearable Accelerometer Data for Mental Health Research

Authors: Franklin Y. Ruan, Aiwei Zhang, Jenny Y. Oh, SouYoung Jin, Nicholas C Jacobson
Subjects: cs.LG, cs.AI, cs.HC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.15240
Pdf URL: https://arxiv.org/pdf/2411.15240
Copy Paste: [[2411.15240]] Is Attention All You Need For Actigraphy? Foundation Models of Wearable Accelerometer Data for Mental Health Research(https://arxiv.org/abs/2411.15240)
Keywords: robust, explainability, transformer
Abstract: Wearable accelerometry (actigraphy) has provided valuable data for clinical insights since the 1970s and is increasingly important as wearable devices continue to become widespread. The effectiveness of actigraphy in research and clinical contexts is heavily dependent on the modeling architecture utilized. To address this, we developed the Pretrained Actigraphy Transformer (PAT)--the first pretrained and fully attention-based model designed specifically to handle actigraphy. PAT was pretrained on actigraphy from 29,307 participants in NHANES, enabling it to deliver state-of-the-art performance when fine-tuned across various actigraphy prediction tasks in the mental health domain, even in data-limited scenarios. For example, when trained to predict benzodiazepine usage using actigraphy from only 500 labeled participants, PAT achieved an 8.8 percentage-point AUC improvement over the best baseline. With fewer than 2 million parameters and built-in model explainability, PAT is robust yet easy to deploy in health research settings. GitHub: this https URL

Title: The Zamba2 Suite: Technical Report

Authors: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.15242
Pdf URL: https://arxiv.org/pdf/2411.15242
Copy Paste: [[2411.15242]] The Zamba2 Suite: Technical Report(https://arxiv.org/abs/2411.15242)
Keywords: transformer
Abstract: In this technical report, we present the Zamba2 series -- a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state of the art performance against the leading open-weights models of their class, while achieving substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its architecture, training and annealing datasets, and training for up to three trillion tokens. We provide open-source weights for all models of the Zamba2 series as well as instruction-tuned variants that are strongly competitive against comparable instruct-tuned models of their class. We additionally open-source the pretraining dataset, which we call Zyda-2, used to train the Zamba2 series of models. The models and datasets used in this work are openly available at this https URL

Title: Adversarial Prompt Distillation for Vision-Language Models

Authors: Lin Luo, Xin Wang, Bojia Zi, Shihao Zhao, Xingjun Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15244
Pdf URL: https://arxiv.org/pdf/2411.15244
Copy Paste: [[2411.15244]] Adversarial Prompt Distillation for Vision-Language Models(https://arxiv.org/abs/2411.15244)
Keywords: attack, robust
Abstract: Large pre-trained Vision-Language Models (VLMs) such as Contrastive Language-Image Pre-Training (CLIP) have been shown to be susceptible to adversarial attacks, raising concerns about their deployment in safety-critical scenarios like autonomous driving and medical diagnosis. One promising approach for improving the robustness of pre-trained VLMs is Adversarial Prompt Tuning (APT), which combines adversarial training with prompt tuning. However, existing APT methods are mostly single-modal methods that design prompt(s) for only the visual or textual modality, limiting their effectiveness in either robustness or clean accuracy. In this work, we propose a novel method called Adversarial Prompt Distillation (APD) that combines APT with knowledge distillation to boost the adversarial robustness of CLIP. Specifically, APD is a bimodal method that adds prompts for both the visual and textual modalities while leveraging a cleanly pre-trained teacher CLIP model to distill and boost the performance of the student CLIP model on downstream tasks. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our APD over the current state-of-the-art APT methods in terms of both natural and adversarial performances. The effectiveness of our APD method validates the possibility of using a non-robust teacher to improve the generalization and robustness of VLMs.

Title: Exploring the Robustness and Transferability of Patch-Based Adversarial Attacks in Quantized Neural Networks

Authors: Amira Guesmi, Bassem Ouni, Muhammad Shafique
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.15246
Pdf URL: https://arxiv.org/pdf/2411.15246
Copy Paste: [[2411.15246]] Exploring the Robustness and Transferability of Patch-Based Adversarial Attacks in Quantized Neural Networks(https://arxiv.org/abs/2411.15246)
Keywords: secure, security, defense, attack, robust
Abstract: Quantized neural networks (QNNs) are increasingly used for efficient deployment of deep learning models on resource-constrained platforms, such as mobile devices and edge computing systems. While quantization reduces model size and computational demands, its impact on adversarial robustness-especially against patch-based attacks-remains inadequately addressed. Patch-based attacks, characterized by localized, high-visibility perturbations, pose significant security risks due to their transferability and resilience. In this study, we systematically evaluate the vulnerability of QNNs to patch-based adversarial attacks across various quantization levels and architectures, focusing on factors that contribute to the robustness of these attacks. Through experiments analyzing feature representations, quantization strength, gradient alignment, and spatial sensitivity, we find that patch attacks consistently achieve high success rates across bitwidths and architectures, demonstrating significant transferability even in heavily quantized models. Contrary to the expectation that quantization might enhance adversarial defenses, our results show that QNNs remain highly susceptible to patch attacks due to the persistence of distinct, localized features within quantized representations. These findings underscore the need for quantization-aware defenses that address the specific challenges posed by patch-based attacks. Our work contributes to a deeper understanding of adversarial robustness in QNNs and aims to guide future research in developing secure, quantization-compatible defenses for real-world applications.

Title: Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Authors: Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15247
Pdf URL: https://arxiv.org/pdf/2411.15247
Copy Paste: [[2411.15247]] Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward(https://arxiv.org/abs/2411.15247)
Keywords: diffusion
Abstract: Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation $\le2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at this https URL.

Title: LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation

Authors: Fan Deng, Yaguang Wu, Xinyang Yu, Xiangjun Huang, Jian Yang, Guangyu Yan, Qiang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15252
Pdf URL: https://arxiv.org/pdf/2411.15252
Copy Paste: [[2411.15252]] LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation(https://arxiv.org/abs/2411.15252)
Keywords: diffusion
Abstract: Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.

Title: The Explabox: Model-Agnostic Machine Learning Transparency & Analysis

Authors: Marcel Robeer, Michiel Bron, Elize Herrewijnen, Riwish Hoeseni, Floris Bex
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2411.15257
Pdf URL: https://arxiv.org/pdf/2411.15257
Copy Paste: [[2411.15257]] The Explabox: Model-Agnostic Machine Learning Transparency & Analysis(https://arxiv.org/abs/2411.15257)
Keywords: security, robust, fair, explainability
Abstract: We present the Explabox: an open-source toolkit for transparent and responsible machine learning (ML) model development and usage. Explabox aids in achieving explainable, fair and robust models by employing a four-step strategy: explore, examine, explain and expose. These steps offer model-agnostic analyses that transform complex 'ingestibles' (models and data) into interpretable 'digestibles'. The toolkit encompasses digestibles for descriptive statistics, performance metrics, model behavior explanations (local and global), and robustness, security, and fairness assessments. Implemented in Python, Explabox supports multiple interaction modes and builds on open-source packages. It empowers model developers and testers to operationalize explainability, fairness, auditability, and security. The initial release focuses on text data and models, with plans for expansion. Explabox's code and documentation are available open-source at this https URL.

Title: VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

Authors: Jiahao Hu, Tianxiong Zhong, Xuebo Wang, Boyuan Jiang, Xingye Tian, Fei Yang, Pengfei Wan, Di Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15260
Pdf URL: https://arxiv.org/pdf/2411.15260
Copy Paste: [[2411.15260]] VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing(https://arxiv.org/abs/2411.15260)
Keywords: diffusion
Abstract: Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset VIVID-10M and a baseline model VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs, which comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism is proposed, enabling users to iteratively edit keyframes and propagate it to other frames, thereby reducing latency in achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be available at \url{this https URL}.

Title: MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Authors: Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15262
Pdf URL: https://arxiv.org/pdf/2411.15262
Copy Paste: [[2411.15262]] MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation(https://arxiv.org/abs/2411.15262)
Keywords: diffusion
Abstract: Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) hierarchical data structure contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: this https URL.

Title: Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

Authors: Won Jun Kim, Hyungjin Chung, Jaemin Kim, Sangmin Lee, Byeongsu Sim, Jong Chul Ye
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15265
Pdf URL: https://arxiv.org/pdf/2411.15265
Copy Paste: [[2411.15265]] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI(https://arxiv.org/abs/2411.15265)
Keywords: attack, explainability, diffusion
Abstract: Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrainted Gradients (FreeMCG), a novel method that serves as an improved basis for explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, counterfactual explanation and feature attribution, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.

Title: BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques

Authors: Muhammad Rafsan Kabir, Md. Mohibur Rahman Nabil, Mohammad Ashrafuzzaman Khan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15270
Pdf URL: https://arxiv.org/pdf/2411.15270
Copy Paste: [[2411.15270]] BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques(https://arxiv.org/abs/2411.15270)
Keywords: transformer
Abstract: Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.

Title: EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration

Authors: Linrui Gong, Jiuming Liu, Junyi Ma, Lihao Liu, Yaonan Wang, Hesheng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15271
Pdf URL: https://arxiv.org/pdf/2411.15271
Copy Paste: [[2411.15271]] EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration(https://arxiv.org/abs/2411.15271)
Keywords: robust, diffusion
Abstract: Diffusion models have shown the great potential in the point cloud registration (PCR) task, especially for enhancing the robustness to challenging cases. However, existing diffusion-based PCR methods primarily focus on instance-level scenarios and struggle with outdoor LiDAR points, where the sparsity, irregularity, and huge point scale inherent in LiDAR points pose challenges to establishing dense global point-to-point correspondences. To address this issue, we propose a novel framework named EADReg for efficient and robust registration of LiDAR point clouds based on autoregressive diffusion models. EADReg follows a coarse-to-fine registration paradigm. In the coarse stage, we employ a Bi-directional Gaussian Mixture Model (BGMM) to reject outlier points and obtain purified point cloud pairs. BGMM establishes correspondences between the Gaussian Mixture Models (GMMs) from the source and target frames, enabling reliable coarse registration based on filtered features and geometric information. In the fine stage, we treat diffusion-based PCR as an autoregressive process to generate robust point correspondences, which are then iteratively refined on upper layers. Despite common criticisms of diffusion-based methods regarding inference speed, EADReg achieves runtime comparable to convolutional-based methods. Extensive experiments on the KITTI and NuScenes benchmark datasets highlight the state-of-the-art performance of our proposed method. Codes will be released upon publication.

Title: Curriculum-enhanced GroupDRO: Challenging the Norm of Avoiding Curriculum Learning in Subpopulation Shift Setups

Authors: Antonio Barbalau
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15272
Pdf URL: https://arxiv.org/pdf/2411.15272
Copy Paste: [[2411.15272]] Curriculum-enhanced GroupDRO: Challenging the Norm of Avoiding Curriculum Learning in Subpopulation Shift Setups(https://arxiv.org/abs/2411.15272)
Keywords: robust
Abstract: In subpopulation shift scenarios, a Curriculum Learning (CL) approach would only serve to imprint the model weights, early on, with the easily learnable spurious correlations featured. To the best of our knowledge, none of the current state-of-the-art subpopulation shift approaches employ any kind of curriculum. To overcome this, we design a CL approach aimed at initializing the model weights in an unbiased vantage point in the hypothesis space which sabotages easy convergence towards biased hypotheses during the final optimization based on the entirety of the available data. We hereby propose a Curriculum-enhanced Group Distributionally Robust Optimization (CeGDRO) approach, which prioritizes the hardest bias-confirming samples and the easiest bias-conflicting samples, leveraging GroupDRO to balance the initial discrepancy in terms of difficulty. We benchmark our proposed method against the most popular subpopulation shift datasets, showing an increase over the state-of-the-art results across all scenarios, up to 6.2% on Waterbirds.

Title: ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Authors: Junzhang Liu, Tingkai Liu, Yueyuan Sui, Stephen Xia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15281
Pdf URL: https://arxiv.org/pdf/2411.15281
Copy Paste: [[2411.15281]] ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation(https://arxiv.org/abs/2411.15281)
Keywords: robust, transformer
Abstract: We introduce ElastiFormer, a post-training technique that adapts pretrained Transformer models into an elastic counterpart with variable inference time compute. ElastiFormer introduces small routing modules (as low as .00006% additional trainable parameters) to dynamically selects subsets of network parameters and input tokens to be processed by each layer of the pretrained network in an inputdependent manner. The routing modules are trained using self-distillation losses to minimize the differences between the output of the pretrained-model and their elastic counterparts. As ElastiFormer makes no assumption regarding the modality of the pretrained Transformer model, it can be readily applied to all modalities covering causal language modeling, image modeling as well as visual-language modeling tasks. We show that 20% to 50% compute saving could be achieved for different components of the transformer layer, which could be further reduced by adding very low rank LoRA weights (rank 1) trained via the same distillation objective. Finally, by comparing routing trained on different subsets of ImageNet, we show that ElastiFormer is robust against the training domain.

Title: When Spatial meets Temporal in Action Recognition

Authors: Huilin Chen, Lei Wang, Yifan Chen, Tom Gedeon, Piotr Koniusz
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15284
Pdf URL: https://arxiv.org/pdf/2411.15284
Copy Paste: [[2411.15284]] When Spatial meets Temporal in Action Recognition(https://arxiv.org/abs/2411.15284)
Keywords: transformer
Abstract: Video action recognition has made significant strides, but challenges remain in effectively using both spatial and temporal information. While existing methods often focus on either spatial features (e.g., object appearance) or temporal dynamics (e.g., motion), they rarely address the need for a comprehensive integration of both. Capturing the rich temporal evolution of video frames, while preserving their spatial details, is crucial for improving accuracy. In this paper, we introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information. The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding $N^2$ temporally evolving frames into a single spatial grid of size $N \times N$. This transformation creates new frames that balance both spatial and temporal information, making them compatible with existing video models. When $N=1$, the layer captures rich spatial details, similar to existing methods. As $N$ increases ($N\geq2$), temporal information becomes more prominent, while the spatial information decreases to ensure compatibility with model inputs. We demonstrate the effectiveness of the TIME layer by integrating it into popular action recognition models, such as ResNet-50, Vision Transformer, and Video Masked Autoencoders, for both RGB and depth video data. Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.

Title: Forecasting Unseen Points of Interest Visits Using Context and Proximity Priors

Authors: Ziyao Li, Shang-Ling Hsu, Cyrus Shahabi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15285
Pdf URL: https://arxiv.org/pdf/2411.15285
Copy Paste: [[2411.15285]] Forecasting Unseen Points of Interest Visits Using Context and Proximity Priors(https://arxiv.org/abs/2411.15285)
Keywords: robust
Abstract: Understanding human mobility behavior is crucial for numerous applications, including crowd management, location-based recommendations, and the estimation of pandemic spread. Machine learning models can predict the Points of Interest (POIs) that individuals are likely to visit in the future by analyzing their historical visit patterns. Previous studies address this problem by learning a POI classifier, where each class corresponds to a POI. However, this limits their applicability to predict a new POI that was not in the training data, such as the opening of new restaurants. To address this challenge, we propose a model designed to predict a new POI outside the training data as long as its context is aligned with the user's interests. Unlike existing approaches that directly predict specific POIs, our model first forecasts the semantic context of potential future POIs, then combines this with a proximity-based prior probability distribution to determine the exact POI. Experimental results on real-world visit data demonstrate that our model outperforms baseline methods that do not account for semantic contexts, achieving a 17% improvement in accuracy. Notably, as new POIs are introduced over time, our model remains robust, exhibiting a lower decline rate in prediction accuracy compared to existing methods.

Title: Sycophancy in Large Language Models: Causes and Mitigations

Authors: Lars Malmqvist
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15287
Pdf URL: https://arxiv.org/pdf/2411.15287
Copy Paste: [[2411.15287]] Sycophancy in Large Language Models: Causes and Mitigations(https://arxiv.org/abs/2411.15287)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to exhibit sycophantic behavior - excessively agreeing with or flattering users - poses significant risks to their reliability and ethical deployment. This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies. We review recent work on measuring and quantifying sycophantic tendencies, examine the relationship between sycophancy and other challenges like hallucination and bias, and evaluate promising techniques for reducing sycophancy while maintaining model performance. Key approaches explored include improved training data, novel fine-tuning methods, post-deployment control mechanisms, and decoding strategies. We also discuss the broader implications of sycophancy for AI alignment and propose directions for future research. Our analysis suggests that mitigating sycophancy is crucial for developing more robust, reliable, and ethically-aligned language models.

Title: MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Authors: Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.15296
Pdf URL: https://arxiv.org/pdf/2411.15296
Copy Paste: [[2411.15296]] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs(https://arxiv.org/abs/2411.15296)
Keywords: large language model
Abstract: As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmarks types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extented applications; 2) the typical process of benchmark counstruction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.

Title: PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models

Authors: Gerald Friedland, Xin Huang, Yueying Cui, Vishaal Kapoor, Ashish Khetan, Sanjiv Das
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15320
Pdf URL: https://arxiv.org/pdf/2411.15320
Copy Paste: [[2411.15320]] PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models(https://arxiv.org/abs/2411.15320)
Keywords: generative, large language model
Abstract: We propose PPLqa, an easy to compute, language independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enables users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q\&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.

Title: Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Authors: Soumil Datta, Shih-Chieh Dai, Leo Yu, Guanhong Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15367
Pdf URL: https://arxiv.org/pdf/2411.15367
Copy Paste: [[2411.15367]] Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage(https://arxiv.org/abs/2411.15367)
Keywords: privacy, protect, defense, robust, watermark, diffusion, generative
Abstract: Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose \tech{}, that leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.

Title: Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering

Authors: Mostafa Varzaneh, Pooja Voladoddi, Tanmay Bakshi, Uma Gunturi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15372
Pdf URL: https://arxiv.org/pdf/2411.15372
Copy Paste: [[2411.15372]] Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering(https://arxiv.org/abs/2411.15372)
Keywords: robust, transformer
Abstract: Real-time conversational AI agents face challenges in performing Natural Language Understanding (NLU) in dynamic, outdoor environments like automated drive-thru systems. These settings require NLU models to handle background noise, diverse accents, and multi-intent queries while operating under strict latency and memory constraints on edge devices. Additionally, robustness to errors from upstream Automatic Speech Recognition (ASR) is crucial, as ASR outputs in these environments are often noisy. We introduce Babylon, a transformer-based architecture that tackles NLU as an intent translation task, converting natural language inputs into sequences of regular language units ('transcodes') that encode both intents and slot information. This formulation allows Babylon to manage multi-intent scenarios in a single dialogue turn. Furthermore, Babylon incorporates an LSTM-based token pooling mechanism to preprocess phoneme sequences, reducing input length and optimizing for low-latency, low-memory edge deployment. This also helps mitigate inaccuracies in ASR outputs, enhancing system robustness. While this work focuses on drive-thru ordering, Babylon's design extends to similar noise-prone scenarios, for e.g. ticketing kiosks. Our experiments show that Babylon achieves significantly better accuracy-latency-memory footprint trade-offs over typically employed NMT models like Flan-T5 and BART, demonstrating its effectiveness for real-time NLU in edge deployment settings.

Title: On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

Authors: Elita Lobo, Chirag Agarwal, Himabindu Lakkaraju
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15382
Pdf URL: https://arxiv.org/pdf/2411.15382
Copy Paste: [[2411.15382]] On the Impact of Fine-Tuning on Chain-of-Thought Reasoning(https://arxiv.org/abs/2411.15382)
Keywords: privacy, large language model
Abstract: Large language models have emerged as powerful tools for general intelligence, showcasing advanced natural language processing capabilities that find applications across diverse domains. Despite their impressive performance, recent studies have highlighted the potential for significant enhancements in LLMs' task-specific performance through fine-tuning strategies like Reinforcement Learning with Human Feedback (RLHF), supervised fine-tuning (SFT), and Quantized Low-Rank Adapters (Q-LoRA) method. However, previous works have shown that while fine-tuning offers significant performance gains, it also leads to challenges such as catastrophic forgetting and privacy and safety risks. To this end, there has been little to no work in \textit{understanding the impact of fine-tuning on the reasoning capabilities of LLMs}. Our research investigates the effect of fine-tuning on the reasoning abilities of LLMs, addressing critical questions regarding the impact of task-specific fine-tuning on overall reasoning capabilities, the influence of fine-tuning on Chain-of-Thought (CoT) reasoning performance, and the implications for the faithfulness of CoT reasonings. By exploring these dimensions, our study shows the impact of fine-tuning on LLM reasoning capabilities, where the faithfulness of CoT reasoning, on average across four datasets, decreases, highlighting potential shifts in internal mechanisms of the LLMs resulting from fine-tuning processes.

Title: From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

Authors: Mara Finkelstein, Dan Deutsch, Parker Riley, Juraj Juraska, Geza Kovacs, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15387
Pdf URL: https://arxiv.org/pdf/2411.15387
Copy Paste: [[2411.15387]] From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set(https://arxiv.org/abs/2411.15387)
Keywords: robust
Abstract: As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.

Title: A Constrast-Agnostic Method for Ultra-High Resolution Claustrum Segmentation

Authors: Chiara Mauri, Ryan Fritz, Jocelyn Mora, Benjamin Billot, Juan Eugenio Iglesias, Koen Van Leemput, Jean Augustinack, Douglas N Greve
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15388
Pdf URL: https://arxiv.org/pdf/2411.15388
Copy Paste: [[2411.15388]] A Constrast-Agnostic Method for Ultra-High Resolution Claustrum Segmentation(https://arxiv.org/abs/2411.15388)
Keywords: robust, segmentation
Abstract: The claustrum is a band-like gray matter structure located between putamen and insula whose exact functions are still actively researched. Its sheet-like structure makes it barely visible in in vivo Magnetic Resonance Imaging (MRI) scans at typical resolutions and neuroimaging tools for its study, including methods for automatic segmentation, are currently very limited. In this paper, we propose a contrast- and resolution-agnostic method for claustrum segmentation at ultra-high resolution (0.35 mm isotropic); the method is based on the SynthSeg segmentation framework (Billot et al., 2023), which leverages the use of synthetic training intensity images to achieve excellent generalization. In particular, SynthSeg requires only label maps to be trained, since corresponding intensity images are synthesized on the fly with random contrast and resolution. We trained a deep learning network for automatic claustrum segmentation, using claustrum manual labels obtained from 18 ultra-high resolution MRI scans (mostly ex vivo). We demonstrated the method to work on these 18 high resolution cases (Dice score = 0.632, mean surface distance = 0.458 mm, and volumetric similarity = 0.867 using 6-fold Cross Validation (CV)), and also on in vivo T1-weighted MRI scans at typical resolutions (~1 mm isotropic). We also demonstrated that the method is robust in a test-retest setting and when applied to multimodal imaging (T2-weighted, Proton Density and quantitative T1 scans). To the best of our knowledge this is the first accurate method for automatic ultra-high resolution claustrum segmentation, which is robust against changes in contrast and resolution. The method is released at this https URL and as part of the neuroimaging package Freesurfer (Fischl, 2012).

Title: Gradient-Free Classifier Guidance for Diffusion Model Sampling

Authors: Rahul Shenoy, Zhihong Pan, Kaushik Balakrishnan, Qisen Cheng, Yongmoon Jeon, Heejune Yang, Jaewon Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15393
Pdf URL: https://arxiv.org/pdf/2411.15393
Copy Paste: [[2411.15393]] Gradient-Free Classifier Guidance for Diffusion Model Sampling(https://arxiv.org/abs/2411.15393)
Keywords: diffusion
Abstract: Image generation using diffusion models have demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. They are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the use of back-propagation for classifier gradient descent, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image generation diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512$\times$512, we achieve a record $\text{FD}_{\text{DINOv2}}$ of 23.09, while simultaneously attaining a higher classification Precision (94.3%) compared to ATG (90.2%)

Title: Efficient Online Inference of Vision Transformers by Training-Free Tokenization

Authors: Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15397
Pdf URL: https://arxiv.org/pdf/2411.15397
Copy Paste: [[2411.15397]] Efficient Online Inference of Vision Transformers by Training-Free Tokenization(https://arxiv.org/abs/2411.15397)
Keywords: transformer
Abstract: The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression requires additional end-to-end fine-tuning or incurs a significant drawback to runtime, thus making them ill-suited for online inference. We introduce the $\textbf{Visual Word Tokenizer}$ (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to $2\times$ or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.

Title: Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Authors: Xiaoyu Gan, Xizi Chen, Jingyang Zhu, Xiaomeng Wang, Jingbo Jiang, Chi-Ying Tsui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15403
Pdf URL: https://arxiv.org/pdf/2411.15403
Copy Paste: [[2411.15403]] Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning(https://arxiv.org/abs/2411.15403)
Keywords: federate, fair
Abstract: Substantial efforts have been devoted to alleviating the impact of the long-tailed class distribution in federated learning. In this work, we observe an interesting phenomenon that weak classes consistently exist even for class-balanced learning. These weak classes, different from the minority classes in the previous works, are inherent to data and remain fairly consistent for various network structures and learning paradigms. The inherent inter-class accuracy discrepancy can reach over 36.9% for federated learning on the FashionMNIST and CIFAR-10 datasets, even when the class distribution is balanced both globally and locally. In this study, we empirically analyze the potential reason for this phenomenon. Furthermore, a class-specific partial knowledge distillation method is proposed to improve the model's classification accuracy for weak classes. In this approach, knowledge transfer is initiated upon the occurrence of specific misclassifications within certain weak classes. Experimental results show that the accuracy of weak classes can be improved by 10.7%, reducing the inherent interclass discrepancy effectively.

Title: A Comparative Analysis of Transformer and LSTM Models for Detecting Suicidal Ideation on Reddit

Authors: Khalid Hasan, Jamil Saquer
Subjects: cs.LG, cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2411.15404
Pdf URL: https://arxiv.org/pdf/2411.15404
Copy Paste: [[2411.15404]] A Comparative Analysis of Transformer and LSTM Models for Detecting Suicidal Ideation on Reddit(https://arxiv.org/abs/2411.15404)
Keywords: robust, transformer
Abstract: Suicide is a critical global health problem involving more than 700,000 deaths yearly, particularly among young adults. Many people express their suicidal thoughts on social media platforms such as Reddit. This paper evaluates the effectiveness of the deep learning transformer-based models BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA and various Long Short-Term Memory (LSTM) based models in detecting suicidal ideation from user posts on Reddit. Toward this objective, we curated an extensive dataset from diverse subreddits and conducted linguistic, topic modeling, and statistical analyses to ensure data quality. Our results indicate that each model could reach high accuracy and F1 scores, but among them, RoBERTa emerged as the most effective model with an accuracy of 93.22% and F1 score of 93.14%. An LSTM model that uses attention and BERT embeddings performed as the second best, with an accuracy of 92.65% and an F1 score of 92.69%. Our findings show that transformer-based models have the potential to improve suicide ideation detection, thereby providing a path to develop robust mental health monitoring tools from social media. This research, therefore, underlines the undeniable prospect of advanced techniques in Natural Language Processing (NLP) while improving suicide prevention efforts.

Title: Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges, Benchmarks, and Future Directions

Authors: Shezheng Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15408
Pdf URL: https://arxiv.org/pdf/2411.15408
Copy Paste: [[2411.15408]] Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges, Benchmarks, and Future Directions(https://arxiv.org/abs/2411.15408)
Keywords: large language model
Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms and their corresponding sentiment polarities from multimodal information, including text and images. While traditional supervised learning methods have shown effectiveness in this task, the adaptability of large language models (LLMs) to MABSA remains uncertain. Recent advances in LLMs, such as Llama2, LLaVA, and ChatGPT, demonstrate strong capabilities in general tasks, yet their performance in complex and fine-grained scenarios like MABSA is underexplored. In this study, we conduct a comprehensive investigation into the suitability of LLMs for MABSA. To this end, we construct a benchmark to evaluate the performance of LLMs on MABSA tasks and compare them with state-of-the-art supervised learning methods. Our experiments reveal that, while LLMs demonstrate potential in multimodal understanding, they face significant challenges in achieving satisfactory results for MABSA, particularly in terms of accuracy and inference time. Based on these findings, we discuss the limitations of current LLMs and outline directions for future research to enhance their capabilities in multimodal sentiment analysis.

Title: FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Authors: Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15411
Pdf URL: https://arxiv.org/pdf/2411.15411
Copy Paste: [[2411.15411]] FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity(https://arxiv.org/abs/2411.15411)
Keywords: segmentation
Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different granularity levels. To support this endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.

Title: FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation

Authors: Trong Thang Pham, Ngoc-Vuong Ho, Nhat-Tan Bui, Thinh Phan, Patel Brijesh, Donald Adjeroh, Gianfranco Doretto, Anh Nguyen, Carol C. Wu, Hien Nguyen, Ngan Le
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15413
Pdf URL: https://arxiv.org/pdf/2411.15413
Copy Paste: [[2411.15413]] FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation(https://arxiv.org/abs/2411.15413)
Keywords: interpretability
Abstract: Developing an interpretable system for generating reports in chest X-ray (CXR) analysis is becoming increasingly crucial in Computer-aided Diagnosis (CAD) systems, enabling radiologists to comprehend the decisions made by these systems. Despite the growth of diverse datasets and methods focusing on report generation, there remains a notable gap in how closely these models' generated reports align with the interpretations of real radiologists. In this study, we tackle this challenge by initially introducing Fine-Grained CXR (FG-CXR) dataset, which provides fine-grained paired information between the captions generated by radiologists and the corresponding gaze attention heatmaps for each anatomy. Unlike existing datasets that include a raw sequence of gaze alongside a report, with significant misalignment between gaze location and report content, our FG-CXR dataset offers a more grained alignment between gaze attention and diagnosis transcript. Furthermore, our analysis reveals that simply applying black-box image captioning methods to generate reports cannot adequately explain which information in CXR is utilized and how long needs to attend to accurately generate reports. Consequently, we propose a novel explainable radiologist's attention generator network (Gen-XAI) that mimics the diagnosis process of radiologists, explicitly constraining its output to closely align with both radiologist's gaze attention and transcript. Finally, we perform extensive experiments to illustrate the effectiveness of our method. Our datasets and checkpoint is available at this https URL.

Title: Least Privilege Access for Persistent Storage Mechanisms in Web Browsers

Authors: Gayatri Priyadarsini Kancherla, Dishank Goel, Abhishek Bichhawat
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.15416
Pdf URL: https://arxiv.org/pdf/2411.15416
Copy Paste: [[2411.15416]] Least Privilege Access for Persistent Storage Mechanisms in Web Browsers(https://arxiv.org/abs/2411.15416)
Keywords: security
Abstract: Web applications often include third-party content and scripts to personalize a user's online experience. These scripts have unrestricted access to a user's private data stored in the browser's persistent storage like cookies, localstorage and IndexedDB, associated with the host page. Various mechanisms have been implemented to restrict access to these storage objects, e.g., content security policy, the HttpOnly attribute with cookies, etc. However, the existing mechanisms provide an all-or-none access and do not work in scenarios where web applications need to allow controlled access to cookies and localstorage objects by third-party scripts. If some of these scripts behave maliciously, they can easily access and modify private user information that are stored in the browser objects. The goal of our work is to design a mechanism to enforce fine-grained control of persistent storage objects. We perform an empirical study of persistent storage access by third-party scripts on Tranco's top 10,000 websites and find that 89.84% of all cookie accesses, 90.98% of all localstorage accesses and 72.49% of IndexedDB accesses are done by third-party scripts. Our approach enforces least privilege access for third-party scripts on these objects to ensure their security by attaching labels to the storage objects that specify which domains are allowed to read from and write to these objects. We implement our approach on the Firefox browser and show that it effectively blocks scripts from other domains, which are not allowed access based on these labels, from accessing the storage objects. We show that our enforcement results in some functionality breakage in websites with the default settings, which can be fixed by correctly labeling the storage objects used by the third-party scripts.

Title: OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Authors: Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15421
Pdf URL: https://arxiv.org/pdf/2411.15421
Copy Paste: [[2411.15421]] OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining(https://arxiv.org/abs/2411.15421)
Keywords: robust
Abstract: Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address the gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations, etc). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.

Title: LDM-Morph: Latent diffusion model guided deformable image registration

Authors: Jiong Wu, Kuang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15426
Pdf URL: https://arxiv.org/pdf/2411.15426
Copy Paste: [[2411.15426]] LDM-Morph: Latent diffusion model guided deformable image registration(https://arxiv.org/abs/2411.15426)
Keywords: diffusion, transformer
Abstract: Deformable image registration plays an essential role in various medical image tasks. Existing deep learning-based deformable registration frameworks primarily utilize convolutional neural networks (CNNs) or Transformers to learn features to predict the deformations. However, the lack of semantic information in the learned features limits the registration performance. Furthermore, the similarity metric of the loss function is often evaluated only in the pixel space, which ignores the matching of high-level anatomical features and can lead to deformation folding. To address these issues, in this work, we proposed LDM-Morph, an unsupervised deformable registration algorithm for medical image registration. LDM-Morph integrated features extracted from the latent diffusion model (LDM) to enrich the semantic information. Additionally, a latent and global feature-based cross-attention module (LGCA) was designed to enhance the interaction of semantic information from LDM and global information from multi-head self-attention operations. Finally, a hierarchical metric was proposed to evaluate the similarity of image pairs in both the original pixel space and latent-feature space, enhancing topology preservation while improving registration accuracy. Extensive experiments on four public 2D cardiac image datasets show that the proposed LDM-Morph framework outperformed existing state-of-the-art CNNs- and Transformers-based registration methods regarding accuracy and topology preservation with comparable computational efficiency. Our code is publicly available at this https URL.

Title: Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

Authors: Qizhou Chen, Chengyu Wang, Dakan Wang, Taolin Zhang, Wangyue Li, Xiaofeng He
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15432
Pdf URL: https://arxiv.org/pdf/2411.15432
Copy Paste: [[2411.15432]] Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts(https://arxiv.org/abs/2411.15432)
Keywords: robust, large language model
Abstract: Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios where edits must be continuously applied for real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, Vision LLMs (VLLMs), which incorporate an additional vision modality, are not directly adaptable to existing LLM editors. In this paper, we propose LiveEdit, a LIfelong Vision language modEl Edit to bridge the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each module design in LiveEdit.

Title: What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15435
Pdf URL: https://arxiv.org/pdf/2411.15435
Copy Paste: [[2411.15435]] What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation(https://arxiv.org/abs/2411.15435)
Keywords: fair, large language model
Abstract: While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.

Title: ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15436
Pdf URL: https://arxiv.org/pdf/2411.15436
Copy Paste: [[2411.15436]] ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance(https://arxiv.org/abs/2411.15436)
Keywords: diffusion
Abstract: Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency feature and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: this https URL

Title: Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Authors: Zhiying Li, Zhi Liu, Guanggang Geng, Shreyank N Gowda, Shuyuan Lin, Jian Weng, Xiaobo Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15439
Pdf URL: https://arxiv.org/pdf/2411.15439
Copy Paste: [[2411.15439]] Twin Trigger Generative Networks for Backdoor Attacks against Object Detection(https://arxiv.org/abs/2411.15439)
Keywords: attack, steal, generative
Abstract: Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, significantly reducing the mAP_0.5 of the object detectors by 70.0% and 84.5%, including YOLOv5 and YOLOv7 with different settings, respectively.

Title: Unveiling the Achilles' Heel: Backdoor Watermarking Forgery Attack in Public Dataset Protection

Authors: Zhiying Li, Zhi Liu, Dongjie Liu, Shengda Zhuo, Guanggang Geng, Jian Weng, Shanxiang Lyu, Xiaobo Jin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.15450
Pdf URL: https://arxiv.org/pdf/2411.15450
Copy Paste: [[2411.15450]] Unveiling the Achilles' Heel: Backdoor Watermarking Forgery Attack in Public Dataset Protection(https://arxiv.org/abs/2411.15450)
Keywords: protect, attack, watermark
Abstract: High-quality datasets can greatly promote the development of technology. However, dataset construction is expensive and time-consuming, and public datasets are easily exploited by opportunists who are greedy for quick gains, which seriously infringes the rights and interests of dataset owners. At present, backdoor watermarks redefine dataset protection as proof of ownership and become a popular method to protect the copyright of public datasets, which effectively safeguards the rights of owners and promotes the development of open source communities. In this paper, we question the reliability of backdoor watermarks and re-examine them from the perspective of attackers. On the one hand, we refine the process of backdoor watermarks by introducing a third-party judicial agency to enhance its practical applicability in real-world scenarios. On the other hand, by exploring the problem of forgery attacks, we reveal the inherent flaws of the dataset ownership verification process. Specifically, we design a Forgery Watermark Generator (FW-Gen) to generate forged watermarks and define a distillation loss between the original watermark and the forged watermark to transfer the information in the original watermark to the forged watermark. Extensive experiments show that forged watermarks have the same statistical significance as original watermarks in copyright verification tests under various conditions and scenarios, indicating that dataset ownership verification results are insufficient to determine infringement. These findings highlight the unreliability of backdoor watermarking methods for dataset ownership verification and suggest new directions for enhancing methods for protecting public datasets.

Title: Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Authors: Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15453
Pdf URL: https://arxiv.org/pdf/2411.15453
Copy Paste: [[2411.15453]] Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy(https://arxiv.org/abs/2411.15453)
Keywords: large language model
Abstract: Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs. However, there is a significant gap in the instruction-following capabilities between the MLLMs and LLMs. In this study, we conduct a pilot experiment, which demonstrates that spatially down-sampling visual tokens significantly enhances the instruction-following capability of MLLMs. This is attributed to the substantial redundancy in visual modality. However, this intuitive method severely impairs the MLLM's multimodal understanding capability. In this paper, we propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap between MLLMs and LLMs by inhibiting the influence of irrelevant visual tokens during content generation, increasing the instruction-following ability of the MLLMs while retaining their multimodal understanding capacity. In VMTC module, the primary tokens are retained and the redundant tokens are condensed by token clustering and merging. In CMAI process, we aggregate text-to-image attentions by text-to-text attentions to obtain a text-to-image focus score. Attention inhibition is performed on the text-image token pairs with low scores. Our comprehensive experiments over instruction-following capabilities and VQA-V2, GQA, TextVQA, MME and MMBench five benchmarks, demonstrate that proposed strategy significantly enhances the instruction following capability of MLLMs while preserving the ability to understand and process multimodal inputs.

Title: TANGNN: a Concise, Scalable and Effective Graph Neural Networks with Top-m Attention Mechanism for Graph Representation Learning

Authors: Jiawei E, Yinglong Zhang, Xuewen Xia, Xing Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15458
Pdf URL: https://arxiv.org/pdf/2411.15458
Copy Paste: [[2411.15458]] TANGNN: a Concise, Scalable and Effective Graph Neural Networks with Top-m Attention Mechanism for Graph Representation Learning(https://arxiv.org/abs/2411.15458)
Keywords: transformer
Abstract: In the field of deep learning, Graph Neural Networks (GNNs) and Graph Transformer models, with their outstanding performance and flexible architectural designs, have become leading technologies for processing structured data, especially graph data. Traditional GNNs often face challenges in capturing information from distant vertices effectively. In contrast, Graph Transformer models are particularly adept at managing long-distance node relationships. Despite these advantages, Graph Transformer models still encounter issues with computational and storage efficiency when scaled to large graph datasets. To address these challenges, we propose an innovative Graph Neural Network (GNN) architecture that integrates a Top-m attention mechanism aggregation component and a neighborhood aggregation component, effectively enhancing the model's ability to aggregate relevant information from both local and extended neighborhoods at each layer. This method not only improves computational efficiency but also enriches the node features, facilitating a deeper analysis of complex graph structures. Additionally, to assess the effectiveness of our proposed model, we have applied it to citation sentiment prediction, a novel task previously unexplored in the GNN field. Accordingly, we constructed a dedicated citation network, ArXivNet. In this dataset, we specifically annotated the sentiment polarity of the citations (positive, neutral, negative) to enable in-depth sentiment analysis. Our approach has shown superior performance across a variety of tasks including vertex classification, link prediction, sentiment prediction, graph regression, and visualization. It outperforms existing methods in terms of effectiveness, as demonstrated by experimental results on multiple datasets.

Title: MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

Authors: Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15459
Pdf URL: https://arxiv.org/pdf/2411.15459
Copy Paste: [[2411.15459]] MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking(https://arxiv.org/abs/2411.15459)
Keywords: robust, transformer
Abstract: The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.

Title: Towards Robust Evaluation of Unlearning in LLMs via Data Transformations

Authors: Abhinav Joshi, Shaswati Saha, Divyaksh Shukla, Sriram Vema, Harsh Jhamtani, Manas Gaur, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15477
Pdf URL: https://arxiv.org/pdf/2411.15477
Copy Paste: [[2411.15477]] Towards Robust Evaluation of Unlearning in LLMs via Data Transformations(https://arxiv.org/abs/2411.15477)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have shown to be a great success in a wide range of applications ranging from regular NLP-based use cases to AI agents. LLMs have been trained on a vast corpus of texts from various sources; despite the best efforts during the data pre-processing stage while training the LLMs, they may pick some undesirable information such as personally identifiable information (PII). Consequently, in recent times research in the area of Machine Unlearning (MUL) has become active, the main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering from performance loss on regular tasks. In this work, we examine the robustness of the existing MUL techniques for their ability to enable leakage-proof forgetting in LLMs. In particular, we examine the effect of data transformation on forgetting, i.e., is an unlearned LLM able to recall forgotten information if there is a change in the format of the input? Our findings on the TOFU dataset highlight the necessity of using diverse data formats to quantify unlearning in LLMs more reliably.

Title: Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai

Authors: Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15484
Pdf URL: https://arxiv.org/pdf/2411.15484
Copy Paste: [[2411.15484]] Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai(https://arxiv.org/abs/2411.15484)
Keywords: data-free, large language model
Abstract: We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at this https URL.

Title: Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark

Authors: Rong-Cheng Tu, Zi-Ao Ma, Tian Lan, Yuehao Zhao, Heyan Huang, Xian-Ling Mao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15488
Pdf URL: https://arxiv.org/pdf/2411.15488
Copy Paste: [[2411.15488]] Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark(https://arxiv.org/abs/2411.15488)
Keywords: diffusion, large language model
Abstract: Driven by the remarkable progress in diffusion models, text-to-image generation has made significant strides, creating a pressing demand for automatic quality evaluation of generated images. Current state-of-the-art automatic evaluation methods heavily rely on Multi-modal Large Language Models (MLLMs), particularly powerful commercial models like GPT-4o. While these models are highly effective, their substantial costs limit scalability in large-scale evaluations. Adopting open-source MLLMs is an alternative; however, their performance falls short due to significant limitations in processing multi-modal data compared to commercial MLLMs. To tackle these problems, we first propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset, where the complex evaluation task is decoupled into simpler sub-tasks, effectively reducing the learning complexity. Based on this dataset, we design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline, VIEScore, with over 4.6\% improvement in Spearman and Kendall correlations with human judgments.

Title: Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation

Authors: Junhyeok Lee, Yujin Oh, Dahyoun Lee, Hyon Keun Joh, Chul-Ho Sohn, Sung Hyun Baik, Cheol Kyu Jung, Jung Hyun Park, Kyu Sung Choi, Byung-Hoon Kim, Jong Chul Ye
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15490
Pdf URL: https://arxiv.org/pdf/2411.15490
Copy Paste: [[2411.15490]] Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation(https://arxiv.org/abs/2411.15490)
Keywords: diffusion
Abstract: Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.

Title: SilentWood: Private Inference Over Gradient-Boosting Decision Forests

Authors: Ronny Ko, Rasoul Akhavan Mahdavi, Byoungwoo Yoon, Makoto Onizuka, Florian Kerschbaum
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2411.15494
Pdf URL: https://arxiv.org/pdf/2411.15494
Copy Paste: [[2411.15494]] SilentWood: Private Inference Over Gradient-Boosting Decision Forests(https://arxiv.org/abs/2411.15494)
Keywords: privacy
Abstract: Gradient-boosting decision forests, as used by algorithms such as XGBoost or AdaBoost, offer higher accuracy and lower training times for large datasets than decision trees. Protocols for private inference over decision trees can be used to preserve the privacy of the input data as well as the privacy of the trees. However, naively extending private inference over decision trees to private inference over decision forests by replicating the protocols leads to impractical running times. In this paper, we explore extending the private decision inference protocol using homomorphic encryption by Mahdavi et al. (CCS 2023) to decision forests. We present several optimizations that identify and then remove (approximate) duplication between the trees in a forest and hence achieve significant improvements in communication and computation cost over the naive approach. To the best of our knowledge, we present the first private inference protocol for highly scalable gradient-boosting decision forests. Our optimizations extend beyond Mahdavi et al.'s protocol to various private inference protocols for gradient-boosting decision trees. Our protocol's inference time is faster than the baseline of parallel running the protocol by Mahdavi et al.~by up to 28.1x, and faster than Zama's Concrete ML XGBoost by up to 122.25x.

Title: AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Authors: Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15497
Pdf URL: https://arxiv.org/pdf/2411.15497
Copy Paste: [[2411.15497]] AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation(https://arxiv.org/abs/2411.15497)
Keywords: diffusion, generative
Abstract: Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e. AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7\%, 4.3\%, and 2.43\%, respectively. The code is available at this https URL.

Title: Interactive Visual Assessment for Text-to-Image Generation Models

Authors: Xiaoyue Mi, Fan Tang, Juan Cao, Qiang Sheng, Ziyao Huang, Peng Li, Yang Liu, Tong-Yee Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15509
Pdf URL: https://arxiv.org/pdf/2411.15509
Copy Paste: [[2411.15509]] Interactive Visual Assessment for Text-to-Image Generation Models(https://arxiv.org/abs/2411.15509)
Keywords: generative
Abstract: Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These fashions suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and reflects model potential failure patterns supporting in-depth analysis using the logical reasoning ability of LLM. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify max up to 2.56 times generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

Title: CellPilot

Authors: Philipp Endres, Valentin Koch, Julia A. Schnabel, Carsten Marr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15514
Pdf URL: https://arxiv.org/pdf/2411.15514
Copy Paste: [[2411.15514]] CellPilot(https://arxiv.org/abs/2411.15514)
Keywords: robust, segmentation
Abstract: Histopathology, the microscopic study of diseased tissue, is increasingly digitized, enabling improved visualization and streamlined workflows. An important task in histopathology is the segmentation of cells and glands, essential for determining shape and frequencies that can serve as indicators of disease. Deep learning tools are widely used in histopathology. However, variability in tissue appearance and cell morphology presents challenges for achieving reliable segmentation, often requiring manual correction to improve accuracy. This work introduces CellPilot, a framework that bridges the gap between automatic and interactive segmentation by providing initial automatic segmentation as well as guided interactive refinement. Our model was trained on over 675,000 masks of nine diverse cell and gland segmentation datasets, spanning 16 organs. CellPilot demonstrates superior performance compared to other interactive tools on three held-out histopathological datasets while enabling automatic segmentation. We make the model and a graphical user interface designed to assist practitioners in creating large-scale annotated datasets available as open-source, fostering the development of more robust and generalized diagnostic models.

Title: Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset

Authors: Rahul Nihalani, Kushal Shah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15523
Pdf URL: https://arxiv.org/pdf/2411.15523
Copy Paste: [[2411.15523]] Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset(https://arxiv.org/abs/2411.15523)
Keywords: transformer
Abstract: This paper presents an improved LLM based model for Grammatical Error Detection (GED), which is a very challenging and equally important problem for many applications. The traditional approach to GED involved hand-designed features, but recently, Neural Networks (NN) have automated the discovery of these features, improving performance in GED. Traditional rule-based systems have an F1 score of 0.50-0.60 and earlier machine learning models give an F1 score of 0.65-0.75, including decision trees and simple neural networks. Previous deep learning models, for example, Bi-LSTM, have reported F1 scores within the range from 0.80 to 0.90. In our study, we have fine-tuned various transformer models using the Lang8 dataset rigorously cleaned by us. In our experiments, the BERT-base-uncased model gave an impressive performance with an F1 score of 0.91 and accuracy of 98.49% on training data and 90.53% on testing data, also showcasing the importance of data cleaning. Increasing model size using BERT-large-uncased or RoBERTa-large did not give any noticeable improvements in performance or advantage for this task, underscoring that larger models are not always better. Our results clearly show how far rigorous data cleaning and simple transformer-based models can go toward significantly improving the quality of GED.

Title: Haar-Laplacian for directed graphs

Authors: Theodor-Adrian Badea, Bogdan Dumitrescu
Subjects: cs.LG, cs.SI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15527
Pdf URL: https://arxiv.org/pdf/2411.15527
Copy Paste: [[2411.15527]] Haar-Laplacian for directed graphs(https://arxiv.org/abs/2411.15527)
Keywords: robust
Abstract: This paper introduces a novel Laplacian matrix aiming to enable the construction of spectral convolutional networks and to extend the signal processing applications for directed graphs. Our proposal is inspired by a Haar-like transformation and produces a Hermitian matrix which is not only in one-to-one relation with the adjacency matrix, preserving both direction and weight information, but also enjoys desirable additional properties like scaling robustness, sensitivity, continuity, and directionality. We take a theoretical standpoint and support the conformity of our approach with the spectral graph theory. Then, we address two use-cases: graph learning (by introducing HaarNet, a spectral graph convolutional network built with our Haar-Laplacian) and graph signal processing. We show that our approach gives better results in applications like weight prediction and denoising on directed graphs.

Title: MUNBa: Machine Unlearning via Nash Bargaining

Authors: Jing Wu, Mehrtash Harandi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15537
Pdf URL: https://arxiv.org/pdf/2411.15537
Copy Paste: [[2411.15537]] MUNBa: Machine Unlearning via Nash Bargaining(https://arxiv.org/abs/2411.15537)
Keywords: attack, robust, diffusion
Abstract: Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto front, effectively avoiding the gradient conflicts. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving superior performance on several benchmarks. For example, in the challenging scenario of sample-wise forgetting, our algorithm approaches the gold standard retrain baseline. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.

Title: Large Language Model with Region-guided Referring and Grounding for CT Report Generation

Authors: Zhixuan Chen, Yequan Bie, Haibo Jin, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15539
Pdf URL: https://arxiv.org/pdf/2411.15539
Copy Paste: [[2411.15539]] Large Language Model with Region-guided Referring and Grounding for CT Report Generation(https://arxiv.org/abs/2411.15539)
Keywords: interpretability, large language model, segmentation
Abstract: Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily only consider the global features of the entire volume, making it struggle to focus on specific regions and potentially missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code will be made publicly available.

Title: Optical-Flow Guided Prompt Optimization for Coherent Video Generation

Authors: Hyelin Nam, Jaemin Kim, Dohun Lee, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15540
Pdf URL: https://arxiv.org/pdf/2411.15540
Copy Paste: [[2411.15540]] Optical-Flow Guided Prompt Optimization for Coherent Video Generation(https://arxiv.org/abs/2411.15540)
Keywords: diffusion
Abstract: While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.

Title: Hierarchical Cross-Attention Network for Virtual Try-On

Authors: Hao Tang, Bin Ren, Pingping Wu, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15542
Pdf URL: https://arxiv.org/pdf/2411.15542
Copy Paste: [[2411.15542]] Hierarchical Cross-Attention Network for Virtual Try-On(https://arxiv.org/abs/2411.15542)
Keywords: robust
Abstract: In this paper, we present an innovative solution for the challenges of the virtual try-on task: our novel Hierarchical Cross-Attention Network (HCANet). HCANet is crafted with two primary stages: geometric matching and try-on, each playing a crucial role in delivering realistic virtual try-on outcomes. A key feature of HCANet is the incorporation of a novel Hierarchical Cross-Attention (HCA) block into both stages, enabling the effective capture of long-range correlations between individual and clothing modalities. The HCA block enhances the depth and robustness of the network. By adopting a hierarchical approach, it facilitates a nuanced representation of the interaction between the person and clothing, capturing intricate details essential for an authentic virtual try-on experience. Our experiments establish the prowess of HCANet. The results showcase its performance across both quantitative metrics and subjective evaluations of visual realism. HCANet stands out as a state-of-the-art solution, demonstrating its capability to generate virtual try-on results that excel in accuracy and realism. This marks a significant step in advancing virtual try-on technologies.

Title: NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation

Authors: Menglin Zhang, Xin Luo, Yunwei Lan, Chang Liu, Rui Li, Kaidong Zhang, Ganlin Yang, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15551
Pdf URL: https://arxiv.org/pdf/2411.15551
Copy Paste: [[2411.15551]] NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation(https://arxiv.org/abs/2411.15551)
Keywords: diffusion
Abstract: Recent advances in NeRF inpainting have leveraged pretrained diffusion models to enhance performance. However, these methods often yield suboptimal results due to their ineffective utilization of 2D diffusion priors. The limitations manifest in two critical aspects: the inadequate capture of geometric information by pretrained diffusion models and the suboptimal guidance provided by existing Score Distillation Sampling (SDS) methods. To address these problems, we introduce GB-NeRF, a novel framework that enhances NeRF inpainting through improved utilization of 2D diffusion priors. Our approach incorporates two key innovations: a fine-tuning strategy that simultaneously learns appearance and geometric priors and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. We propose a technique called Balanced Score Distillation (BSD) that surpasses existing methods such as Score Distillation (SDS) and the improved version, Conditional Score Distillation (CSD). BSD offers improved inpainting quality in appearance and geometric aspects. Extensive experiments show that our method provides superior appearance fidelity and geometric consistency compared to existing approaches.

Title: Improving Transferable Targeted Attacks with Feature Tuning Mixup

Authors: Kaisheng Liang, Xuelong Dai, Yanjie Li, Dong Wang, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15553
Pdf URL: https://arxiv.org/pdf/2411.15553
Copy Paste: [[2411.15553]] Improving Transferable Targeted Attacks with Feature Tuning Mixup(https://arxiv.org/abs/2411.15553)
Keywords: attack, robust
Abstract: Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.

Title: Enhancing the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation

Authors: Fengfan Zhou, Bangjie Yin, Hefei Ling, Qianyu Zhou, Wenxuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15555
Pdf URL: https://arxiv.org/pdf/2411.15555
Copy Paste: [[2411.15555]] Enhancing the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation(https://arxiv.org/abs/2411.15555)
Keywords: attack
Abstract: Face Recognition (FR) models are vulnerable to adversarial examples that subtly manipulate benign face images, underscoring the urgent need to improve the transferability of adversarial attacks in order to expose the blind spots of these systems. Existing adversarial attack methods often overlook the potential benefits of augmenting the surrogate model with diverse initializations, which limits the transferability of the generated adversarial examples. To address this gap, we propose a novel method called Diverse Parameters Augmentation (DPA) attack method, which enhances surrogate models by incorporating diverse parameter initializations, resulting in a broader and more diverse set of surrogate models. Specifically, DPA consists of two key stages: Diverse Parameters Optimization (DPO) and Hard Model Aggregation (HMA). In the DPO stage, we initialize the parameters of the surrogate model using both pre-trained and random parameters. Subsequently, we save the models in the intermediate training process to obtain a diverse set of surrogate models. During the HMA stage, we enhance the feature maps of the diversified surrogate models by incorporating beneficial perturbations, thereby further improving the transferability. Experimental results demonstrate that our proposed attack method can effectively enhance the transferability of the crafted adversarial face examples.

Title: ReWind: Understanding Long Videos with Instructed Learnable Memory

Authors: Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, Ioannis Patras
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15556
Pdf URL: https://arxiv.org/pdf/2411.15556
Copy Paste: [[2411.15556]] ReWind: Understanding Long Videos with Instructed Learnable Memory(https://arxiv.org/abs/2411.15556)
Keywords: large language model
Abstract: Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel \textbf{read-perceive-write} cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13\% score gain and a +12\% accuracy improvement on the MovieChat-1K VQA dataset and an +8\% mIoU increase on Charades-STA for temporal grounding.

Title: Reassessing Layer Pruning in LLMs: New Insights and Methods

Authors: Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15558
Pdf URL: https://arxiv.org/pdf/2411.15558
Copy Paste: [[2411.15558]] Reassessing Layer Pruning in LLMs: New Insights and Methods(https://arxiv.org/abs/2411.15558)
Keywords: large language model
Abstract: Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Approximation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final 25\% of layers followed by fine-tuning the \texttt{lm\_head} and the remaining last three layer, yields remarkably strong performance. Following this guide, we prune Llama-3.1-8B-It and obtain a model that outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B and Baichuan2-7B. We release the optimal model weights on Huggingface, and the code is available on GitHub.

Title: TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

Authors: Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Takahiro Shirakawa, Ko Watanabe, Andreas Dengel, Jinjia Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15580
Pdf URL: https://arxiv.org/pdf/2411.15580
Copy Paste: [[2411.15580]] TKG-DM: Training-free Chroma Key Content Generation Diffusion Model(https://arxiv.org/abs/2411.15580)
Keywords: diffusion, generative
Abstract: Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.

Title: FLD+: Data-efficient Evaluation Metric for Generative Models

Authors: Pranav Jeevan, Neeraj Nixon, Amit Sethi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15584
Pdf URL: https://arxiv.org/pdf/2411.15584
Copy Paste: [[2411.15584]] FLD+: Data-efficient Evaluation Metric for Generative Models(https://arxiv.org/abs/2411.15584)
Keywords: diffusion, generative
Abstract: We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than the previous metrics, such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allows for the computation of density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flow can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires more images to reliably compute Fréchet distance between features of large samples of real and generated images). We made FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics -- such as InceptionNetV3 pre-trained on ImageNet.

Title: Transparent but Powerful: Explainability, Accuracy, and Generalizability in ADHD Detection from Social Media Data

Authors: D. Wiechmann, E. Kempa, E. Kerz, Y. Qiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15586
Pdf URL: https://arxiv.org/pdf/2411.15586
Copy Paste: [[2411.15586]] Transparent but Powerful: Explainability, Accuracy, and Generalizability in ADHD Detection from Social Media Data(https://arxiv.org/abs/2411.15586)
Keywords: interpretability, explainability, transformer
Abstract: Attention-deficit/hyperactivity disorder (ADHD) is a prevalent mental health condition affecting both children and adults, yet it remains severely underdiagnosed. Recent advances in artificial intelligence, particularly in Natural Language Processing (NLP) and Machine Learning (ML), offer promising solutions for scalable and non-invasive ADHD screening methods using social media data. This paper presents a comprehensive study on ADHD detection, leveraging both shallow machine learning models and deep learning approaches, including BiLSTM and transformer-based models, to analyze linguistic patterns in ADHD-related social media text. Our results highlight the trade-offs between interpretability and performance across different models, with BiLSTM offering a balance of transparency and accuracy. Additionally, we assess the generalizability of these models using cross-platform data from Reddit and Twitter, uncovering key linguistic features associated with ADHD that could contribute to more effective digital screening tools.

Title: A Survey on LLM-as-a-Judge

Authors: Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, Jian Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15594
Pdf URL: https://arxiv.org/pdf/2411.15594
Copy Paste: [[2411.15594]] A Survey on LLM-as-a-Judge(https://arxiv.org/abs/2411.15594)
Keywords: large language model
Abstract: Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

Title: An adversarial feature learning based semantic communication method for Human 3D Reconstruction

Authors: Shaojiang Liu, Jiajun Zou, Zhendan Liu, Meixia Dong, Zhiping Wan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15595
Pdf URL: https://arxiv.org/pdf/2411.15595
Copy Paste: [[2411.15595]] An adversarial feature learning based semantic communication method for Human 3D Reconstruction(https://arxiv.org/abs/2411.15595)
Keywords: extraction, diffusion
Abstract: With the widespread application of human body 3D reconstruction technology across various fields, the demands for data transmission and processing efficiency continue to rise, particularly in scenarios where network bandwidth is limited and low latency is required. This paper introduces an Adversarial Feature Learning-based Semantic Communication method (AFLSC) for human body 3D reconstruction, which focuses on extracting and transmitting semantic information crucial for the 3D reconstruction task, thereby significantly optimizing data flow and alleviating bandwidth pressure. At the sender's end, we propose a multitask learning-based feature extraction method to capture the spatial layout, keypoints, posture, and depth information from 2D human images, and design a semantic encoding technique based on adversarial feature learning to encode these feature information into semantic data. We also develop a dynamic compression technique to efficiently transmit this semantic data, greatly enhancing transmission efficiency and reducing latency. At the receiver's end, we design an efficient multi-level semantic feature decoding method to convert semantic data back into key image features. Finally, an improved ViT-diffusion model is employed for 3D reconstruction, producing human body 3D mesh models. Experimental results validate the advantages of our method in terms of data transmission efficiency and reconstruction quality, demonstrating its excellent potential for application in bandwidth-limited environments.

Title: Knowledge Transfer Across Modalities with Natural Language Supervision

Authors: Carlo Alberto Barbano, Luca Molinaro, Emanuele Aiello, Marco Grangetto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15611
Pdf URL: https://arxiv.org/pdf/2411.15611
Copy Paste: [[2411.15611]] Knowledge Transfer Across Modalities with Natural Language Supervision(https://arxiv.org/abs/2411.15611)
Keywords: segmentation
Abstract: We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that in a pre-trained visual encoder there are enough low-level features already learned (e.g. shape, appearance, color) that can be used to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts in multimodal models, in a very efficient manner, by only requiring a single description of the target concept. Our approach is compatible with both separate textual and visual encoders (e.g. CLIP) and shared parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. Leveraging Knowledge Transfer we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.

Title: A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation

Authors: Vennela Yarabolu, Govind Waghmare, Sonia Gupta, Siddhartha Asthana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15616
Pdf URL: https://arxiv.org/pdf/2411.15616
Copy Paste: [[2411.15616]] A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation(https://arxiv.org/abs/2411.15616)
Keywords: robust, segmentation
Abstract: In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift, a phenomenon where discrepancies between historical training data and future test data lead to significant performance degradation and operational inefficiencies. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data, and focus primarily on either covariate drift or concept drift. These methods face issues such as high resource demands, inability to manage all types of drifts effectively, and neglecting the valuable context that historical data can provide. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness. This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner. Our framework employs sophisticated data segmentation techniques to identify optimal data batches that accurately reflect test data patterns. These data batches are then utilized for training on test data, ensuring that the models remain relevant and accurate over time. By leveraging the advantages of both data segmentation and scalable drift management, our solution ensures robust model accuracy and operational efficiency in large-scale ML deployments. It also minimizes resource consumption and computational overhead by selecting and utilizing relevant data subsets, leading to significant cost savings. Experimental results on classification task on real-world and synthetic datasets show our approach improves model accuracy while reducing operational costs and latency. This practical solution overcomes inefficiencies in current methods, providing a robust, adaptable, and scalable approach.

Title: Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Authors: Jinwoo Ahn, Hyeokjoon Kwon, Hwiyeon Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15620
Pdf URL: https://arxiv.org/pdf/2411.15620
Copy Paste: [[2411.15620]] Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation(https://arxiv.org/abs/2411.15620)
Keywords: segmentation
Abstract: Recent advent of vision-based foundation models has enabled efficient and high-quality object detection at ease. Despite the success of previous studies, object detection models face limitations on capturing small components from holistic objects and taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and allow users to directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention yet grants them significant control. With FOCUS, users can make explainable requests to actively guide the detection process in the intended direction. Our results show that FOCUS effectively enhances the detection capabilities of baseline models and shows consistent performance across varying object types.

Title: Multi-label Sequential Sentence Classification via Large Language Model

Authors: Mengfei Lan, Lecheng Zheng, Shufan Ming, Halil Kilicoglu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15623
Pdf URL: https://arxiv.org/pdf/2411.15623
Copy Paste: [[2411.15623]] Multi-label Sequential Sentence Classification via Large Language Model(https://arxiv.org/abs/2411.15623)
Keywords: large language model
Abstract: Sequential sentence classification (SSC) in scientific publications is crucial for supporting downstream tasks such as fine-grained information retrieval and extractive summarization. However, current SSC methods are constrained by model size, sequence length, and single-label setting. To address these limitations, this paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks. Unlike previous approaches that employ small- or medium-sized language models, the proposed framework utilizes LLMs to generate SSC labels through designed prompts, which enhance task understanding by incorporating demonstrations and a query to describe the prediction target. We also present a multi-label contrastive learning loss with auto-weighting scheme, enabling the multi-label classification task. To support our multi-label SSC analysis, we introduce and release a new dataset, biorc800, which mainly contains unstructured abstracts in the biomedical domain with manual annotations. Experiments demonstrate LLM-SSC's strong performance in SSC under both in-context learning and task-specific tuning settings. We release biorc800 and our code at: this https URL.

Title: ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Authors: Reza Ghoddoosian, Nakul Agarwal, Isht Dwivedi, Behzad Darisuh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15628
Pdf URL: https://arxiv.org/pdf/2411.15628
Copy Paste: [[2411.15628]] ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos(https://arxiv.org/abs/2411.15628)
Keywords: robust
Abstract: Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM, by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our experiments on the ATA, IKEA and GTEA datasets demonstrate the efficacy of ACE in domains of cooking and assembly leading to significant improvements in zero-shot action classification while maintaining competitive performance on seen actions.

Title: "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Authors: Michael Hardy
Subjects: cs.CL, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2411.15634
Pdf URL: https://arxiv.org/pdf/2411.15634
Copy Paste: [[2411.15634]] "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations(https://arxiv.org/abs/2411.15634)
Keywords: fair, transformer, large language model
Abstract: "Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families--encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even "super-human", results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.

Title: Learning state and proposal dynamics in state-space models using differentiable particle filters and neural networks

Authors: Benjamin Cox, Santiago Segarra, Victor Elvira
Subjects: cs.LG, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15638
Pdf URL: https://arxiv.org/pdf/2411.15638
Copy Paste: [[2411.15638]] Learning state and proposal dynamics in state-space models using differentiable particle filters and neural networks(https://arxiv.org/abs/2411.15638)
Keywords: interpretability
Abstract: State-space models are a popular statistical framework for analysing sequential data. Within this framework, particle filters are often used to perform inference on non-linear state-space models. We introduce a new method, StateMixNN, that uses a pair of neural networks to learn the proposal distribution and transition distribution of a particle filter. Both distributions are approximated using multivariate Gaussian mixtures. The component means and covariances of these mixtures are learnt as outputs of learned functions. Our method is trained targeting the log-likelihood, thereby requiring only the observation series, and combines the interpretability of state-space models with the flexibility and approximation power of artificial neural networks. The proposed method significantly improves recovery of the hidden state in comparison with the state-of-the-art, showing greater improvement in highly non-linear scenarios.

Title: AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset

Authors: Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood Yekini, Jonas Kemp, Katherine Heller, Jude Chidubem Omeke, Chidi Asuzu MD, Naome A. Etori, Aimérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, Irfan Essa, Stephen Edward Moore, Chris Fourie, Mercy Nyamewaa Asiedu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15640
Pdf URL: https://arxiv.org/pdf/2411.15640
Copy Paste: [[2411.15640]] AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset(https://arxiv.org/abs/2411.15640)
Keywords: large language model
Abstract: Recent advancements in large language model(LLM) performance on medical multiple choice question (MCQ) benchmarks have stimulated interest from healthcare providers and patients globally. Particularly in low-and middle-income countries (LMICs) facing acute physician shortages and lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, their effectiveness in the Global South, especially across the African continent, remains to be established. In this work, we introduce AfriMed-QA, the first large scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias. Our findings show significant performance variation across specialties and geographies, MCQ performance clearly lags USMLE (MedQA). We find that biomedical LLMs underperform general models and smaller edge-friendly LLMs struggle to achieve a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations when compared with clinician answers.

Title: MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree

Authors: Gollam Rabby, Farhana Keya, Parvez Zamil, Sören Auer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15645
Pdf URL: https://arxiv.org/pdf/2411.15645
Copy Paste: [[2411.15645]] MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree(https://arxiv.org/abs/2411.15645)
Keywords: large language model
Abstract: Mathematical reasoning has proven to be a critical yet challenging task for large language models (LLMs), as they often struggle with complex multi-step problems. To address these limitations, we introduce the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST) algorithm, an enhancement of the Monte Carlo Tree Self-Refine (MCTSr) approach. By integrating Nash Equilibrium strategies with LLM-based self-refinement and self-evaluation processes, MC-NEST aims to improve decision-making for complex mathematical reasoning tasks. This method ensures balanced exploration and exploitation of potential solutions, leveraging Upper Confidence Bound (UCT) scores and various selection policies. Through iterative critique and refinement, MC-NEST enhances the reasoning capabilities of LLMs, particularly for problems requiring strategic decision-making. Comparative analysis reveals that GPT-4o, equipped with MC-NEST using an Importance Sampling Policy, achieved superior accuracy in domains such as Number Theory and Geometry. These results suggest that both LLMs GPT-4o and Phi-3-mini can benefit from MC-NEST, with iterative self-refinement proving especially effective in expanding the reasoning capacity and problem-solving performance of LLMs. We evaluate the effectiveness of MC-NEST on challenging Olympiad-level benchmarks, demonstrating its potential to significantly boost complex mathematical reasoning performance in LLMs.

Title: OCDet: Object Center Detection via Bounding Box-Aware Heatmap Prediction on Edge Devices with NPUs

Authors: Chen Xin, Thomas Motz, Andreas Hartel, Enkelejda Kasneci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15653
Pdf URL: https://arxiv.org/pdf/2411.15653
Copy Paste: [[2411.15653]] OCDet: Object Center Detection via Bounding Box-Aware Heatmap Prediction on Edge Devices with NPUs(https://arxiv.org/abs/2411.15653)
Keywords: robust, segmentation
Abstract: Real-time object localization on edge devices is fundamental for numerous applications, ranging from surveillance to industrial automation. Traditional frameworks, such as object detection, segmentation, and keypoint detection, struggle in resource-constrained environments, often resulting in substantial target omissions. To address these challenges, we introduce OCDet, a lightweight Object Center Detection framework optimized for edge devices with NPUs. OCDet predicts heatmaps representing object center probabilities and extracts center points through peak identification. Unlike prior methods using fixed Gaussian distribution, we introduce Generalized Centerness (GC) to generate ground truth heatmaps from bounding box annotations, providing finer spatial details without additional manual labeling. Built on NPU-friendly Semantic FPN with MobileNetV4 backbones, OCDet models are trained by our Balanced Continuous Focal Loss (BCFL), which alleviates data imbalance and focuses training on hard negative examples for probability regression tasks. Leveraging the novel Center Alignment Score (CAS) with Hungarian matching, we demonstrate that OCDet consistently outperforms YOLO11 in object center detection, achieving up to 23% higher CAS while requiring 42% fewer parameters, 34% less computation, and 64% lower NPU latency. When compared to keypoint detection frameworks, OCDet achieves substantial CAS improvements up to 186% using identical models. By integrating GC, BCFL, and CAS, OCDet establishes a new paradigm for efficient and robust object center detection on edge devices with NPUs. The code is released at this https URL.

Title: Machine Learning-based sEMG Signal Classification for Hand Gesture Recognition

Authors: Parshuram N. Aarotale, Ajita Rattani
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15655
Pdf URL: https://arxiv.org/pdf/2411.15655
Copy Paste: [[2411.15655]] Machine Learning-based sEMG Signal Classification for Hand Gesture Recognition(https://arxiv.org/abs/2411.15655)
Keywords: extraction
Abstract: EMG-based hand gesture recognition uses electromyographic~(EMG) signals to interpret and classify hand movements by analyzing electrical activity generated by muscle contractions. It has wide applications in prosthesis control, rehabilitation training, and human-computer interaction. Using electrodes placed on the skin, the EMG sensor captures muscle signals, which are processed and filtered to reduce noise. Numerous feature extraction and machine learning algorithms have been proposed to extract and classify muscle signals to distinguish between various hand gestures. This paper aims to benchmark the performance of EMG-based hand gesture recognition using novel feature extraction methods, namely, fused time-domain descriptors, temporal-spatial descriptors, and wavelet transform-based features, combined with the state-of-the-art machine and deep learning models. Experimental investigations on the Grabmyo dataset demonstrate that the 1D Dilated CNN performed the best with an accuracy of $97\%$ using fused time-domain descriptors such as power spectral moments, sparsity, irregularity factor and waveform length ratio. Similarly, on the FORS-EMG dataset, random forest performed the best with an accuracy of $94.95\%$ using temporal-spatial descriptors (which include time domain features along with additional features such as coefficient of variation (COV), and Teager-Kaiser energy operator (TKEO)).

Title: Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

Authors: Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15657
Pdf URL: https://arxiv.org/pdf/2411.15657
Copy Paste: [[2411.15657]] Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data(https://arxiv.org/abs/2411.15657)
Keywords: large language model
Abstract: Open-vocabulary 3D object detection has recently attracted considerable attention due to its broad applications in autonomous driving and robotics, which aims to effectively recognize novel classes in previously unseen domains. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate due to imprecise boxes estimated from noisy point clouds and severely occluded objects. To address these issues, we introduce two innovative designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques effectively calibrate the 3D labels and enable RGB-only training for 3D detectors. Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. The code will be released.

Title: Ontology-Constrained Generation of Domain-Specific Clinical Summaries

Authors: Gaya Mehenni, Amal Zouaq
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15666
Pdf URL: https://arxiv.org/pdf/2411.15666
Copy Paste: [[2411.15666]] Ontology-Constrained Generation of Domain-Specific Clinical Summaries(https://arxiv.org/abs/2411.15666)
Keywords: large language model
Abstract: Large Language Models (LLMs) offer promising solutions for text summarization. However, some domains require specific information to be available in the summaries. Generating these domain-adapted summaries is still an open challenge. Similarly, hallucinations in generated content is a major drawback of current approaches, preventing their deployment. This study proposes a novel approach that leverages ontologies to create domain-adapted summaries both structured and unstructured. We employ an ontology-guided constrained decoding process to reduce hallucinations while improving relevance. When applied to the medical domain, our method shows potential in summarizing Electronic Health Records (EHRs) across different specialties, allowing doctors to focus on the most relevant information to their domain. Evaluation on the MIMIC-III dataset demonstrates improvements in generating domain-adapted summaries of clinical notes and hallucination reduction.

Title: Best of Both Worlds: Advantages of Hybrid Graph Sequence Models

Authors: Ali Behrouz, Ali Parviz, Mahdi Karami, Clayton Sanford, Bryan Perozzi, Vahab Mirrokni
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2411.15671
Pdf URL: https://arxiv.org/pdf/2411.15671
Copy Paste: [[2411.15671]] Best of Both Worlds: Advantages of Hybrid Graph Sequence Models(https://arxiv.org/abs/2411.15671)
Keywords: transformer
Abstract: Modern sequence models (e.g., Transformers, linear RNNs, etc.) emerged as dominant backbones of recent deep learning frameworks, mainly due to their efficiency, representational power, and/or ability to capture long-range dependencies. Adopting these sequence models for graph-structured data has recently gained popularity as the alternative to Message Passing Neural Networks (MPNNs). There is, however, a lack of a common foundation about what constitutes a good graph sequence model, and a mathematical description of the benefits and deficiencies in adopting different sequence models for learning on graphs. To this end, we first present Graph Sequence Model (GSM), a unifying framework for adopting sequence models for graphs, consisting of three main steps: (1) Tokenization, which translates the graph into a set of sequences; (2) Local Encoding, which encodes local neighborhoods around each node; and (3) Global Encoding, which employs a scalable sequence model to capture long-range dependencies within the sequences. This framework allows us to understand, evaluate, and compare the power of different sequence model backbones in graph tasks. Our theoretical evaluations of the representation power of Transformers and modern recurrent models through the lens of global and local graph tasks show that there are both negative and positive sides for both types of models. Building on this observation, we present GSM++, a fast hybrid model that uses the Hierarchical Affinity Clustering (HAC) algorithm to tokenize the graph into hierarchical sequences, and then employs a hybrid architecture of Transformer to encode these sequences. Our theoretical and experimental results support the design of GSM++, showing that GSM++ outperforms baselines in most benchmark evaluations.

Title: IRSKG: Unified Intrusion Response System Knowledge Graph Ontology for Cyber Defense

Authors: Damodar Panigrahi, Shaswata Mitra, Subash Neupane, Sudip Mittal, Benjamin A. Blakely
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15672
Pdf URL: https://arxiv.org/pdf/2411.15672
Copy Paste: [[2411.15672]] IRSKG: Unified Intrusion Response System Knowledge Graph Ontology for Cyber Defense(https://arxiv.org/abs/2411.15672)
Keywords: defense, attack, robust, explainability
Abstract: Cyberattacks are becoming increasingly difficult to detect and prevent due to their sophistication. In response, Autonomous Intelligent Cyber-defense Agents (AICAs) are emerging as crucial solutions. One prominent AICA agent is the Intrusion Response System (IRS), which is critical for mitigating threats after detection. IRS uses several Tactics, Techniques, and Procedures (TTPs) to mitigate attacks and restore the infrastructure to normal operations. Continuous monitoring of the enterprise infrastructure is an essential TTP the IRS uses. However, each system serves different purposes to meet operational needs. Integrating these disparate sources for continuous monitoring increases pre-processing complexity and limits automation, eventually prolonging critical response time for attackers to exploit. We propose a unified IRS Knowledge Graph ontology (IRSKG) that streamlines the onboarding of new enterprise systems as a source for the AICAs. Our ontology can capture system monitoring logs and supplemental data, such as a rules repository containing the administrator-defined policies to dictate the IRS responses. Besides, our ontology permits us to incorporate dynamic changes to adapt to the evolving cyber-threat landscape. This robust yet concise design allows machine learning models to train effectively and recover a compromised system to its desired state autonomously with explainability.

Title: Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment

Authors: Alvi Md Ishmam, Christopher Thomas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15673
Pdf URL: https://arxiv.org/pdf/2411.15673
Copy Paste: [[2411.15673]] Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment(https://arxiv.org/abs/2411.15673)
Keywords: security, attack
Abstract: In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time

Title: Can a Large Language Model Learn Matrix Functions In Context?

Authors: Paimon Goulart, Evangelos E. Papalexakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15675
Pdf URL: https://arxiv.org/pdf/2411.15675
Copy Paste: [[2411.15675]] Can a Large Language Model Learn Matrix Functions In Context?(https://arxiv.org/abs/2411.15675)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL), where models learn from a few input-output pairs without explicit fine-tuning. In this paper, we explore the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition. Our experiments show that while LLMs perform comparably to traditional models such as Stochastic Gradient Descent (SGD) based Linear Regression and Neural Networks (NN) for simpler tasks, they outperform these models on more complex tasks, particularly in the case of top-k Singular Values. Furthermore, LLMs demonstrate strong scalability, maintaining high accuracy even as the matrix size increases. Additionally, we found that LLMs can achieve high accuracy with minimal prior examples, converging quickly and avoiding the overfitting seen in classical models. These results suggest that LLMs could provide an efficient alternative to classical methods for solving high-dimensional problems. Future work will focus on extending these findings to larger matrices and more complex matrix operations while exploring the effect of using different numerical representations in ICL.

Title: DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration

Authors: Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Tianfan Fu, Yue Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15692
Pdf URL: https://arxiv.org/pdf/2411.15692
Copy Paste: [[2411.15692]] DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration(https://arxiv.org/abs/2411.15692)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have opened new avenues for accelerating drug discovery processes. Despite their potential, several critical challenges remain unsolved, particularly in translating theoretical ideas into practical applications within the highly specialized field of pharmaceutical research, limiting practitioners from leveraging the latest AI development in drug discovery. To this end, we introduce DrugAgent, a multi-agent framework aimed at automating machine learning (ML) programming in drug discovery. DrugAgent incorporates domain expertise by identifying specific requirements and building domain-specific tools, while systematically exploring different ideas to find effective solutions. A preliminary case study demonstrates DrugAgent's potential to overcome key limitations LLMs face in drug discovery, moving toward AI-driven innovation. For example, DrugAgent is able to complete the ML programming pipeline end-to-end, from data acquisition to performance evaluation for the ADMET prediction task, and finally select the best model, where the random forest model achieves an F1 score of 0.92 when predicting absorption using the PAMPA dataset.

Title: Deep Sparse Latent Feature Models for Knowledge Graph Completion

Authors: Haotian Li, Rui Zhang, Lingzhi Wang, Bin Yu, Youwei Wang, Yuliang Wei, Kai Wang, Richard Yi Da Xu, Bailing Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15694
Pdf URL: https://arxiv.org/pdf/2411.15694
Copy Paste: [[2411.15694]] Deep Sparse Latent Feature Models for Knowledge Graph Completion(https://arxiv.org/abs/2411.15694)
Keywords: robust, interpretability
Abstract: Recent progress in knowledge graph completion (KGC) has focused on text-based approaches to address the challenges of large-scale knowledge graphs (KGs). Despite their achievements, these methods often overlook the intricate interconnections between entities, a key aspect of the underlying topological structure of a KG. Stochastic blockmodels (SBMs), particularly the latent feature relational model (LFRM), offer robust probabilistic frameworks that can dynamically capture latent community structures and enhance link prediction. In this paper, we introduce a novel framework of sparse latent feature models for KGC, optimized through a deep variational autoencoder (VAE). Our approach not only effectively completes missing triples but also provides clear interpretability of the latent structures, leveraging textual information. Comprehensive experiments on the WN18RR, FB15k-237, and Wikidata5M datasets show that our method significantly improves performance by revealing latent communities and producing interpretable representations.

Title: RAMIE: Retrieval-Augmented Multi-task Information Extraction with Large Language Models on Dietary Supplements

Authors: Zaifu Zhan, Shuang Zhou, Mingchen Li, Rui Zhang
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2411.15700
Pdf URL: https://arxiv.org/pdf/2411.15700
Copy Paste: [[2411.15700]] RAMIE: Retrieval-Augmented Multi-task Information Extraction with Large Language Models on Dietary Supplements(https://arxiv.org/abs/2411.15700)
Keywords: extraction, large language model
Abstract: \textbf{Objective:} We aimed to develop an advanced multi-task large language model (LLM) framework to extract multiple types of information about dietary supplements (DS) from clinical records. \textbf{Methods:} We used four core DS information extraction tasks - namely, named entity recognition (NER: 2,949 clinical sentences), relation extraction (RE: 4,892 sentences), triple extraction (TE: 2,949 sentences), and usage classification (UC: 2,460 sentences) as our multitasks. We introduced a novel Retrieval-Augmented Multi-task Information Extraction (RAMIE) Framework, including: 1) employed instruction fine-tuning techniques with task-specific prompts, 2) trained LLMs for multiple tasks with improved storage efficiency and lower training costs, and 3) incorporated retrieval augmentation generation (RAG) techniques by retrieving similar examples from the training set. We compared RAMIE's performance to LLMs with instruction fine-tuning alone and conducted an ablation study to assess the contributions of multi-task learning and RAG to improved multitasking performance. \textbf{Results:} With the aid of the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 (3.51\% improvement) on the NER task and demonstrated outstanding performance on the RE task with an F1 score of 93.74 (1.15\% improvement). For the TE task, Llama2-7B scored 79.45 (14.26\% improvement), and MedAlpaca-7B achieved the highest F1 score of 93.45 (0.94\% improvement) on the UC task. The ablation study revealed that while MTL increased efficiency with a slight trade-off in performance, RAG significantly boosted overall accuracy. \textbf{Conclusion:} This study presents a novel RAMIE framework that demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records. Our framework can potentially be applied to other domains.

Title: Fixing the Perspective: A Critical Examination of Zero-1-to-3

Authors: Jack Yu, Xueying Jia, Charlie Sun, Prince Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15706
Pdf URL: https://arxiv.org/pdf/2411.15706
Copy Paste: [[2411.15706]] Fixing the Perspective: A Critical Examination of Zero-1-to-3(https://arxiv.org/abs/2411.15706)
Keywords: diffusion, transformer
Abstract: Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3's theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously. Our theoretical analysis and preliminary results suggest potential improvements in novel view synthesis consistency and accuracy.

Title: Nimbus: Secure and Efficient Two-Party Inference for Transformers

Authors: Zhengyi Li, Kang Yang, Jin Tan, Wen-jie Lu, Haoqi Wu, Xiao Wang, Yu Yu, Derun Zhao, Yancheng Zheng, Minyi Guo, Jingwen Leng
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15707
Pdf URL: https://arxiv.org/pdf/2411.15707
Copy Paste: [[2411.15707]] Nimbus: Secure and Efficient Two-Party Inference for Transformers(https://arxiv.org/abs/2411.15707)
Keywords: secure, privacy, transformer
Abstract: Transformer models have gained significant attention due to their power in machine learning tasks. Their extensive deployment has raised concerns about the potential leakage of sensitive information during inference. However, when being applied to Transformers, existing approaches based on secure two-party computation (2PC) bring about efficiency limitations in two folds: (1) resource-intensive matrix multiplications in linear layers, and (2) complex non-linear activation functions like $\mathsf{GELU}$ and $\mathsf{Softmax}$. This work presents a new two-party inference framework $\mathsf{Nimbus}$ for Transformer models. For the linear layer, we propose a new 2PC paradigm along with an encoding approach to securely compute matrix multiplications based on an outer-product insight, which achieves $2.9\times \sim 12.5\times$ performance improvements compared to the state-of-the-art (SOTA) protocol. For the non-linear layer, through a new observation of utilizing the input distribution, we propose an approach of low-degree polynomial approximation for $\mathsf{GELU}$ and $\mathsf{Softmax}$, which improves the performance of the SOTA polynomial approximation by $2.9\times \sim 4.0\times$, where the average accuracy loss of our approach is 0.08\% compared to the non-2PC inference without privacy. Compared with the SOTA two-party inference, $\mathsf{Nimbus}$ improves the end-to-end performance of \bert{} inference by $2.7\times \sim 4.7\times$ across different network settings.

Title: LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

Authors: Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15708
Pdf URL: https://arxiv.org/pdf/2411.15708
Copy Paste: [[2411.15708]] LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training(https://arxiv.org/abs/2411.15708)
Keywords: transformer, large language model
Abstract: Recently, inspired by the concept of sparsity, Mixture-of-Experts (MoE) models have gained increasing popularity for scaling model size while keeping the number of activated parameters constant. In this study, we thoroughly investigate the sparsity of the dense LLaMA model by constructing MoE for both the attention (i.e., Attention MoE) and MLP (i.e., MLP MoE) modules in the transformer blocks. Specifically, we investigate different expert construction methods and granularities under the same activation conditions to analyze the impact of sparsifying the model. Additionally, to comprehensively evaluate the model's capabilities across various domains (e.g., conversation, code, math) after sparsification, we apply sparsity to the instructed large language models (LLMs) and construct instructed MoE models. To counteract the performance degradation resulting from increased sparsity, we design a two-stage post-training strategy to enhance model performance. Experiments on the LLaMA3 model demonstrate the potential effectiveness of this approach for future developments of instructed MoE models. The source codes and models are available at: \url{this https URL}.

Title: Tackling Data Heterogeneity in Federated Time Series Forecasting

Authors: Wei Yuan, Guanhua Ye, Xiangyu Zhao, Quoc Viet Hung Nguyen, Yang Cao, Hongzhi Yin
Subjects: cs.LG, cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2411.15716
Pdf URL: https://arxiv.org/pdf/2411.15716
Copy Paste: [[2411.15716]] Tackling Data Heterogeneity in Federated Time Series Forecasting(https://arxiv.org/abs/2411.15716)
Keywords: privacy, federate
Abstract: Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting. Although substantial progress has been made in time series forecasting, most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices (e.g., sensors, wearables) to a central cloud server. However, this paradigm has overloaded communication networks and raised privacy concerns. Federated learning, a popular privacy-preserving technique, enables collaborative model training across distributed data sources. However, directly applying federated learning to time series forecasting often yields suboptimal results, as time series data generated by different devices are inherently heterogeneous. In this paper, we propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers. Specifically, Fed-TREND generates two types of synthetic data. The first type of synthetic data captures the representative distribution information from clients' uploaded model updates and enhances clients' local training consensus. The second kind of synthetic data extracts long-term influence insights from global model update trajectories and is used to refine the global model after aggregation. Fed-TREND is compatible with most time series forecasting models and can be seamlessly integrated into existing federated learning frameworks to improve prediction performance. Extensive experiments on eight datasets, using several federated learning baselines and four popular time series forecasting models, demonstrate the effectiveness and generalizability of Fed-TREND.

Title: Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Authors: Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15720
Pdf URL: https://arxiv.org/pdf/2411.15720
Copy Paste: [[2411.15720]] Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks(https://arxiv.org/abs/2411.15720)
Keywords: attack, robust
Abstract: Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario, demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.

Title: OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions

Authors: Guanyu Zhou, Wenxuan Liu, Wenxin Huang, Xuemei Jia, Xian Zhong, Chia-Wen Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15729
Pdf URL: https://arxiv.org/pdf/2411.15729
Copy Paste: [[2411.15729]] OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions(https://arxiv.org/abs/2411.15729)
Keywords: robust
Abstract: The lack of occlusion data in commonly used action recognition video datasets limits model robustness and impedes sustained performance improvements. We construct OccludeNet, a large-scale occluded video dataset that includes both real-world and synthetic occlusion scene videos under various natural environments. OccludeNet features dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion, addressing existing gaps in data. Our analysis reveals that occlusion impacts action classes differently, with actions involving low scene relevance and partial body visibility experiencing greater accuracy degradation. To overcome the limitations of current occlusion-focused approaches, we propose a structural causal model for occluded scenes and introduce the Causal Action Recognition (CAR) framework, which employs backdoor adjustment and counterfactual reasoning. This framework enhances key actor information, improving model robustness to occlusion. We anticipate that the challenges posed by OccludeNet will stimulate further exploration of causal relations in occlusion scenarios and encourage a reevaluation of class correlations, ultimately promoting sustainable performance improvements. The code and full dataset will be released soon.

Title: Development of Pre-Trained Transformer-based Models for the Nepali Language

Authors: Prajwal Thapa, Jinu Nyachhyon, Mridul Sharma, Bal Krishna Bal
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15734
Pdf URL: https://arxiv.org/pdf/2411.15734
Copy Paste: [[2411.15734]] Development of Pre-Trained Transformer-based Models for the Nepali Language(https://arxiv.org/abs/2411.15734)
Keywords: transformer
Abstract: Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.

Title: AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

Authors: Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15738
Pdf URL: https://arxiv.org/pdf/2411.15738
Copy Paste: [[2411.15738]] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea(https://arxiv.org/abs/2411.15738)
Keywords: diffusion
Abstract: Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity.

Title: LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration

Authors: Gaojing Zhang, Jinglun Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15740
Pdf URL: https://arxiv.org/pdf/2411.15740
Copy Paste: [[2411.15740]] LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration(https://arxiv.org/abs/2411.15740)
Keywords: transformer
Abstract: We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces - LAB and YUV - to efficiently separate and process color information, by leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism could uniformly balance brightness across different regions while eliminating background noises, and thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.

Title: Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting

Authors: Liran Nochumsohn, Michal Moshkovitz, Orly Avner, Dotan Di Castro, Omri Azencot
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15743
Pdf URL: https://arxiv.org/pdf/2411.15743
Copy Paste: [[2411.15743]] Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting(https://arxiv.org/abs/2411.15743)
Keywords: robust
Abstract: Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question: What factors influence effective learning from data in time series forecasting? Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach, Freq-Synth, improves the robustness of both foundation as well as nonfoundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios.

Title: Integrating Deep Metric Learning with Coreset for Active Learning in 3D Segmentation

Authors: Arvind Murari Vepa, Zukang Yang, Andrew Choi, Jungseock Joo, Fabien Scalzo, Yizhou Sun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15763
Pdf URL: https://arxiv.org/pdf/2411.15763
Copy Paste: [[2411.15763]] Integrating Deep Metric Learning with Coreset for Active Learning in 3D Segmentation(https://arxiv.org/abs/2411.15763)
Keywords: segmentation
Abstract: Deep learning has seen remarkable advancements in machine learning, yet it often demands extensive annotated data. Tasks like 3D semantic segmentation impose a substantial annotation burden, especially in domains like medicine, where expert annotations drive up the cost. Active learning (AL) holds great potential to alleviate this annotation burden in 3D medical segmentation. The majority of existing AL methods, however, are not tailored to the medical domain. While weakly-supervised methods have been explored to reduce annotation burden, the fusion of AL with weak supervision remains unexplored, despite its potential to significantly reduce annotation costs. Additionally, there is little focus on slice-based AL for 3D segmentation, which can also significantly reduce costs in comparison to conventional volume-based AL. This paper introduces a novel metric learning method for Coreset to perform slice-based active learning in 3D medical segmentation. By merging contrastive learning with inherent data groupings in medical imaging, we learn a metric that emphasizes the relevant differences in samples for training 3D medical segmentation models. We perform comprehensive evaluations using both weak and full annotations across four datasets (medical and non-medical). Our findings demonstrate that our approach surpasses existing active learning techniques on both weak and full annotations and obtains superior performance with low-annotation budgets which is crucial in medical imaging. Source code for this project is available in the supplementary materials and on GitHub: this https URL.

Title: LLM Online Spatial-temporal Signal Reconstruction Under Noise

Authors: Yi Yan, Dayu Qin, Ercan Engin Kuruoglu
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.15764
Pdf URL: https://arxiv.org/pdf/2411.15764
Copy Paste: [[2411.15764]] LLM Online Spatial-temporal Signal Reconstruction Under Noise(https://arxiv.org/abs/2411.15764)
Keywords: robust, large language model
Abstract: This work introduces the LLM Online Spatial-temporal Reconstruction (LLM-OSR) framework, which integrates Graph Signal Processing (GSP) and Large Language Models (LLMs) for online spatial-temporal signal reconstruction. The LLM-OSR utilizes a GSP-based spatial-temporal signal handler to enhance graph signals and employs LLMs to predict missing values based on spatiotemporal patterns. The performance of LLM-OSR is evaluated on traffic and meteorological datasets under varying Gaussian noise levels. Experimental results demonstrate that utilizing GPT-4-o mini within the LLM-OSR is accurate and robust under Gaussian noise conditions. The limitations are discussed along with future research insights, emphasizing the potential of combining GSP techniques with LLMs for solving spatiotemporal prediction tasks.

Title: Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering

Authors: Zhicheng Zhao, Changfu Zhou, Yu Zhang, Chenglong Li, Xiaoliang Ma, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15770
Pdf URL: https://arxiv.org/pdf/2411.15770
Copy Paste: [[2411.15770]] Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering(https://arxiv.org/abs/2411.15770)
Keywords: robust
Abstract: Remote Sensing Visual Question Answering (RSVQA) has gained significant research interest. However, current RSVQA methods are limited by the imaging mechanisms of optical sensors, particularly under challenging conditions such as cloud-covered and low-light scenarios. Given the all-time and all-weather imaging capabilities of Synthetic Aperture Radar (SAR), it is crucial to investigate the integration of optical-SAR images to improve RSVQA performance. In this work, we propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet), which leverages the semantic relationships between question text and multi-source images to guide the network toward complementary fusion at the feature level. Specifically, we develop a Text-guided Coarse-to-Fine Attention Refinement (CFAR) module to focus on key areas related to the question in complex remote sensing images. This module progressively directs attention from broad areas to finer details through key region routing, enhancing the model's ability to focus on relevant regions. Furthermore, we propose an Adaptive Multi-Expert Fusion (AMEF) module that dynamically integrates different experts, enabling the adaptive fusion of optical and SAR features. In addition, we create the first large-scale benchmark dataset for evaluating optical-SAR RSVQA methods, comprising 6,008 well-aligned optical-SAR image pairs and 1,036,694 well-labeled question-answer pairs across 16 diverse question types, including complex relational reasoning questions. Extensive experiments on the proposed dataset demonstrate that our TGFNet effectively integrates complementary information between optical and SAR images, significantly improving the model's performance in challenging scenarios. The dataset is available at: this https URL. Index Terms: Remote Sensing Visual Question Answering, Multi-source Data Fusion, Multimodal, Remote Sensing, OPT-SAR.

Title: A Method for Building Large Language Models with Predefined KV Cache Capacity

Authors: Zhonghua Yi, Ge Niu, Lei Wang, Wei Tang, Liqiu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15785
Pdf URL: https://arxiv.org/pdf/2411.15785
Copy Paste: [[2411.15785]] A Method for Building Large Language Models with Predefined KV Cache Capacity(https://arxiv.org/abs/2411.15785)
Keywords: transformer, large language model
Abstract: This paper proposes a method for building large language models with predefined Key-Value (KV) cache capacity, particularly suitable for the attention layers in Transformer decode-only architectures. This method introduces fixed-length KV caches to address the issue of excessive memory consumption in traditional KV caches when handling infinite contexts. By dynamically updating the key-value vector sequences, it achieves efficient inference within limited cache capacity, significantly reducing memory usage while maintaining model performance and system throughput. Experimental results show that this method significantly reduces memory usage while maintaining the model's inference quality.

Title: Multi-Token Enhancing for Vision Representation Learning

Authors: Zhong-Yu Li, Yu-Song Hu, Bo-Wen Yin, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15787
Pdf URL: https://arxiv.org/pdf/2411.15787
Copy Paste: [[2411.15787]] Multi-Token Enhancing for Vision Representation Learning(https://arxiv.org/abs/2411.15787)
Keywords: robust
Abstract: Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of the vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning that requires large-scale datasets and long schedules. This is because they require k times more training and inference computation costs for an ensemble of k models. Differently, we introduce Multi-Token Enhancing (MTE) that extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training costs and no additional inference costs. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to their differences. Meanwhile, to address the increase in inference costs, we distill the knowledge acquired by the auxiliary tokens into a global token during pre-training. Consequently, we can discard the auxiliary tokens during inference without incurring additional costs. Our MTE is compatible with various self-supervised loss functions and architectures, consistently improving performances across different downstream tasks. Our source code will be made publicly available.

Title: Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization

Authors: Corrado Coppola, Lorenzo Papa, Irene Amerini, Laura Palagi
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2411.15795
Pdf URL: https://arxiv.org/pdf/2411.15795
Copy Paste: [[2411.15795]] Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization(https://arxiv.org/abs/2411.15795)
Keywords: transformer
Abstract: Adaptive gradient methods have been increasingly adopted by deep learning community due to their fast convergence and reduced sensitivity to hyper-parameters. However, these methods come with limitations, such as increased memory requirements for elements like moving averages and a poorly understood convergence theory. To overcome these challenges, we introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch, along with its deterministic proof of global convergence to a stationary point. To evaluate the F-CMA, we integrate it into conventional training protocols for classification tasks involving both convolutional neural networks and vision transformer models, allowing for a direct comparison with popular optimizers. Computational tests show significant improvements, including a decrease in the overall training time by up to 68%, an increase in per-epoch efficiency by up to 20%, and in model accuracy by up to 5%.

Title: Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning

Authors: Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15796
Pdf URL: https://arxiv.org/pdf/2411.15796
Copy Paste: [[2411.15796]] Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning(https://arxiv.org/abs/2411.15796)
Keywords: privacy, protect, attack, membership infer
Abstract: In this work, we systematically explore the data privacy issues of dataset pruning in machine learning systems. Our findings reveal, for the first time, that even if data in the redundant set is solely used before model training, its pruning-phase membership status can still be detected through attacks. Since this is a fully upstream process before model training, traditional model output-based privacy inference methods are completely unsuitable. To address this, we introduce a new task called Data-Centric Membership Inference and propose the first ever data-centric privacy inference paradigm named Data Lineage Inference (DaLI). Under this paradigm, four threshold-based attacks are proposed, named WhoDis, CumDis, ArraDis and SpiDis. We show that even without access to downstream models, adversaries can accurately identify the redundant set with only limited prior knowledge. Furthermore, we find that different pruning methods involve varying levels of privacy leakage, and even the same pruning method can present different privacy risks at different pruning fractions. We conducted an in-depth analysis of these phenomena and introduced a metric called the Brimming score to offer guidance for selecting pruning methods with privacy protection in mind.

Title: LoRA-Mini : Adaptation Matrices Decomposition and Selective Training

Authors: Ayush Singh, Rajdeep Aher, Shivank Garg
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15804
Pdf URL: https://arxiv.org/pdf/2411.15804
Copy Paste: [[2411.15804]] LoRA-Mini : Adaptation Matrices Decomposition and Selective Training(https://arxiv.org/abs/2411.15804)
Keywords: large language model
Abstract: The rapid advancements in large language models (LLMs) have revolutionized natural language processing, creating an increased need for efficient, task-specific fine-tuning methods. Traditional fine-tuning of LLMs involves updating a large number of parameters, which is computationally expensive and memory-intensive. Low-Rank Adaptation (LoRA) has emerged as a promising solution, enabling parameter-efficient fine-tuning by reducing the number of trainable parameters. However, while LoRA reduces the number of trainable parameters, LoRA modules still create significant storage challenges. We propose LoRA-Mini, an optimized adaptation of LoRA that improves parameter efficiency by splitting low-rank matrices into four parts, with only the two inner matrices being trainable. This approach achieves upto a 20x reduction compared to standard LoRA in the number of trainable parameters while preserving performance levels comparable to standard LoRA, addressing both computational and storage efficiency in LLM fine-tuning.

Title: LRSAA: Large-scale Remote Sensing Image Target Recognition and Automatic Annotation

Authors: Yujuan Zhu, Wuzheng Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15808
Pdf URL: https://arxiv.org/pdf/2411.15808
Copy Paste: [[2411.15808]] LRSAA: Large-scale Remote Sensing Image Target Recognition and Automatic Annotation(https://arxiv.org/abs/2411.15808)
Keywords: segmentation
Abstract: This paper presents a method for object recognition and automatic labeling in large-area remote sensing images called LRSAA. The method integrates YOLOv11 and MobileNetV3-SSD object detection algorithms through ensemble learning to enhance model performance. Furthermore, it employs Poisson disk sampling segmentation techniques and the EIOU metric to optimize the training and inference processes of segmented images, followed by the integration of results. This approach not only reduces the demand for computational resources but also achieves a good balance between accuracy and speed. The source code for this project has been made publicly available on this https URL.

Title: FastTrackTr:Towards Fast Multi-Object Tracking with Transformers

Authors: Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Wenhui Zhao, Bo Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15811
Pdf URL: https://arxiv.org/pdf/2411.15811
Copy Paste: [[2411.15811]] FastTrackTr:Towards Fast Multi-Object Tracking with Transformers(https://arxiv.org/abs/2411.15811)
Keywords: transformer
Abstract: Transformer-based multi-object tracking (MOT) methods have captured the attention of many researchers in recent years. However, these models often suffer from slow inference speeds due to their structure or other issues. To address this problem, we revisited the Joint Detection and Tracking (JDT) method by looking back at past approaches. By integrating the original JDT approach with some advanced theories, this paper employs an efficient method of information transfer between frames on the DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr. Thanks to the superiority of this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids the excessive introduction of network structures, ensuring model simplicity. Experimental results indicate that our method has the potential to achieve real-time tracking and exhibits competitive tracking accuracy across multiple datasets.

Title: Efficient and Private: Memorisation under differentially private parameter-efficient fine-tuning in language models

Authors: Olivia Ma, Jonathan Passerat-Palmbach, Dmitrii Usynin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15831
Pdf URL: https://arxiv.org/pdf/2411.15831
Copy Paste: [[2411.15831]] Efficient and Private: Memorisation under differentially private parameter-efficient fine-tuning in language models(https://arxiv.org/abs/2411.15831)
Keywords: privacy, large language model
Abstract: Fine-tuning large language models (LLMs) for specific tasks introduces privacy risks, as models may inadvertently memorise and leak sensitive training data. While Differential Privacy (DP) offers a solution to mitigate these risks, it introduces significant computational and performance trade-offs, particularly with standard fine-tuning approaches. Previous work has primarily focused on full-parameter updates, which are computationally intensive and may not fully leverage DPs potential in large models. In this work, we address these shortcomings by investigating Parameter-Efficient Fine-Tuning (PEFT) methods under DP constraints. We show that PEFT methods achieve comparable performance to standard fine-tuning while requiring fewer parameters and significantly reducing privacy leakage. Furthermore, we incorporate a data poisoning experiment involving intentional mislabelling to assess model memorisation and directly measure privacy risks. Our findings indicate that PEFT methods not only provide a promising alternative but also serve as a complementary approach for privacy-preserving, resource-efficient fine-tuning of LLMs.

Title: Modality Alignment Meets Federated Broadcasting

Authors: Yuting Ma, Shengeng Tang, Xiaohua Xu, Lechao Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15837
Pdf URL: https://arxiv.org/pdf/2411.15837
Copy Paste: [[2411.15837]] Modality Alignment Meets Federated Broadcasting(https://arxiv.org/abs/2411.15837)
Keywords: privacy, robust, federate
Abstract: Federated learning (FL) has emerged as a powerful approach to safeguard data privacy by training models across distributed edge devices without centralizing local data. Despite advancements in homogeneous data scenarios, maintaining performance between the global and local clients in FL over heterogeneous data remains challenging due to data distribution variations that degrade model convergence and increase computational costs. This paper introduces a novel FL framework leveraging modality alignment, where a text encoder resides on the server, and image encoders operate on local devices. Inspired by multi-modal learning paradigms like CLIP, this design aligns cross-client learning by treating server-client communications akin to multi-modal broadcasting. We initialize with a pre-trained model to mitigate overfitting, updating select parameters through low-rank adaptation (LoRA) to meet computational demand and performance efficiency. Local models train independently and communicate updates to the server, which aggregates parameters via a query-based method, facilitating cross-client knowledge sharing and performance improvement under extreme heterogeneity. Extensive experiments on benchmark datasets demonstrate the efficacy in maintaining generalization and robustness, even in highly heterogeneous settings.

Title: Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Authors: Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Charles Ling, Boyu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15843
Pdf URL: https://arxiv.org/pdf/2411.15843
Copy Paste: [[2411.15843]] Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing(https://arxiv.org/abs/2411.15843)
Keywords: diffusion, transformer, generative
Abstract: Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the \textbf{inversion and invariance} control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.

Title: Unveiling the Superior Paradigm: A Comparative Study of Source-Free Domain Adaptation and Unsupervised Domain Adaptation

Authors: Fan Wang, Zhongyi Han, Xingbo Liu, Xin Gao, Yilong Yin
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15844
Pdf URL: https://arxiv.org/pdf/2411.15844
Copy Paste: [[2411.15844]] Unveiling the Superior Paradigm: A Comparative Study of Source-Free Domain Adaptation and Unsupervised Domain Adaptation(https://arxiv.org/abs/2411.15844)
Keywords: robust
Abstract: In domain adaptation, there are two popular paradigms: Unsupervised Domain Adaptation (UDA), which aligns distributions using source data, and Source-Free Domain Adaptation (SFDA), which leverages pre-trained source models without accessing source data. Evaluating the superiority of UDA versus SFDA is an open and timely question with significant implications for deploying adaptive algorithms in practical applications. In this study, we demonstrate through predictive coding theory and extensive experiments on multiple benchmark datasets that SFDA generally outperforms UDA in real-world scenarios. Specifically, SFDA offers advantages in time efficiency, storage requirements, targeted learning objectives, reduced risk of negative transfer, and increased robustness against overfitting. Notably, SFDA is particularly effective in mitigating negative transfer when there are substantial distribution discrepancies between source and target domains. Additionally, we introduce a novel data-model fusion scenario, where data sharing among stakeholders varies (e.g., some provide raw data while others provide only models), and reveal that traditional UDA and SFDA methods do not fully exploit their potential in this context. To address this limitation and capitalize on the strengths of SFDA, we propose a novel weight estimation method that effectively integrates available source data into multi-SFDA (MSFDA) approaches, thereby enhancing model performance within this scenario. This work provides a thorough analysis of UDA versus SFDA and advances a practical approach to model adaptation across diverse real-world environments.

Title: FedQP: Towards Accurate Federated Learning using Quadratic Programming Guided Mutation

Authors: Jiawen Weng, Zeke Xia, Ran Li, Ming Hu, Mingsong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15847
Pdf URL: https://arxiv.org/pdf/2411.15847
Copy Paste: [[2411.15847]] FedQP: Towards Accurate Federated Learning using Quadratic Programming Guided Mutation(https://arxiv.org/abs/2411.15847)
Keywords: privacy, federate
Abstract: Due to the advantages of privacy-preserving, Federated Learning (FL) is widely used in distributed machine learning systems. However, existing FL methods suffer from low-inference performance caused by data heterogeneity. Specifically, due to heterogeneous data, the optimization directions of different local models vary greatly, making it difficult for the traditional FL method to get a generalized global model that performs well on all clients. As one of the state-of-the-art FL methods, the mutation-based FL method attempts to adopt a stochastic mutation strategy to guide the model training towards a well-generalized area (i.e., flat area in the loss landscape). Specifically, mutation allows the model to shift within the solution space, providing an opportunity to escape areas with poor generalization (i.e., sharp area). However, the stochastic mutation strategy easily results in diverse optimal directions of mutated models, which limits the performance of the existing mutation-based FL method. To achieve higher performance, this paper proposes a novel mutation-based FL approach named FedQP, utilizing a quadratic programming strategy to regulate the mutation directions wisely. By biasing the model mutation towards the direction of gradient update rather than traditional random mutation, FedQP can effectively guide the model to optimize towards a well-generalized area (i.e., flat area). Experiments on multiple well-known datasets show that our quadratic programming-guided mutation strategy effectively improves the inference accuracy of the global model in various heterogeneous data scenarios.

Title: ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Authors: Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15851
Pdf URL: https://arxiv.org/pdf/2411.15851
Copy Paste: [[2411.15851]] ResCLIP: Residual Attention for Training-free Dense Vision-language Inference(https://arxiv.org/abs/2411.15851)
Keywords: segmentation
Abstract: While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention, (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at this https URL.

Title: SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Authors: Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15858
Pdf URL: https://arxiv.org/pdf/2411.15858
Copy Paste: [[2411.15858]] SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition(https://arxiv.org/abs/2411.15858)
Keywords: fair
Abstract: Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at this https URL.

Title: Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching

Authors: Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15860
Pdf URL: https://arxiv.org/pdf/2411.15860
Copy Paste: [[2411.15860]] Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching(https://arxiv.org/abs/2411.15860)
Keywords: robust, diffusion
Abstract: In this paper, we present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and necessitate extensive training data, our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by utilizing a diffusion model to generate novel-view images and conducting a two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques across both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be re leased at this https URL.

Title: PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Authors: Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15867
Pdf URL: https://arxiv.org/pdf/2411.15867
Copy Paste: [[2411.15867]] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs(https://arxiv.org/abs/2411.15867)
Keywords: diffusion
Abstract: Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at this https URL.

Title: Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Authors: Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15869
Pdf URL: https://arxiv.org/pdf/2411.15869
Copy Paste: [[2411.15869]] Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation(https://arxiv.org/abs/2411.15869)
Keywords: segmentation
Abstract: Recent advancements in pre-trained vision-language models like CLIP, have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level pre-training, CLIP struggles to capture local details, resulting in poor performance in segmentation tasks. Our analysis reveals that anomaly tokens emerge during the forward pass, drawing excessive attention from normal patch tokens, thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to produce finer-grained representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we first identify and resolve the anomaly tokens to mitigate their negative impact. Next, we enhance feature discriminability and attention correlation by leveraging the semantic consistency found in CLIP's intermediate features. Furthermore, we employ multi-level feature fusion to enrich details. Collectively, these strategies enhance CLIP's feature representation with greater granularity and coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across eight semantic segmentation datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at this https URL.

Title: An Extensive Study on D2C: Overfitting Remediation in Deep Learning Using a Decentralized Approach

Authors: Md. Saiful Bari Siddiqui, Md Mohaiminul Islam, Md. Golam Rabiul Alam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15876
Pdf URL: https://arxiv.org/pdf/2411.15876
Copy Paste: [[2411.15876]] An Extensive Study on D2C: Overfitting Remediation in Deep Learning Using a Decentralized Approach(https://arxiv.org/abs/2411.15876)
Keywords: robust
Abstract: Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, we propose Divide2Conquer (D2C), a novel technique to mitigate overfitting. D2C partitions the training data into multiple subsets and trains identical models independently on each subset. To balance model generalization and subset-specific learning, the model parameters are periodically aggregated and averaged during training. This process enables the learning of robust patterns while minimizing the influence of outliers and noise. Empirical evaluations on benchmark datasets across diverse deep-learning tasks demonstrate that D2C significantly enhances generalization performance, particularly with larger datasets. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting D2C's effectiveness both as a standalone technique and in combination with other overfitting reduction methods. We further provide a rigorous mathematical justification for D2C's underlying principles and examine its applicability across multiple domains. Finally, we explore the trade-offs associated with D2C and propose strategies to address them, offering a holistic view of its strengths and limitations. This study establishes D2C as a versatile and effective approach to combating overfitting in deep learning. Our codes are publicly available at: this https URL.

Title: ExAL: An Exploration Enhanced Adversarial Learning Algorithm

Authors: A Vinil, Aneesh Sreevallabh Chivukula, Pranav Chintareddy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15878
Pdf URL: https://arxiv.org/pdf/2411.15878
Copy Paste: [[2411.15878]] ExAL: An Exploration Enhanced Adversarial Learning Algorithm(https://arxiv.org/abs/2411.15878)
Keywords: defense, attack, robust
Abstract: Adversarial learning is critical for enhancing model robustness, aiming to defend against adversarial attacks that jeopardize machine learning systems. Traditional methods often lack efficient mechanisms to explore diverse adversarial perturbations, leading to limited model resilience. Inspired by game-theoretic principles, where adversarial dynamics are analyzed through frameworks like Nash equilibrium, exploration mechanisms in such setups allow for the discovery of diverse strategies, enhancing system robustness. However, existing adversarial learning methods often fail to incorporate structured exploration effectively, reducing their ability to improve model defense comprehensively. To address these challenges, we propose a novel Exploration-enhanced Adversarial Learning Algorithm (ExAL), leveraging the Exponentially Weighted Momentum Particle Swarm Optimizer (EMPSO) to generate optimized adversarial perturbations. ExAL integrates exploration-driven mechanisms to discover perturbations that maximize impact on the model's decision boundary while preserving structural coherence in the data. We evaluate the performance of ExAL on the MNIST Handwritten Digits and Blended Malware datasets. Experimental results demonstrate that ExAL significantly enhances model resilience to adversarial attacks by improving robustness through adversarial learning.

Title: Evaluating Large Language Models for Causal Modeling

Authors: Houssam Razouk, Leonie Benischke, Georg Niess, Roman Kern
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15888
Pdf URL: https://arxiv.org/pdf/2411.15888
Copy Paste: [[2411.15888]] Evaluating Large Language Models for Causal Modeling(https://arxiv.org/abs/2411.15888)
Keywords: large language model
Abstract: In this paper, we consider the process of transforming causal domain knowledge into a representation that aligns more closely with guidelines from causal data science. To this end, we introduce two novel tasks related to distilling causal domain knowledge into causal variables and detecting interaction entities using LLMs. We have determined that contemporary LLMs are helpful tools for conducting causal modeling tasks in collaboration with human experts, as they can provide a wider perspective. Specifically, LLMs, such as GPT-4-turbo and Llama3-70b, perform better in distilling causal domain knowledge into causal variables compared to sparse expert models, such as Mixtral-8x22b. On the contrary, sparse expert models such as Mixtral-8x22b stand out as the most effective in identifying interaction entities. Finally, we highlight the dependency between the domain where the entities are generated and the performance of the chosen LLM for causal modeling.

Title: From Laws to Motivation: Guiding Exploration through Law-Based Reasoning and Rewards

Authors: Ziyu Chen, Zhiqing Xiao, Xinbei Jiang, Junbo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15891
Pdf URL: https://arxiv.org/pdf/2411.15891
Copy Paste: [[2411.15891]] From Laws to Motivation: Guiding Exploration through Law-Based Reasoning and Rewards(https://arxiv.org/abs/2411.15891)
Keywords: large language model
Abstract: Large Language Models (LLMs) and Reinforcement Learning (RL) are two powerful approaches for building autonomous agents. However, due to limited understanding of the game environment, agents often resort to inefficient exploration and trial-and-error, struggling to develop long-term strategies or make decisions. We propose a method that extracts experience from interaction records to model the underlying laws of the game environment, using these experience as internal motivation to guide agents. These experience, expressed in language, are highly flexible and can either assist agents in reasoning directly or be transformed into rewards for guiding training. Our evaluation results in Crafter demonstrate that both RL and LLM agents benefit from these experience, leading to improved overall performance.

Title: Enhancing Symbolic Regression and Universal Physics-Informed Neural Networks with Dimensional Analysis

Authors: Lena Podina, Diba Darooneh, Joshveer Grewal, Mohammad Kohandel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15919
Pdf URL: https://arxiv.org/pdf/2411.15919
Copy Paste: [[2411.15919]] Enhancing Symbolic Regression and Universal Physics-Informed Neural Networks with Dimensional Analysis(https://arxiv.org/abs/2411.15919)
Keywords: robust, interpretability
Abstract: We present a new method for enhancing symbolic regression for differential equations via dimensional analysis, specifically Ipsen's and Buckingham pi methods. Since symbolic regression often suffers from high computational costs and overfitting, non-dimensionalizing datasets reduces the number of input variables, simplifies the search space, and ensures that derived equations are physically meaningful. As our main contribution, we integrate Ipsen's method of dimensional analysis with Universal Physics-Informed Neural Networks. We also combine dimensional analysis with the AI Feynman symbolic regression algorithm to show that dimensional analysis significantly improves the accuracy of the recovered equation. The results demonstrate that transforming data into a dimensionless form significantly decreases computation time and improves accuracy of the recovered hidden term. For algebraic equations, using the Buckingham pi theorem reduced complexity, allowing the AI Feynman model to converge faster with fewer data points and lower error rates. For differential equations, Ipsen's method was combined with Universal Physics-Informed Neural Networks (UPINNs) to identify hidden terms more effectively. These findings suggest that integrating dimensional analysis with symbolic regression can significantly lower computational costs, enhance model interpretability, and increase accuracy, providing a robust framework for automated discovery of governing equations in complex systems when data is limited.

Title: An AutoML-based approach for Network Intrusion Detection

Authors: Nana Kankam Gyimah, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Robert Akinie, Methusela Sulle, Denis Ruganuza, Benibo Izison, Arthur Mukwaya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15920
Pdf URL: https://arxiv.org/pdf/2411.15920
Copy Paste: [[2411.15920]] An AutoML-based approach for Network Intrusion Detection(https://arxiv.org/abs/2411.15920)
Keywords: security, robust
Abstract: In this paper, we present an automated machine learning (AutoML) approach for network intrusion detection, leveraging a stacked ensemble model developed using the MLJAR AutoML framework. Our methodology combines multiple machine learning algorithms, including LightGBM, CatBoost, and XGBoost, to enhance detection accuracy and robustness. By automating model selection, feature engineering, and hyperparameter tuning, our approach reduces the manual overhead typically associated with traditional machine learning methods. Extensive experimentation on the NSL-KDD dataset demonstrates that the stacked ensemble model outperforms individual models, achieving high accuracy and minimizing false positives. Our findings underscore the benefits of using AutoML for network intrusion detection, as the AutoML-driven stacked ensemble achieved the highest performance with 90\% accuracy and an 89\% F1 score, outperforming individual models like Random Forest (78\% accuracy, 78\% F1 score), XGBoost and CatBoost (both 80\% accuracy, 80\% F1 score), and LightGBM (78\% accuracy, 78\% F1 score), providing a more adaptable and efficient solution for network security applications.

Title: A Tunable Despeckling Neural Network Stabilized via Diffusion Equation

Authors: Yi Ran, Zhichang Guo, Jia Li, Yao Li, Martin Burger, Boying Wu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15921
Pdf URL: https://arxiv.org/pdf/2411.15921
Copy Paste: [[2411.15921]] A Tunable Despeckling Neural Network Stabilized via Diffusion Equation(https://arxiv.org/abs/2411.15921)
Keywords: attack, diffusion
Abstract: Multiplicative Gamma noise remove is a critical research area in the application of synthetic aperture radar (SAR) imaging, where neural networks serve as a potent tool. However, real-world data often diverges from theoretical models, exhibiting various disturbances, which makes the neural network less effective. Adversarial attacks work by finding perturbations that significantly disrupt functionality of neural networks, as the inherent instability of neural networks makes them highly susceptible. A network designed to withstand such extreme cases can more effectively mitigate general disturbances in real SAR data. In this work, the dissipative nature of diffusion equations is employed to underpin a novel approach for countering adversarial attacks and improve the resistance of real noise disturbance. We propose a tunable, regularized neural network that unrolls a denoising unit and a regularization unit into a single network for end-to-end training. In the network, the denoising unit and the regularization unit are composed of the denoising network and the simplest linear diffusion equation respectively. The regularization unit enhances network stability, allowing post-training time step adjustments to effectively mitigate the adverse impacts of adversarial attacks. The stability and convergence of our model are theoretically proven, and in the experiments, we compare our model with several state-of-the-art denoising methods on simulated images, adversarial samples, and real SAR images, yielding superior results in both quantitative and visual evaluations.

Title: Deep Learning for automated multi-scale functional field boundaries extraction using multi-date Sentinel-2 and PlanetScope imagery: Case Study of Netherlands and Pakistan

Authors: Saba Zahid, Sajid Ghuffar, Obaid-ur-Rehman, Syed Roshaan Ali Shah
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15923
Pdf URL: https://arxiv.org/pdf/2411.15923
Copy Paste: [[2411.15923]] Deep Learning for automated multi-scale functional field boundaries extraction using multi-date Sentinel-2 and PlanetScope imagery: Case Study of Netherlands and Pakistan(https://arxiv.org/abs/2411.15923)
Keywords: robust, extraction, segmentation
Abstract: This study explores the effectiveness of multi-temporal satellite imagery for better functional field boundary delineation using deep learning semantic segmentation architecture on two distinct geographical and multi-scale farming systems of Netherlands and Pakistan. Multidate images of April, August and October 2022 were acquired for PlanetScope and Sentinel-2 in sub regions of Netherlands and November 2022, February and March 2023 for selected area of Dunyapur in Pakistan. For Netherlands, Basic registration crop parcels (BRP) vector layer was used as labeled training data. while self-crafted field boundary vector data were utilized for Pakistan. Four deep learning models with UNET architecture were evaluated using different combinations of multi-date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi-date NDVI stack approach. These findings were then applied for transfer learning, using pre-trained models from the Netherlands on the selected area in Pakistan. Additionally, separate models were trained using self-crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi-date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi-scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small scale framing. The findings can be extended to multi-scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.

Title: Making Images from Images: Interleaving Denoising and Transformation

Authors: Shumeet Baluja, David Marwood, Ashwin Baluja
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2411.15925
Pdf URL: https://arxiv.org/pdf/2411.15925
Copy Paste: [[2411.15925]] Making Images from Images: Interleaving Denoising and Transformation(https://arxiv.org/abs/2411.15925)
Keywords: diffusion
Abstract: Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user definable, ranging from regularly and irregularly-shaped blocks, concentric rings, or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.

Title: Generative Context Distillation

Authors: Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15927
Pdf URL: https://arxiv.org/pdf/2411.15927
Copy Paste: [[2411.15927]] Generative Context Distillation(https://arxiv.org/abs/2411.15927)
Keywords: generative, large language model
Abstract: Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Context Distillation (GCD), a lightweight prompt internalization method that employs a joint training approach. This method not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Context Distillation enables high-performance and efficient inference without the need for explicit prompts.

Title: Segment to Recognize Robustly -- Enhancing Recognition by Image Decomposition

Authors: Klara Janouskova, Cristian Gavrus, Jiri Matas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15933
Pdf URL: https://arxiv.org/pdf/2411.15933
Copy Paste: [[2411.15933]] Segment to Recognize Robustly -- Enhancing Recognition by Image Decomposition(https://arxiv.org/abs/2411.15933)
Keywords: robust, segmentation
Abstract: In image recognition, both foreground (FG) and background (BG) play an important role; however, standard deep image recognition often leads to unintended over-reliance on the BG, limiting model robustness in real-world deployment settings. Current solutions mainly suppress the BG, sacrificing BG information for improved generalization. We propose "Segment to Recognize Robustly" (S2R^2), a novel recognition approach which decouples the FG and BG modelling and combines them in a simple, robust, and interpretable manner. S2R^2 leverages recent advances in zero-shot segmentation to isolate the FG and the BG before or during recognition. By combining FG and BG, potentially also with a standard full-image classifier, S2R^2 achieves state-of-the-art results on in-domain data while maintaining robustness to BG shifts. The results confirm that segmentation before recognition is now possible.

Title: MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

Authors: Haoyang He, Jiangning Zhang, Yuxuan Cai, Hongxu Chen, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Yunsheng Wu, Lei Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15941
Pdf URL: https://arxiv.org/pdf/2411.15941
Copy Paste: [[2411.15941]] MobileMamba: Lightweight Multi-Receptive Visual Mamba Network(https://arxiv.org/abs/2411.15941)
Keywords: extraction, transformer
Abstract: Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction(MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba(WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution(MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods which is maximum x21 faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.

Title: Understanding Machine Learning Paradigms through the Lens of Statistical Thermodynamics: A tutorial

Authors: Star (Xinxin)Liu
Subjects: cs.LG, cond-mat.mtrl-sci, math.ST, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2411.15945
Pdf URL: https://arxiv.org/pdf/2411.15945
Copy Paste: [[2411.15945]] Understanding Machine Learning Paradigms through the Lens of Statistical Thermodynamics: A tutorial(https://arxiv.org/abs/2411.15945)
Keywords: robust
Abstract: This tutorial investigates the convergence of statistical mechanics and learning theory, elucidating the potential enhancements in machine learning methodologies through the integration of foundational principles from physics. The tutorial delves into advanced techniques like entropy, free energy, and variational inference which are utilized in machine learning, illustrating their significant contributions to model efficiency and robustness. By bridging these scientific disciplines, we aspire to inspire newer methodologies in researches, demonstrating how an in-depth comprehension of physical systems' behavior can yield more effective and dependable machine learning models, particularly in contexts characterized by uncertainty.

Title: Partial Identifiability and Misspecification in Inverse Reinforcement Learning

Authors: Joar Skalse, Alessandro Abate
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15951
Pdf URL: https://arxiv.org/pdf/2411.15951
Copy Paste: [[2411.15951]] Partial Identifiability and Misspecification in Inverse Reinforcement Learning(https://arxiv.org/abs/2411.15951)
Keywords: robust
Abstract: The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. This problem is difficult, for several reasons. First of all, there are typically multiple reward functions which are compatible with a given policy; this means that the reward function is only *partially identifiable*, and that IRL contains a certain fundamental degree of ambiguity. Secondly, in order to infer $R$ from $\pi$, an IRL algorithm must have a *behavioural model* of how $\pi$ relates to $R$. However, the true relationship between human preferences and human behaviour is very complex, and practically impossible to fully capture with a simple model. This means that the behavioural model in practice will be *misspecified*, which raises the worry that it might lead to unsound inferences if applied to real-world data. In this paper, we provide a comprehensive mathematical analysis of partial identifiability and misspecification in IRL. Specifically, we fully characterise and quantify the ambiguity of the reward function for all of the behavioural models that are most common in the current IRL literature. We also provide necessary and sufficient conditions that describe precisely how the observed demonstrator policy may differ from each of the standard behavioural models before that model leads to faulty inferences about the reward function $R$. In addition to this, we introduce a cohesive framework for reasoning about partial identifiability and misspecification in IRL, together with several formal tools that can be used to easily derive the partial identifiability and misspecification robustness of new IRL models, or analyse other kinds of reward learning algorithms.

Title: Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

Authors: Enea Monzio Compagnoni, Tianlin Liu, Rustem Islamov, Frank Norbert Proske, Antonio Orvieto, Aurelien Lucchi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15958
Pdf URL: https://arxiv.org/pdf/2411.15958
Copy Paste: [[2411.15958]] Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise(https://arxiv.org/abs/2411.15958)
Keywords: robust, transformer
Abstract: Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.

Title: Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors

Authors: Soumava Paul, Prakhar Kaushik, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15966
Pdf URL: https://arxiv.org/pdf/2411.15966
Copy Paste: [[2411.15966]] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors(https://arxiv.org/abs/2411.15966)
Keywords: diffusion, generative
Abstract: In this work, we introduce a generative approach for pose-free reconstruction of $360^{\circ}$ scenes from a limited number of uncalibrated 2D images. Pose-free scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of unbounded scenes with known camera poses using diffusion priors, these methods rely on explicit camera embeddings for extrapolating unobserved regions. This reliance limits their application in pose-free settings, where view-specific data is only implicitly available. To address this, we propose an instruction-following RGBD diffusion model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We also propose a novel confidence measure for Gaussian representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent Gaussian representation. Evaluations on the MipNeRF360 dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods in complex $360^{\circ}$ scenes.

Title: DRIVE: Dual-Robustness via Information Variability and Entropic Consistency in Source-Free Unsupervised Domain Adaptation

Authors: Ruiqiang Xiao, Songning Lai, Yijun Yang, Jiemin Wu, Yutao Yue, Lei Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15976
Pdf URL: https://arxiv.org/pdf/2411.15976
Copy Paste: [[2411.15976]] DRIVE: Dual-Robustness via Information Variability and Entropic Consistency in Source-Free Unsupervised Domain Adaptation(https://arxiv.org/abs/2411.15976)
Keywords: robust
Abstract: Adapting machine learning models to new domains without labeled data, especially when source data is inaccessible, is a critical challenge in applications like medical imaging, autonomous driving, and remote sensing. This task, known as Source-Free Unsupervised Domain Adaptation (SFUDA), involves adapting a pre-trained model to a target domain using only unlabeled target data, which can lead to issues such as overfitting, underfitting, and poor generalization due to domain discrepancies and noise. Existing SFUDA methods often rely on single-model architectures, struggling with uncertainty and variability in the target domain. To address these challenges, we propose DRIVE (Dual-Robustness through Information Variability and Entropy), a novel SFUDA framework leveraging a dual-model architecture. The two models, initialized with identical weights, work in parallel to capture diverse target domain characteristics. One model is exposed to perturbations via projection gradient descent (PGD) guided by mutual information, focusing on high-uncertainty regions. We also introduce an entropy-aware pseudo-labeling strategy that adjusts label weights based on prediction uncertainty, ensuring the model focuses on reliable data while avoiding noisy regions. The adaptation process has two stages: the first aligns the models on stable features using a mutual information consistency loss, and the second dynamically adjusts the perturbation level based on the loss from the first stage, encouraging the model to explore a broader range of the target domain while preserving existing performance. This enhances generalization capabilities and robustness against interference. Evaluations on standard SFUDA benchmarks show that DRIVE consistently outperforms previous methods, delivering improved adaptation accuracy and stability across complex target domains.

Title: Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

Authors: Lifu Tu, Rui Meng, Shafiq Joty, Yingbo Zhou, Semih Yavuz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15993
Pdf URL: https://arxiv.org/pdf/2411.15993
Copy Paste: [[2411.15993]] Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown(https://arxiv.org/abs/2411.15993)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigates the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). The results indicate that even advanced models like GPT-4 and Gemini-1.5-Pro fail to achieve perfect Self-Known scores, while their Self-Unknown scores remain notably above zero, reflecting ongoing uncertainty in their self-assessments. Moreover, we find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality. Interestingly, even without significant changes in the models' self-judgment (Self-Known and Self-Unknown), the number of unsupported claims can increases, likely as an artifact of long-form generation. These findings show the limitations of current LLMs in long-form generation, and provide valuable insights for improving factuality in long-form text generation.

Title: Ensuring Fair LLM Serving Amid Diverse Applications

Authors: Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan
Subjects: cs.LG, cs.AI, cs.DC, cs.MA
Abstract URL: https://arxiv.org/abs/2411.15997
Pdf URL: https://arxiv.org/pdf/2411.15997
Copy Paste: [[2411.15997]] Ensuring Fair LLM Serving Amid Diverse Applications(https://arxiv.org/abs/2411.15997)
Keywords: fair, large language model
Abstract: In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations in token lengths across applications and multiple LLM calls, making them unsuitable for such platforms. To address the fairness challenge, this paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft. Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications. FairServe proposes application-characteristic aware request throttling coupled with a weighted service counter based scheduling technique to curb abusive behavior and ensure fairness. Our experimental results on real-world traces demonstrate FairServe's superior performance compared to the state-of-the-art method in ensuring fairness. We are actively working on deploying our system in production, expecting to benefit millions of customers world-wide.

Title: Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models

Authors: Jayanta Sadhu, Ayan Antik Khan, Noshin Nawal, Sanju Basak, Abhik Bhattacharjee, Rifat Shahriyar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15999
Pdf URL: https://arxiv.org/pdf/2411.15999
Copy Paste: [[2411.15999]] Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models(https://arxiv.org/abs/2411.15999)
Keywords: large language model
Abstract: Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. As large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent these models demonstrate ToM across diverse languages and cultural contexts. In this paper, we introduce a comprehensive study of multilingual ToM capabilities aimed at addressing this gap. Our approach includes two key components: (1) We translate existing ToM datasets into multiple languages, effectively creating a multilingual ToM dataset and (2) We enrich these translations with culturally specific elements to reflect the social and cognitive scenarios relevant to diverse populations. We conduct extensive evaluations of six state-of-the-art LLMs to measure their ToM performance across both the translated and culturally adapted datasets. The results highlight the influence of linguistic and cultural diversity on the models' ability to exhibit ToM, and questions their social reasoning capabilities. This work lays the groundwork for future research into enhancing LLMs' cross-cultural social cognition and contributes to the development of more culturally aware and socially intelligent AI systems. All our data and code are publicly available.

Title: eFedLLM: Efficient LLM Inference Based on Federated Learning

Authors: Shengwen Ding, Chenhui Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16003
Pdf URL: https://arxiv.org/pdf/2411.16003
Copy Paste: [[2411.16003]] eFedLLM: Efficient LLM Inference Based on Federated Learning(https://arxiv.org/abs/2411.16003)
Keywords: federate, transformer, large language model
Abstract: Large Language Models (LLMs) herald a transformative era in artificial intelligence (AI). However, the expansive scale of data and parameters of LLMs requires high-demand computational and memory resources, restricting their accessibility to a broader range of users and researchers. This paper introduces an effective approach that enhances the operational efficiency and affordability of LLM inference. By utilizing transformer-based federated learning (FL) with model-parallel distributed training, our model efficiently distributes the computational loads and memory requirements across a network of participants. This strategy permits users, especially those with limited resources to train state-of-the-art LLMs collaboratively. We also innovate an incentive mechanism within the FL framework, rewarding constructive contributions and filtering out malicious activities, thereby safeguarding the integrity and reliability of the training process. Concurrently, we leverage memory hierarchy strategies and Singular Value Decomposition (SVD) on weight matrices to boost computational and memory efficiencies further. Our results, derived from formulaic analyses and numerical calculations, demonstrate significant optimization of resource use and democratize access to cutting-edge LLMs, ensuring that a wide scale of users can both contribute to and benefit from these advanced models.

Title: M3: Mamba-assisted Multi-Circuit Optimization via MBRL with Effective Scheduling

Authors: Youngmin Oh, Jinje Park, Seunggeun Kim, Taejin Paik, David Pan, Bosun Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16019
Pdf URL: https://arxiv.org/pdf/2411.16019
Copy Paste: [[2411.16019]] M3: Mamba-assisted Multi-Circuit Optimization via MBRL with Effective Scheduling(https://arxiv.org/abs/2411.16019)
Keywords: transformer
Abstract: Recent advancements in reinforcement learning (RL) for analog circuit optimization have demonstrated significant potential for improving sample efficiency and generalization across diverse circuit topologies and target specifications. However, there are challenges such as high computational overhead, the need for bespoke models for each circuit. To address them, we propose M3, a novel Model-based RL (MBRL) method employing the Mamba architecture and effective scheduling. The Mamba architecture, known as a strong alternative to the transformer architecture, enables multi-circuit optimization with distinct parameters and target specifications. The effective scheduling strategy enhances sample efficiency by adjusting crucial MBRL training parameters. To the best of our knowledge, M3 is the first method for multi-circuit optimization by leveraging both the Mamba architecture and a MBRL with effective scheduling. As a result, it significantly improves sample efficiency compared to existing RL methods.

Title: TransCompressor: LLM-Powered Multimodal Data Compression for Smart Transportation

Authors: Huanqi Yang, Rucheng Wu, Weitao Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16020
Pdf URL: https://arxiv.org/pdf/2411.16020
Copy Paste: [[2411.16020]] TransCompressor: LLM-Powered Multimodal Data Compression for Smart Transportation(https://arxiv.org/abs/2411.16020)
Keywords: large language model
Abstract: The incorporation of Large Language Models (LLMs) into smart transportation systems has paved the way for improving data management and operational efficiency. This study introduces TransCompressor, a novel framework that leverages LLMs for efficient compression and decompression of multimodal transportation sensor data. TransCompressor has undergone thorough evaluation with diverse sensor data types, including barometer, speed, and altitude measurements, across various transportation modes like buses, taxis, and MTRs. Comprehensive evaluation illustrates the effectiveness of TransCompressor in reconstructing transportation sensor data at different compression ratios. The results highlight that, with well-crafted prompts, LLMs can utilize their vast knowledge base to contribute to data compression processes, enhancing data storage, analysis, and retrieval in smart transportation settings.

Title: Binary Search with Distributional Predictions

Authors: Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, Aidin Niaparast, Sergei Vassilvitskii
Subjects: cs.LG, cs.DS
Abstract URL: https://arxiv.org/abs/2411.16030
Pdf URL: https://arxiv.org/pdf/2411.16030
Copy Paste: [[2411.16030]] Binary Search with Distributional Predictions(https://arxiv.org/abs/2411.16030)
Keywords: robust
Abstract: Algorithms with (machine-learned) predictions is a powerful framework for combining traditional worst-case algorithms with modern machine learning. However, the vast majority of work in this space assumes that the prediction itself is non-probabilistic, even if it is generated by some stochastic process (such as a machine learning system). This is a poor fit for modern ML, particularly modern neural networks, which naturally generate a distribution. We initiate the study of algorithms with distributional predictions, where the prediction itself is a distribution. We focus on one of the simplest yet fundamental settings: binary search (or searching a sorted array). This setting has one of the simplest algorithms with a point prediction, but what happens if the prediction is a distribution? We show that this is a richer setting: there are simple distributions where using the classical prediction-based algorithm with any single prediction does poorly. Motivated by this, as our main result, we give an algorithm with query complexity $O(H(p) + \log \eta)$, where $H(p)$ is the entropy of the true distribution $p$ and $\eta$ is the earth mover's distance between $p$ and the predicted distribution $\hat p$. This also yields the first distributionally-robust algorithm for the classical problem of computing an optimal binary search tree given a distribution over target keys. We complement this with a lower bound showing that this query complexity is essentially optimal (up to constants), and experiments validating the practical usefulness of our algorithm.

Title: ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16044
Pdf URL: https://arxiv.org/pdf/2411.16044
Copy Paste: [[2411.16044]] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration(https://arxiv.org/abs/2411.16044)
Keywords: large language model
Abstract: An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models~(MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each children node representing a zoomed sub-patch of the parent node and the root represents the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLMs to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series base MLLMs with large margin~(e.g., LLaVA-v1.5-7B increases by 34.57\% on $V^*$ Bench and 17.88\% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at \href{this https URL}{this https URL}.

Title: ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift

Authors: Hossein Kashiani, Niloufar Alipour Talemi, Fatemeh Afghah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16049
Pdf URL: https://arxiv.org/pdf/2411.16049
Copy Paste: [[2411.16049]] ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift(https://arxiv.org/abs/2411.16049)
Keywords: robust
Abstract: Recent advancements in anomaly detection have shifted focus towards Multi-class Unified Anomaly Detection (MUAD), offering more scalable and practical alternatives compared to traditional one-class-one-model approaches. However, existing MUAD methods often suffer from inter-class interference and are highly susceptible to domain shifts, leading to substantial performance degradation in real-world applications. In this paper, we propose a novel robust prompt-driven MUAD framework, called ROADS, to address these challenges. ROADS employs a hierarchical class-aware prompt integration mechanism that dynamically encodes class-specific information into our anomaly detector to mitigate interference among anomaly classes. Additionally, ROADS incorporates a domain adapter to enhance robustness against domain shifts by learning domain-invariant representations. Extensive experiments on MVTec-AD and VISA datasets demonstrate that ROADS surpasses state-of-the-art methods in both anomaly detection and localization, with notable improvements in out-of-distribution settings.

Title: UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Authors: Guangzhao Dai, Jian Zhao, Yuantao Chen, Yusen Qin, Hao Zhao, Guosen Xie, Yazhou Yao, Xiangbo Shu, Xuelong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16053
Pdf URL: https://arxiv.org/pdf/2411.16053
Copy Paste: [[2411.16053]] UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation(https://arxiv.org/abs/2411.16053)
Keywords: robust
Abstract: Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack intuitive appearance-level information or high-level semantic complexity crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

Title: Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training

Authors: Man Yao, Xuerui Qiu, Tianxiang Hu, Jiakui Hu, Yuhong Chou, Keyu Tian, Jianxing Liao, Luziwei Leng, Bo Xu, Guoqi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16061
Pdf URL: https://arxiv.org/pdf/2411.16061
Copy Paste: [[2411.16061]] Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training(https://arxiv.org/abs/2411.16061)
Keywords: transformer, segmentation
Abstract: The ambition of brain-inspired Spiking Neural Networks (SNNs) is to become a low-power alternative to traditional Artificial Neural Networks (ANNs). This work addresses two major challenges in realizing this vision: the performance gap between SNNs and ANNs, and the high training costs of SNNs. We identify intrinsic flaws in spiking neurons caused by binary firing mechanisms and propose a Spike Firing Approximation (SFA) method using integer training and spike-driven inference. This optimizes the spike firing pattern of spiking neurons, enhancing efficient training, reducing power consumption, improving performance, enabling easier scaling, and better utilizing neuromorphic chips. We also develop an efficient spike-driven Transformer architecture and a spike-masked autoencoder to prevent performance degradation during SNN scaling. On ImageNet-1k, we achieve state-of-the-art top-1 accuracy of 78.5\%, 79.8\%, 84.0\%, and 86.2\% with models containing 10M, 19M, 83M, and 173M parameters, respectively. For instance, the 10M model outperforms the best existing SNN by 7.2\% on ImageNet, with training time acceleration and inference energy efficiency improved by 4.5$\times$ and 3.9$\times$, respectively. We validate the effectiveness and efficiency of the proposed method across various tasks, including object detection, semantic segmentation, and neuromorphic vision tasks. This work enables SNNs to match ANN performance while maintaining the low-power advantage, marking a significant step towards SNNs as a general visual backbone. Code is available at this https URL.

Title: VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Authors: Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher
Subjects: cs.LG, math.NA, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2411.16063
Pdf URL: https://arxiv.org/pdf/2411.16063
Copy Paste: [[2411.16063]] VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction(https://arxiv.org/abs/2411.16063)
Keywords: transformer
Abstract: In-Context Operator Networks (ICONs) are models that learn operators across different types of PDEs using a few-shot, in-context approach. Although they show successful generalization to various PDEs, existing methods treat each data point as a single token, and suffer from computational inefficiency when processing dense data, limiting their application in higher spatial dimensions. In this work, we propose Vision In-Context Operator Networks (VICON), incorporating a vision transformer architecture that efficiently processes 2D functions through patch-wise operations. We evaluated our method on three fluid dynamics datasets, demonstrating both superior performance (reducing scaled $L^2$ error by $40\%$ and $61.6\%$ for two benchmark datasets for compressible flows, respectively) and computational efficiency (requiring only one-third of the inference time per frame) in long-term rollout predictions compared to the current state-of-the-art sequence-to-sequence model with fixed timestep prediction: Multiple Physics Pretraining (MPP). Compared to MPP, our method preserves the benefits of in-context operator learning, enabling flexible context formation when dealing with insufficient frame counts or varying timestep values.

Title: Soft-TransFormers for Continual Learning

Authors: Haeyong Kang, Chang D. Yoo
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16073
Pdf URL: https://arxiv.org/pdf/2411.16073
Copy Paste: [[2411.16073]] Soft-TransFormers for Continual Learning(https://arxiv.org/abs/2411.16073)
Keywords: transformer
Abstract: Inspired by Well-initialized Lottery Ticket Hypothesis (WLTH), which provides suboptimal fine-tuning solutions, we propose a novel fully fine-tuned continual learning (CL) method referred to as Soft-TransFormers (Soft-TF). Soft-TF sequentially learns and selects an optimal soft-network or subnetwork for each task. During sequential training in CL, Soft-TF jointly optimizes the weights of sparse layers to obtain task-adaptive soft (real-valued) networks or subnetworks (binary masks), while keeping the well-pre-trained layer parameters frozen. In inference, the identified task-adaptive network of Soft-TF masks the parameters of the pre-trained network, mapping to an optimal solution for each task and minimizing Catastrophic Forgetting (CF) - the soft-masking preserves the knowledge of the pre-trained network. Extensive experiments on Vision Transformer (ViT) and CLIP demonstrate the effectiveness of Soft-TF, achieving state-of-the-art performance across various CL scenarios, including Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL), supported by convergence theory.

Title: Geometry Distributions

Authors: Biao Zhang, Jing Ren, Peter Wonka
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16076
Pdf URL: https://arxiv.org/pdf/2411.16076
Copy Paste: [[2411.16076]] Geometry Distributions(https://arxiv.org/abs/2411.16076)
Keywords: diffusion
Abstract: Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data representation that models geometry as distributions-a powerful representation that makes no assumptions about surface genus, connectivity, or boundary conditions. Our approach uses diffusion models with a novel network architecture to learn surface point distributions, capturing fine-grained geometric details. We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity. Additionally, we explore applications using our representation, such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering, highlighting its potential to advance 3D geometric learning.

Title: SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

Authors: Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan, Tianwei Chen, Dario Bernal, Huitian Jiao, H M Sajjad Hossain
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2411.16077
Pdf URL: https://arxiv.org/pdf/2411.16077
Copy Paste: [[2411.16077]] SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text(https://arxiv.org/abs/2411.16077)
Keywords: robust, large language model
Abstract: Large Language Model (LLM) integrations into applications like Microsoft365 suite and Google Workspace for creating/processing documents, emails, presentations, etc. has led to considerable enhancements in productivity and time savings. But as these integrations become more more complex, it is paramount to ensure that the quality of output from the LLM-integrated applications are relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation, wherein references/ground labels doesn't exist or isn't amply available, this paper introduces a novel framework called "SAGEval" which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators, in absence of references/ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms/surveys with responses in different styles like multiple choice, likert ratings, single choice questions, etc.

Title: Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models

Authors: Donggeun Ko, Dongjun Lee, Namjun Park, Wonkyeong Shim, Jaekwang Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16079
Pdf URL: https://arxiv.org/pdf/2411.16079
Copy Paste: [[2411.16079]] Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models(https://arxiv.org/abs/2411.16079)
Keywords: robust, diffusion, generative, large language model
Abstract: Neural networks struggle with image classification when biases are learned and misleads correlations, affecting their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilizes Generative Adversarial Networks (GANs) to mitigate biases. We introduce DiffuBias, a novel pipeline for text-to-image generation that enhances classifier robustness by generating bias-conflict samples, without requiring training during the generation phase. Utilizing pretrained diffusion and image captioning models, DiffuBias generates images that challenge the biases of classifiers, using the top-$K$ losses from a biased classifier ($f_B$) to create more representative data samples. This method not only debiases effectively but also boosts classifier generalization capabilities. To the best of our knowledge, DiffuBias is the first approach leveraging a stable diffusion model to generate bias-conflict samples in debiasing tasks. Our comprehensive experimental evaluations demonstrate that DiffuBias achieves state-of-the-art performance on benchmark datasets. We also conduct a comparative analysis of various generative models in terms of carbon emissions and energy consumption to highlight the significance of computational efficiency.

Title: Boosting 3D Object Generation through PBR Materials

Authors: Yitong Wang, Xudong Xu, Li Ma, Haoran Wang, Bo Dai
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16080
Pdf URL: https://arxiv.org/pdf/2411.16080
Copy Paste: [[2411.16080]] Boosting 3D Object Generation through PBR Materials(https://arxiv.org/abs/2411.16080)
Keywords: diffusion
Abstract: Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Considering only textures instead of materials makes these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. And they also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.

Title: Cautious Optimizers: Improving Training with One Line of Code

Authors: Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, cs.DM
Abstract URL: https://arxiv.org/abs/2411.16085
Pdf URL: https://arxiv.org/pdf/2411.16085
Copy Paste: [[2411.16085]] Cautious Optimizers: Improving Training with One Line of Code(https://arxiv.org/abs/2411.16085)
Keywords: transformer
Abstract: AdamW has been the default optimizer for transformer pretraining. For many years, our community searches for faster and more stable optimizers with only constraint positive outcomes. In this work, we propose a \textbf{single-line modification in Pytorch} to any momentum-based optimizer, which we rename Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing speed-up on Llama and MAE pretraining up to $1.47\times$. Code is available at this https URL

Title: AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Authors: Jili Xia, Lihuo He, Fei Gao, Kaifan Zhang, Leida Li, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16087
Pdf URL: https://arxiv.org/pdf/2411.16087
Copy Paste: [[2411.16087]] AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity(https://arxiv.org/abs/2411.16087)
Keywords: generative
Abstract: Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model's performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.

Title: Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability

Authors: Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16105
Pdf URL: https://arxiv.org/pdf/2411.16105
Copy Paste: [[2411.16105]] Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability(https://arxiv.org/abs/2411.16105)
Keywords: interpretability, large language model
Abstract: Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the models generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.

Title: LLMPirate: LLMs for Black-box Hardware IP Piracy

Authors: Vasudev Gohil, Matthew DeLorenzo, Veera Vishwa Achuta Sai Venkat Nallam, Joey See, Jeyavijayan Rajendran
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16111
Pdf URL: https://arxiv.org/pdf/2411.16111
Copy Paste: [[2411.16111]] LLMPirate: LLMs for Black-box Hardware IP Piracy(https://arxiv.org/abs/2411.16111)
Keywords: security, attack, large language model
Abstract: The rapid advancement of large language models (LLMs) has enabled the ability to effectively analyze and generate code nearly instantaneously, resulting in their widespread adoption in software development. Following this advancement, researchers and companies have begun integrating LLMs across the hardware design and verification process. However, these highly potent LLMs can also induce new attack scenarios upon security vulnerabilities across the hardware development process. One such attack vector that has not been explored is intellectual property (IP) piracy. Given that this attack can manifest as rewriting hardware designs to evade piracy detection, it is essential to thoroughly evaluate LLM capabilities in performing this task and assess the mitigation abilities of current IP piracy detection tools. Therefore, in this work, we propose LLMPirate, the first LLM-based technique able to generate pirated variations of circuit designs that successfully evade detection across multiple state-of-the-art piracy detection tools. We devise three solutions to overcome challenges related to integration of LLMs for hardware circuit designs, scalability to large circuits, and effectiveness, resulting in an end-to-end automated, efficient, and practical formulation. We perform an extensive experimental evaluation of LLMPirate using eight LLMs of varying sizes and capabilities and assess their performance in pirating various circuit designs against four state-of-the-art, widely-used piracy detection tools. Our experiments demonstrate that LLMPirate is able to consistently evade detection on 100% of tested circuits across every detection tool. Additionally, we showcase the ramifications of LLMPirate using case studies on IBEX and MOR1KX processors and a GPS module, that we successfully pirate. We envision that our work motivates and fosters the development of better IP piracy detection tools.

Title: LLM Augmentations to support Analytical Reasoning over Multiple Documents

Authors: Raquib Bin Yousuf, Nicholas Defelice, Mandar Sharma, Shengzhe Xu, Naren Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16116
Pdf URL: https://arxiv.org/pdf/2411.16116
Copy Paste: [[2411.16116]] LLM Augmentations to support Analytical Reasoning over Multiple Documents(https://arxiv.org/abs/2411.16116)
Keywords: large language model
Abstract: Building on their demonstrated ability to perform a variety of tasks, we investigate the application of large language models (LLMs) to enhance in-depth analytical reasoning within the context of intelligence analysis. Intelligence analysts typically work with massive dossiers to draw connections between seemingly unrelated entities, and uncover adversaries' plans and motives. We explore if and how LLMs can be helpful to analysts for this task and develop an architecture to augment the capabilities of an LLM with a memory module called dynamic evidence trees (DETs) to develop and track multiple investigation threads. Through extensive experiments on multiple datasets, we highlight how LLMs, as-is, are still inadequate to support intelligence analysts and offer recommendations to improve LLMs for such intricate reasoning applications.

Title: Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain

Authors: Hangyul Yoon, Doohyuk Jang, Jungeun Kim, Eunho Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16123
Pdf URL: https://arxiv.org/pdf/2411.16123
Copy Paste: [[2411.16123]] Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain(https://arxiv.org/abs/2411.16123)
Keywords: extraction
Abstract: Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a ``one-shot" framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM's essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce \textbf{Med-PerSAM}, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.

Title: DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs

Authors: Jiahui Liu, Zhenkun Cai, Zhiyong Chen, Minjie Wang
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2411.16127
Pdf URL: https://arxiv.org/pdf/2411.16127
Copy Paste: [[2411.16127]] DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs(https://arxiv.org/abs/2411.16127)
Keywords: transformer
Abstract: Attention Graph Neural Networks (AT-GNNs), such as GAT and Graph Transformer, have demonstrated superior performance compared to other GNNs. However, existing GNN systems struggle to efficiently train AT-GNNs on GPUs due to their intricate computation patterns. The execution of AT-GNN operations without kernel fusion results in heavy data movement and significant kernel launch overhead, while fixed thread scheduling in existing GNN kernel fusion strategies leads to sub-optimal performance, redundant computation and unbalanced workload. To address these challenges, we propose a dynamic kernel fusion framework, DF-GNN, for the AT-GNN family. DF-GNN introduces a dynamic bi-level thread scheduling strategy, enabling flexible adjustments to thread scheduling while retaining the benefits of shared memory within the fused kernel. DF-GNN tailors specific thread scheduling for operations in AT-GNNs and considers the performance bottleneck shift caused by the presence of super nodes. Additionally, DF-GNN is integrated with the PyTorch framework for high programmability. Evaluations across diverse GNN models and multiple datasets reveal that DF-GNN surpasses existing GNN kernel optimization works like cuGraph and dgNN, with speedups up to $7.0\times$ over the state-of-the-art non-fusion DGL sparse library. Moreover, it achieves an average speedup of $2.16\times$ in end-to-end training compared to the popular GNN computing framework DGL.

Title: CIA: Controllable Image Augmentation Framework Based on Stable Diffusion

Authors: Mohamed Benkedadra, Dany Rimez, Tiffanie Godelaine, Natarajan Chidambaram, Hamed Razavi Khosroshahi, Horacio Tellez, Matei Mancas, Benoit Macq, Sidi Ahmed Mahmoudi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16128
Pdf URL: https://arxiv.org/pdf/2411.16128
Copy Paste: [[2411.16128]] CIA: Controllable Image Augmentation Framework Based on Stable Diffusion(https://arxiv.org/abs/2411.16128)
Keywords: diffusion, segmentation
Abstract: Computer vision tasks such as object detection and segmentation rely on the availability of extensive, accurately annotated datasets. In this work, We present CIA, a modular pipeline, for (1) generating synthetic images for dataset augmentation using Stable Diffusion, (2) filtering out low quality samples using defined quality metrics, (3) forcing the existence of specific patterns in generated images using accurate prompting and ControlNet. In order to show how CIA can be used to search for an optimal augmentation pipeline of training data, we study human object detection in a data constrained scenario, using YOLOv8n on COCO and Flickr30k datasets. We have recorded significant improvement using CIA-generated images, approaching the performances obtained when doubling the amount of real images in the dataset. Our findings suggest that our modular framework can significantly enhance object detection systems, and make it possible for future research to be done on data-constrained scenarios. The framework is available at: this http URL.

Title: Context Awareness Gate For Retrieval Augmented Generation

Authors: Mohammad Hassan Heydari, Arshia Hemmat, Erfan Naman, Afsaneh Fatemi
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2411.16133
Pdf URL: https://arxiv.org/pdf/2411.16133
Copy Paste: [[2411.16133]] Context Awareness Gate For Retrieval Augmented Generation(https://arxiv.org/abs/2411.16133)
Keywords: large language model
Abstract: Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline. However, despite ongoing advancements, the critical issue of retrieving irrelevant information -- which can impair the ability of the model to utilize its internal knowledge effectively -- has received minimal attention. In this work, we investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs. To address this challenge, we propose the Context Awareness Gate (CAG) architecture, a novel mechanism that dynamically adjusts the LLMs' input prompt based on whether the user query necessitates external context retrieval. Additionally, we introduce the Vector Candidates method, a core mathematical component of CAG that is statistical, LLM-independent, and highly scalable. We further examine the distributions of relationships between contexts and questions, presenting a statistical analysis of these distributions. This analysis can be leveraged to enhance the context retrieval process in Retrieval Augmented Generation (RAG) systems.

Title: Beyond Task Vectors: Selective Task Arithmetic Based on Importance Metrics

Authors: Tian Bowen, Lai Songning, Wu Jiemin, Shuai Zhihao, Ge Shiming, Yue Yutao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16139
Pdf URL: https://arxiv.org/pdf/2411.16139
Copy Paste: [[2411.16139]] Beyond Task Vectors: Selective Task Arithmetic Based on Importance Metrics(https://arxiv.org/abs/2411.16139)
Keywords: robust
Abstract: Pretrained models have revolutionized deep learning by enabling significant performance improvements across a wide range of tasks, leveraging large-scale, pre-learned knowledge representations. However, deploying these models in real-world multi-task learning (MTL) scenarios poses substantial challenges, primarily due to high computational costs and inefficiencies in inference. Traditional approaches such as pruning, quantization, and knowledge distillation have been explored to mitigate these issues, but they often fall short in fully addressing the complexities of multi-task environments. This paper introduces \textbf{\underline{S}}elective \textbf{\underline{T}}ask \textbf{\underline{A}}rithmetic \underline{\textbf{(STA)}}, a training-free framework designed to enhance multi-task performance through task-specific parameter fusion. STA addresses three key challenges: (i) \textbf{Parameter importance diversity: } Recognizing that different tasks relie on distinct parameters, STA employs a loss-sensitive parameter importance metric derived from a first-order Taylor expansion to accurately measure the importance of parameters for each task. (ii) \textbf{Over-reliance on hyperparameter tuning: }By enhancing the sparsity of task vectors through parameter importance metrics, STA reduces the need for extensive hyperparameter tuning, thereby improving the generalization and robustness of the model. (iii) \textbf{Neglect of other abilities in task arithmetic: } Previous works have largely overlooked the potential for more precise task forgetting. STA leverages its parameter importance metric to achieve more controlled and effective task forgetting, minimizing the impact of noisy elements that can degrade model performance. Experimental results demonstrate that STA achieves superior multi-task performance across benchmarks and excellent performance in task forgetting.

Title: DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Authors: Sizai Hou, Songze Li, Duanyi Yao
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.16154
Pdf URL: https://arxiv.org/pdf/2411.16154
Copy Paste: [[2411.16154]] DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders(https://arxiv.org/abs/2411.16154)
Keywords: attack, steal
Abstract: Self-supervised learning (SSL) is pervasively exploited in training high-quality upstream encoders with a large amount of unlabeled data. However, it is found to be susceptible to backdoor attacks merely via polluting a small portion of training data. The victim encoders mismatch triggered inputs with target embeddings, e.g., match the triggered cat input to an airplane embedding, such that the downstream tasks are affected to misbehave when the trigger is activated. Emerging backdoor attacks have shown great threats in different SSL paradigms such as contrastive learning and CLIP, while few research is devoted to defending against such attacks. Besides, the existing ones fall short in detecting advanced stealthy backdoors. To address the limitations, we propose a novel detection mechanism, DeDe, which detects the activation of the backdoor mapping with the cooccurrence of victim encoder and trigger inputs. Specifically, DeDe trains a decoder for the SSL encoder on an auxiliary dataset (can be out-of-distribution or even slightly poisoned), such that for any triggered input that misleads to the target embedding, the decoder outputs an image significantly different from the input. We empirically evaluate DeDe on both contrastive learning and CLIP models against various types of backdoor attacks, and demonstrate its superior performance over SOTA detection methods in both upstream detection performance and ability of preventing backdoors in downstream tasks.

Title: Graph Adapter of EEG Foundation Models for Parameter Efficient Fine Tuning

Authors: Toyotaro Suzumura, Hiroki Kanezashi, Shotaro Akahori
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2411.16155
Pdf URL: https://arxiv.org/pdf/2411.16155
Copy Paste: [[2411.16155]] Graph Adapter of EEG Foundation Models for Parameter Efficient Fine Tuning(https://arxiv.org/abs/2411.16155)
Keywords: transformer
Abstract: In diagnosing mental diseases from electroencephalography (EEG) data, neural network models such as Transformers have been employed to capture temporal dynamics. Additionally, it is crucial to learn the spatial relationships between EEG sensors, for which Graph Neural Networks (GNNs) are commonly used. However, fine-tuning large-scale complex neural network models simultaneously to capture both temporal and spatial features increases computational costs due to the more significant number of trainable parameters. It causes the limited availability of EEG datasets for downstream tasks, making it challenging to fine-tune large models effectively. We propose EEG-GraphAdapter (EGA), a parameter-efficient fine-tuning (PEFT) approach to address these challenges. EGA is integrated into pre-trained temporal backbone models as a GNN-based module and fine-tuned itself alone while keeping the backbone model parameters frozen. This enables the acquisition of spatial representations of EEG signals for downstream tasks, significantly reducing computational overhead and data requirements. Experimental evaluations on healthcare-related downstream tasks of Major Depressive Disorder and Abnormality Detection demonstrate that our EGA improves performance by up to 16.1% in the F1-score compared with the backbone BENDR model.

Title: VideoOrion: Tokenizing Object Dynamics in Videos

Authors: Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Zongqing Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16156
Pdf URL: https://arxiv.org/pdf/2411.16156
Copy Paste: [[2411.16156]] VideoOrion: Tokenizing Object Dynamics in Videos(https://arxiv.org/abs/2411.16156)
Keywords: large language model
Abstract: We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos--the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.

Title: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Authors: Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16157
Pdf URL: https://arxiv.org/pdf/2411.16157
Copy Paste: [[2411.16157]] MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model(https://arxiv.org/abs/2411.16157)
Keywords: diffusion
Abstract: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at this https URL.

Title: MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Authors: Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2411.16158
Pdf URL: https://arxiv.org/pdf/2411.16158
Copy Paste: [[2411.16158]] MixPE: Quantization and Hardware Co-design for Efficient LLM Inference(https://arxiv.org/abs/2411.16158)
Keywords: transformer, large language model
Abstract: Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that scale and zero point are shared within each quantization group, we propose performing dequantization after per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE utilizes efficient shift\&add operations for multiplication, optimizing both computation and energy efficiency. Our experimental results demonstrate that MixPE surpasses the state-of-the-art quantization accelerators by $2.6\times$ speedup and $1.4\times$ energy reduction.

Title: Sparse patches adversarial attacks via extrapolating point-wise information

Authors: Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16162
Pdf URL: https://arxiv.org/pdf/2411.16162
Copy Paste: [[2411.16162]] Sparse patches adversarial attacks via extrapolating point-wise information(https://arxiv.org/abs/2411.16162)
Keywords: security, attack
Abstract: Sparse and patch adversarial attacks were previously shown to be applicable in realistic settings and are considered a security risk to autonomous systems. Sparse adversarial perturbations constitute a setting in which the adversarial perturbations are limited to affecting a relatively small number of points in the input. Patch adversarial attacks denote the setting where the sparse attacks are limited to a given structure, i.e., sparse patches with a given shape and number. However, previous patch adversarial attacks do not simultaneously optimize multiple patches' locations and perturbations. This work suggests a novel approach for sparse patches adversarial attacks via point-wise trimming dense adversarial perturbations. Our approach enables simultaneous optimization of multiple sparse patches' locations and perturbations for any given number and shape. Moreover, our approach is also applicable for standard sparse adversarial attacks, where we show that it significantly improves the state-of-the-art over multiple extensive settings. A reference implementation of the proposed method and the reported experiments is provided at \url{this https URL}

Title: Text-to-Image Synthesis: A Decade Survey

Authors: Nonghai Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16164
Pdf URL: https://arxiv.org/pdf/2411.16164
Copy Paste: [[2411.16164]] Text-to-Image Synthesis: A Decade Survey(https://arxiv.org/abs/2411.16164)
Keywords: diffusion, generative
Abstract: When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same. Text-to-image synthesis (T2I), which focuses on generating high-quality images from textual descriptions, has become a significant aspect of Artificial Intelligence Generated Content (AIGC) and a transformative direction in artificial intelligence research. Foundation models play a crucial role in T2I. In this survey, we review over 440 recent works on T2I. We start by briefly introducing how GANs, autoregressive models, and diffusion models have been used for image generation. Building on this foundation, we discuss the development of these models for T2I, focusing on their generative capabilities and diversity when conditioned on text. We also explore cutting-edge research on various aspects of T2I, including performance, controllability, personalized generation, safety concerns, and consistency in content and spatial relationships. Furthermore, we summarize the datasets and evaluation metrics commonly used in T2I research. Finally, we discuss the potential applications of T2I within AIGC, along with the challenges and future research opportunities in this field.

Title: BadSFL: Backdoor Attack against Scaffold Federated Learning

Authors: Xingshuo Han, Xiang Lan, Haozhao Wang, Shengmin Xu, Shen Ren, Jason Zeng, Ming Wu, Michael Heinrich, Tianwei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16167
Pdf URL: https://arxiv.org/pdf/2411.16167
Copy Paste: [[2411.16167]] BadSFL: Backdoor Attack against Scaffold Federated Learning(https://arxiv.org/abs/2411.16167)
Keywords: privacy, attack, steal, federate, generative
Abstract: Federated learning (FL) enables the training of deep learning models on distributed clients to preserve data privacy. However, this learning paradigm is vulnerable to backdoor attacks, where malicious clients can upload poisoned local models to embed backdoors into the global model, leading to attacker-desired predictions. Existing backdoor attacks mainly focus on FL with independently and identically distributed (IID) scenarios, while real-world FL training data are typically non-IID. Current strategies for non-IID backdoor attacks suffer from limitations in maintaining effectiveness and durability. To address these challenges, we propose a novel backdoor attack method, \name, specifically designed for the FL framework using the scaffold aggregation algorithm in non-IID settings. \name leverages a Generative Adversarial Network (GAN) based on the global model to complement the training set, achieving high accuracy on both backdoor and benign samples. It utilizes a specific feature as the backdoor trigger to ensure stealthiness, and exploits the Scaffold's control variate to predict the global model's convergence direction, ensuring the backdoor's persistence. Extensive experiments on three benchmark datasets demonstrate the high effectiveness, stealthiness, and durability of \name. Notably, our attack remains effective over 60 rounds in the global model and up to 3 times longer than existing baseline attacks after stopping the injection of malicious updates.

Title: Local and Global Feature Attention Fusion Network for Face Recognition

Authors: Wang Yu, Wei Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16169
Pdf URL: https://arxiv.org/pdf/2411.16169
Copy Paste: [[2411.16169]] Local and Global Feature Attention Fusion Network for Face Recognition(https://arxiv.org/abs/2411.16169)
Keywords: robust, extraction
Abstract: Recognition of low-quality face images remains a challenge due to invisible or deformation in partial facial regions. For low-quality images dominated by missing partial facial regions, local region similarity contributes more to face recognition (FR). Conversely, in cases dominated by local face deformation, excessive attention to local regions may lead to misjudgments, while global features exhibit better robustness. However, most of the existing FR methods neglect the bias in feature quality of low-quality images introduced by different factors. To address this issue, we propose a Local and Global Feature Attention Fusion (LGAF) network based on feature quality. The network adaptively allocates attention between local and global features according to feature quality and obtains more discriminative and high-quality face features through local and global information complementarity. In addition, to effectively obtain fine-grained information at various scales and increase the separability of facial features in high-dimensional space, we introduce a Multi-Head Multi-Scale Local Feature Extraction (MHMS) module. Experimental results demonstrate that the LGAF achieves the best average performance on $4$ validation sets (CFP-FP, CPLFW, AgeDB, and CALFW), and the performance on TinyFace and SCFace outperforms the state-of-the-art methods (SoTA).

Title: CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction

Authors: Yuan Zhou, Qingshan Xu, Jiequan Cui, Junbao Zhou, Jing Zhang, Richang Hong, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16170
Pdf URL: https://arxiv.org/pdf/2411.16170
Copy Paste: [[2411.16170]] CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction(https://arxiv.org/abs/2411.16170)
Keywords: transformer
Abstract: Recently, large efforts have been made to design efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable to be deployed in resource-constrained mobile devices, due to suffering from either few efficiency gains or significant accuracy drops. In this paper, we propose a new de\textbf{C}oupled du\textbf{A}l-interactive linea\textbf{R} att\textbf{E}ntion (CARE) mechanism, revealing that features' decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while effectively enhancing the efficiency of models. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to effectively facilitate interaction between local inductive bias and long-range information as well as among features at different layers. By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy. Extensive experiments on ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving $78.4/82.1\%$ top-1 accuracy on ImagegNet-1K at the cost of only $0.7/1.9$ GMACs. Codes will be released on \href{..}{github}.

Title: Image Generation Diversity Issues and How to Tame Them

Authors: Mischa Dombrowski, Weitong Zhang, Sarah Cechnicka, Hadrien Reynaud, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16171
Pdf URL: https://arxiv.org/pdf/2411.16171
Copy Paste: [[2411.16171]] Image Generation Diversity Issues and How to Tame Them(https://arxiv.org/abs/2411.16171)
Keywords: extraction, diffusion, generative
Abstract: Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models superpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality by using a diversity aware module that uses pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models this https URL.

Title: U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields

Authors: Vinayak Gupta, Manoj S, Mukund Varma T, Kaushik Mitra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16172
Pdf URL: https://arxiv.org/pdf/2411.16172
Copy Paste: [[2411.16172]] U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields(https://arxiv.org/abs/2411.16172)
Keywords: transformer
Abstract: Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, scattering and restoring these images has warranted much attention. In this work, we present Unsupervised Underwater Neural Radiance Field U2NeRF, a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities onto the NeRF pipeline and disentangle the predicted color into several components - scene radiance, direct transmission map, backscatter transmission map, and global background light, and when combined reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis UVS dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much LPIPS 11%, UIQM 5%, UCIQE 4% (on average) and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.

Title: SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

Authors: Junho Kim, Hyunjun Kim, Hosu Lee, Yong Man Ro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16173
Pdf URL: https://arxiv.org/pdf/2411.16173
Copy Paste: [[2411.16173]] SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis(https://arxiv.org/abs/2411.16173)
Keywords: robust
Abstract: Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through targeted retrieval process. We address two main challenges to achieve it: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating dynamic routing mechanism and spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, showing significant capability to maintain contextual integrity across extended sequences.

Title: Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

Authors: Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16183
Pdf URL: https://arxiv.org/pdf/2411.16183
Copy Paste: [[2411.16183]] Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking(https://arxiv.org/abs/2411.16183)
Keywords: robust, segmentation
Abstract: Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.

Title: Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation

Authors: Qiao Yu, Xianzhi Li, Yuan Tang, Xu Han, Long Hu, Yixue Hao, Min Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16185
Pdf URL: https://arxiv.org/pdf/2411.16185
Copy Paste: [[2411.16185]] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation(https://arxiv.org/abs/2411.16185)
Keywords: diffusion
Abstract: Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.

Title: Learn from Foundation Model: Fruit Detection Model without Manual Annotation

Authors: Yanan Wang, Zhenghao Fei, Ruichen Li, Yibin Ying
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16196
Pdf URL: https://arxiv.org/pdf/2411.16196
Copy Paste: [[2411.16196]] Learn from Foundation Model: Fruit Detection Model without Manual Annotation(https://arxiv.org/abs/2411.16196)
Keywords: segmentation
Abstract: Recent breakthroughs in large foundation models have enabled the possibility of transferring knowledge pre-trained on vast datasets to domains with limited data availability. Agriculture is one of the domains that lacks sufficient data. This study proposes a framework to train effective, domain-specific, small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation-Description-Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language-Image Pretraining) for zero-shot open-vocabulary classification. In the second stage, a novel knowledge distillation mechanism is utilized to distill compact, edge-deployable models from SDM, enhancing both inference speed and perception accuracy. The complete method, termed SDM-D (Segmentation-Description-Matching-Distilling), demonstrates strong performance across various fruit detection tasks object detection, semantic segmentation, and instance segmentation) without manual annotation. It nearly matches the performance of models trained with abundant labels. Notably, SDM-D outperforms open-set detection methods such as Grounding SAM and YOLO-World on all tested fruit detection datasets. Additionally, we introduce MegaFruits, a comprehensive fruit segmentation dataset encompassing over 25,000 images, and all code and datasets are made publicly available at this https URL.

Title: Interpreting Object-level Foundation Models via Visual Precision Search

Authors: Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Maosen Li, Zheng Huang, Hua Zhang, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16198
Pdf URL: https://arxiv.org/pdf/2411.16198
Copy Paste: [[2411.16198]] Interpreting Object-level Foundation Models via Visual Precision Search(https://arxiv.org/abs/2411.16198)
Keywords: interpretability
Abstract: Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models\' decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7\%, 31.6\%, and 20.1\% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9\% and 66.9\% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics. The code will be released at \url{this https URL}.

Title: VIRES: Video Instance Repainting with Sketch and Text Guidance

Authors: Shuchen Weng, Haojie Zheng, Peixuan Zhan, Yuchen Hong, Han Jiang, Si Li, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16199
Pdf URL: https://arxiv.org/pdf/2411.16199
Copy Paste: [[2411.16199]] VIRES: Video Instance Repainting with Sketch and Text Guidance(https://arxiv.org/abs/2411.16199)
Keywords: diffusion, transformer, generative
Abstract: We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.

Title: Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models

Authors: Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16201
Pdf URL: https://arxiv.org/pdf/2411.16201
Copy Paste: [[2411.16201]] Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models(https://arxiv.org/abs/2411.16201)
Keywords: large language model
Abstract: High-quality video-text preference data is crucial for Multimodal Large Language Models (MLLMs) alignment. However, existing preference data is very scarce. Obtaining VQA preference data for preference training is costly, and manually annotating responses is highly unreliable, which could result in low-quality pairs. Meanwhile, AI-generated responses controlled by temperature adjustment lack diversity. To address these issues, we propose a high-quality VQA preference dataset, called \textit{\textbf{M}ultiple \textbf{M}ultimodal \textbf{A}rtificial \textbf{I}ntelligence \textbf{P}reference Datasets in \textbf{V}QA} (\textbf{MMAIP-V}), which is constructed by sampling from the response distribution set and using an external scoring function for response evaluation. Furthermore, to fully leverage the preference knowledge in MMAIP-V and ensure sufficient optimization, we propose \textit{\textbf{Iter}ative \textbf{W}eak-to-\textbf{S}trong \textbf{R}einforcement \textbf{L}earning from \textbf{AI} \textbf{F}eedback for video MLLMs} (\textbf{Iter-W2S-RLAIF}), a framework that gradually enhances MLLMs' alignment capabilities by iteratively updating the reference model and performing parameter extrapolation. Finally, we propose an unbiased and information-complete evaluation scheme in VQA evaluation. Experiments demonstrate that MMAIP-V is beneficial for MLLMs in preference learning and Iter-W2S-RLAIF fully exploits the alignment information in MMAIP-V. We believe that the proposed automatic VQA preference data generation pipeline based on AI feedback can greatly promote future work in the MLLMs alignment. \textbf{Code and dataset are available} \href{this https URL}{MMAIP-V\_Iter-W2S-RLAIF-702F}.

Title: MH-MoE:Multi-Head Mixture-of-Experts

Authors: Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16205
Pdf URL: https://arxiv.org/pdf/2411.16205
Copy Paste: [[2411.16205]] MH-MoE:Multi-Head Mixture-of-Experts(https://arxiv.org/abs/2411.16205)
Keywords: large language model
Abstract: Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.

Title: Can Encrypted Images Still Train Neural Networks? Investigating Image Information and Random Vortex Transformation

Authors: XiaoKai Cao, WenJin Mo, ChangDong Wang, JianHuang Lai, Qiong Huang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.16207
Pdf URL: https://arxiv.org/pdf/2411.16207
Copy Paste: [[2411.16207]] Can Encrypted Images Still Train Neural Networks? Investigating Image Information and Random Vortex Transformation(https://arxiv.org/abs/2411.16207)
Keywords: security, transformer
Abstract: Vision is one of the essential sources through which humans acquire information. In this paper, we establish a novel framework for measuring image information content to evaluate the variation in information content during image transformations. Within this framework, we design a nonlinear function to calculate the neighboring information content of pixels at different distances, and then use this information to measure the overall information content of the image. Hence, we define a function to represent the variation in information content during image transformations. Additionally, we utilize this framework to prove the conclusion that swapping the positions of any two pixels reduces the image's information content. Furthermore, based on the aforementioned framework, we propose a novel image encryption algorithm called Random Vortex Transformation. This algorithm encrypts the image using random functions while preserving the neighboring information of the pixels. The encrypted images are difficult for the human eye to distinguish, yet they allow for direct training of the encrypted images using machine learning methods. Experimental verification demonstrates that training on the encrypted dataset using ResNet and Vision Transformers only results in a decrease in accuracy ranging from 0.3\% to 6.5\% compared to the original data, while ensuring the security of the data. Furthermore, there is a positive correlation between the rate of information loss in the images and the rate of accuracy loss, further supporting the validity of the proposed image information content measurement framework.

Title: SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

Authors: Jungang Li, Sicheng Tao, Yibo Yan, Xiaojie Gu, Haodong Xu, Xu Zheng, Yuanhuiyi Lyu, Linfeng Zhang, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16213
Pdf URL: https://arxiv.org/pdf/2411.16213
Copy Paste: [[2411.16213]] SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context(https://arxiv.org/abs/2411.16213)
Keywords: large language model
Abstract: Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video-LLMs still face challenges in effectively integrating the rich and diverse audio-visual information inherent in long videos, which is crucial for comprehensive understanding. This raises the question: how can we leverage embedded audio-visual information to enhance long video understanding? Therefore, (i) we introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions. (ii) From the model perspective, we propose a time-aware Audio-Visual Large Language Model (AV-LLM), SAVEnVideo, fine-tuned on SAVEn-Vid. (iii) Besides, we present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long video, challenging their ability to handle intricate audio-visual interactions. Experiments on AVBench reveal the limitations of current AV-LLMs. Experiments also demonstrate that SAVEnVideo outperforms the best Video-LLM by 3.61% on the zero-shot long video task (Video-MME) and surpasses the leading audio-visual LLM by 1.29% on the zero-shot audio-visual task (Music-AVQA). Consequently, at the 7B parameter scale, SAVEnVideo can achieve state-of-the-art performance. Our dataset and code will be released at this https URL upon acceptance.

Title: SMGDiff: Soccer Motion Generation using diffusion probabilistic models

Authors: Hongdi Yang, Chengyang Li, Zhenxuan Wu, Gaozheng Li, Jingya Wang, Jingyi Yu, Zhuo Su, Lan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16216
Pdf URL: https://arxiv.org/pdf/2411.16216
Copy Paste: [[2411.16216]] SMGDiff: Soccer Motion Generation using diffusion probabilistic models(https://arxiv.org/abs/2411.16216)
Keywords: diffusion, transformer, generative
Abstract: Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the human player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion. In the first stage, we instantly transform coarse user controls into diverse global trajectories of the character. In the second stage, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. We further incorporate a contact guidance module during inference to optimize the contact details for realistic ball-foot interactions. Moreover, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.

Title: Weakly supervised image segmentation for defect-based grading of fresh produce

Authors: Manuel Knott, Divinefavour Odion, Sameer Sontakke, Anup Karwa, Thijs Defraeye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16219
Pdf URL: https://arxiv.org/pdf/2411.16219
Copy Paste: [[2411.16219]] Weakly supervised image segmentation for defect-based grading of fresh produce(https://arxiv.org/abs/2411.16219)
Keywords: segmentation
Abstract: Implementing image-based machine learning in agriculture is often limited by scarce data and annotations, making it hard to achieve high-quality model predictions. This study tackles the issue of postharvest quality assessment of bananas in decentralized supply chains. We propose a method to detect and segment surface defects in banana images using panoptic segmentation to quantify defect size and number. Instead of time-consuming pixel-level annotations, we use weak supervision with coarse labels. A dataset of 476 smartphone images of bananas was collected under real-world field conditions and annotated for bruises and scars. Using the Segment Anything Model (SAM), a recently published foundation model for image segmentation, we generated dense annotations from coarse bounding boxes to train a segmentation model, significantly reducing manual effort while achieving a panoptic quality score of 77.6%. This demonstrates SAM's potential for low-effort, accurate segmentation in agricultural settings with limited data.

Title: DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings

Authors: Hong Liu, Yitong Lu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16236
Pdf URL: https://arxiv.org/pdf/2411.16236
Copy Paste: [[2411.16236]] DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings(https://arxiv.org/abs/2411.16236)
Keywords: robust
Abstract: This paper presents a novel method to improve the robustness of foundation models to group-based biases. We propose a simple yet effective method, called DoubleCCA, that leverages random sentences and Canonical Correlation Analysis (CCA) to enrich the text embeddings of the foundation model. First, we generate various random sentences that augment the original prompts, which extends the original prompts with random words or character sequences. Second, we use an additional sentence embedding model to generate different text embeddings with respect to these random sentences. We then use CCA double twice to align the representations and reconstruct them back to the original representation space. We demonstrate the effectiveness of our method on a variety of tasks and datasets, showing that it outperforms existing methods in terms of both performance and robustness. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of foundation models to group-based biases.

Title: CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity

Authors: Zhengmin Yu, Jiutian Zeng, Siyi Chen, Wenhan Xu, Dandan Xu, Xiangyu Liu, Zonghao Ying, Nan Wang, Yuan Zhang, Min Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.16239
Pdf URL: https://arxiv.org/pdf/2411.16239
Copy Paste: [[2411.16239]] CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity(https://arxiv.org/abs/2411.16239)
Keywords: security, large language model
Abstract: Over the past year, there has been a notable rise in the use of large language models (LLMs) for academic research and industrial practices within the cybersecurity field. However, it remains a lack of comprehensive and publicly accessible benchmarks to evaluate the performance of LLMs on cybersecurity tasks. To address this gap, we introduce CS-Eval, a publicly accessible, comprehensive and bilingual LLM benchmark specifically designed for cybersecurity. CS-Eval synthesizes the research hotspots from academia and practical applications from industry, curating a diverse set of high-quality questions across 42 categories within cybersecurity, systematically organized into three cognitive levels: knowledge, ability, and application. Through an extensive evaluation of a wide range of LLMs using CS-Eval, we have uncovered valuable insights. For instance, while GPT-4 generally excels overall, other models may outperform it in certain specific subcategories. Additionally, by conducting evaluations over several months, we observed significant improvements in many LLMs' abilities to solve cybersecurity tasks. The benchmarks are now publicly available at this https URL.

Title: Transparent Neighborhood Approximation for Text Classifier Explanation

Authors: Yi Cai, Arthur Zimek, Eirini Ntoutsi, Gerhard Wunder
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16251
Pdf URL: https://arxiv.org/pdf/2411.16251
Copy Paste: [[2411.16251]] Transparent Neighborhood Approximation for Text Classifier Explanation(https://arxiv.org/abs/2411.16251)
Keywords: explainability, generative
Abstract: Recent literature highlights the critical role of neighborhood construction in deriving model-agnostic explanations, with a growing trend toward deploying generative models to improve synthetic instance quality, especially for explaining text classifiers. These approaches overcome the challenges in neighborhood construction posed by the unstructured nature of texts, thereby improving the quality of explanations. However, the deployed generators are usually implemented via neural networks and lack inherent explainability, sparking arguments over the transparency of the explanation process itself. To address this limitation while preserving neighborhood quality, this paper introduces a probability-based editing method as an alternative to black-box text generators. This approach generates neighboring texts by implementing manipulations based on in-text contexts. Substituting the generator-based construction process with recursive probability-based editing, the resultant explanation method, XPROB (explainer with probability-based editing), exhibits competitive performance according to the evaluation conducted on two real-world datasets. Additionally, XPROB's fully transparent and more controllable construction process leads to superior stability compared to the generator-based explainers.

Title: NormXLogit: The Head-on-Top Never Lies

Authors: Sina Abbasi, Mohammad Reza Modarres, Mohammad Taher Pilehvar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16252
Pdf URL: https://arxiv.org/pdf/2411.16252
Copy Paste: [[2411.16252]] NormXLogit: The Head-on-Top Never Lies(https://arxiv.org/abs/2411.16252)
Keywords: interpretability, transformer, large language model
Abstract: The Transformer architecture has emerged as the dominant choice for building large language models (LLMs). However, with new LLMs emerging on a frequent basis, it is important to consider the potential value of architecture-agnostic approaches that can provide interpretability across a variety of architectures. Despite recent successes in the interpretability of LLMs, many existing approaches rely on complex methods that are often tied to a specific model design and come with a significant computational cost. To address these limitations, we propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens. This method operates based on the input and output representations associated with each token. First, we demonstrate that during the pre-training of LLMs, the norms of word embeddings capture the importance of input tokens. Second, we reveal a significant relationship between a token's importance and the extent to which its representation can resemble the model's final prediction. Through extensive analysis, we show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness. Additionally, our method achieves better performance in layer-wise explanations compared to the most prominent architecture-specific methods.

Title: Open-Vocabulary Octree-Graph for 3D Scene Understanding

Authors: Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Bin Zhao, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16253
Pdf URL: https://arxiv.org/pdf/2411.16253
Copy Paste: [[2411.16253]] Open-Vocabulary Octree-Graph for 3D Scene Understanding(https://arxiv.org/abs/2411.16253)
Keywords: segmentation
Abstract: Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method.

Title: Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures

Authors: Fu-Chieh Chang, Pei-Yuan Wu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16260
Pdf URL: https://arxiv.org/pdf/2411.16260
Copy Paste: [[2411.16260]] Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures(https://arxiv.org/abs/2411.16260)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable mathematical capabilities, largely driven by chain-of-thought (CoT) prompting, which decomposes complex reasoning into step-by-step solutions. This approach has enabled significant advancements, as evidenced by performance on benchmarks like GSM8K and MATH. However, the mechanisms underlying LLMs' ability to perform arithmetic in a single step of CoT remain poorly understood. Existing studies debate whether LLMs encode numerical values or rely on symbolic reasoning, while others explore attention and multi-layered processing in arithmetic tasks. In this work, we propose that LLMs learn arithmetic by capturing algebraic structures, such as \emph{Commutativity} and \emph{Identity} properties. Since these structures are observable through input-output relationships, they can generalize to unseen data. We empirically demonstrate that LLMs can learn algebraic structures using a custom dataset of arithmetic problems. Our findings indicate that leveraging algebraic structures can enhance the LLMs' arithmetic capabilities, offering insights into improving their arithmetic performance.

Title: Even Sparser Graph Transformers

Authors: Hamed Shirzad, Honghao Lin, Balaji Venkatachalam, Ameya Velingker, David Woodruff, Danica Sutherland
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16278
Pdf URL: https://arxiv.org/pdf/2411.16278
Copy Paste: [[2411.16278]] Even Sparser Graph Transformers(https://arxiv.org/abs/2411.16278)
Keywords: transformer
Abstract: Graph Transformers excel in long-range dependency modeling, but generally require quadratic memory complexity in the number of nodes in an input graph, and hence have trouble scaling to large graphs. Sparse attention variants such as Exphormer can help, but may require high-degree augmentations to the input graph for good performance, and do not attempt to sparsify an already-dense input graph. As the learned attention mechanisms tend to use few of these edges, such high-degree connections may be unnecessary. We show (empirically and with theoretical backing) that attention scores on graphs are usually quite consistent across network widths, and use this observation to propose a two-stage procedure, which we call Spexphormer: first, train a narrow network on the full augmented graph. Next, use only the active connections to train a wider network on a much sparser graph. We establish theoretical conditions when a narrow network's attention scores can match those of a wide network, and show that Spexphormer achieves good performance with drastically reduced memory requirements on various graph datasets.

Title: Utilizing Uncertainty in 2D Pose Detectors for Probabilistic 3D Human Mesh Recovery

Authors: Tom Wehrbein, Marco Rudolph, Bodo Rosenhahn, Bastian Wandt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16289
Pdf URL: https://arxiv.org/pdf/2411.16289
Copy Paste: [[2411.16289]] Utilizing Uncertainty in 2D Pose Detectors for Probabilistic 3D Human Mesh Recovery(https://arxiv.org/abs/2411.16289)
Keywords: segmentation
Abstract: Monocular 3D human pose and shape estimation is an inherently ill-posed problem due to depth ambiguities, occlusions, and truncations. Recent probabilistic approaches learn a distribution over plausible 3D human meshes by maximizing the likelihood of the ground-truth pose given an image. We show that this objective function alone is not sufficient to best capture the full distributions. Instead, we propose to additionally supervise the learned distributions by minimizing the distance to distributions encoded in heatmaps of a 2D pose detector. Moreover, we reveal that current methods often generate incorrect hypotheses for invisible joints which is not detected by the evaluation protocols. We demonstrate that person segmentation masks can be utilized during training to significantly decrease the number of invalid samples and introduce two metrics to evaluate it. Our normalizing flow-based approach predicts plausible 3D human mesh hypotheses that are consistent with the image evidence while maintaining high diversity for ambiguous body parts. Experiments on 3DPW and EMDB show that we outperform other state-of-the-art probabilistic methods. Code is available for research purposes at this https URL.

Title: A Performance Increment Strategy for Semantic Segmentation of Low-Resolution Images from Damaged Roads

Authors: Rafael S. Toledo, Cristiano S. Oliveira, Vitor H. T. Oliveira, Eric A. Antonelo, Aldo von Wangenheim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16295
Pdf URL: https://arxiv.org/pdf/2411.16295
Copy Paste: [[2411.16295]] A Performance Increment Strategy for Semantic Segmentation of Low-Resolution Images from Damaged Roads(https://arxiv.org/abs/2411.16295)
Keywords: segmentation
Abstract: Autonomous driving needs good roads, but 85% of Brazilian roads have damages that deep learning models may not regard as most semantic segmentation datasets for autonomous driving are high-resolution images of well-maintained urban roads. A representative dataset for emerging countries consists of low-resolution images of poorly maintained roads and includes labels of damage classes; in this scenario, three challenges arise: objects with few pixels, objects with undefined shapes, and highly underrepresented classes. To tackle these challenges, this work proposes the Performance Increment Strategy for Semantic Segmentation (PISSS) as a methodology of 14 training experiments to boost performance. With PISSS, we reached state-of-the-art results of 79.8 and 68.8 mIoU on the Road Traversing Knowledge (RTK) and Technik Autonomer Systeme 500 (TAS500) test sets, respectively. Furthermore, we also offer an analysis of DeepLabV3+ pitfalls for small object segmentation.

Title: Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression

Authors: Six Valentin, Chidiac Alexandre, Worlikar Arkin
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16298
Pdf URL: https://arxiv.org/pdf/2411.16298
Copy Paste: [[2411.16298]] Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression(https://arxiv.org/abs/2411.16298)
Keywords: robust
Abstract: This document is a replication of the original "Rank-N-Contrast" (arXiv:2210.01189v2) paper published in 2023. This evaluation is done for academic purposes. Deep regression models often fail to capture the continuous nature of sample orders, creating fragmented representations and suboptimal performance. To address this, we reproduced the Rank-N-Contrast (RNC) framework, which learns continuous representations by contrasting samples by their rankings in the target space. Our study validates RNC's theoretical and empirical benefits, including improved performance and robustness. We extended the evaluation to an additional regression dataset and conducted robustness tests using a holdout method, where a specific range of continuous data was excluded from the training set. This approach assessed the model's ability to generalise to unseen data and achieve state-of-the-art performance. This replication study validates the original findings and broadens the understanding of RNC's applicability and robustness.

Title: BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Authors: Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16300
Pdf URL: https://arxiv.org/pdf/2411.16300
Copy Paste: [[2411.16300]] BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment(https://arxiv.org/abs/2411.16300)
Keywords: generative, large language model
Abstract: Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhance the multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages and performed instruction tuning based on the dataset to facilitate the capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. For multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating its capability of effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in highresource languages while enhancing the performance in low-resource languages. Demo, homepage, code and models of BayLing are available.

Title: DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation

Authors: Yuxuan Yang, Jingyao Wang, Tao Geng, Wenwen Qiang, Changwen Zheng, Fuchun Sun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16301
Pdf URL: https://arxiv.org/pdf/2411.16301
Copy Paste: [[2411.16301]] DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation(https://arxiv.org/abs/2411.16301)
Keywords: robust, diffusion, generative
Abstract: Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.

Title: Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization

Authors: Dun Zeng, Zheshun Wu, Shiyu Liu, Yu Pan, Xiaoying Tang, Zenglin Xu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16303
Pdf URL: https://arxiv.org/pdf/2411.16303
Copy Paste: [[2411.16303]] Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization(https://arxiv.org/abs/2411.16303)
Keywords: federate
Abstract: Federated Learning (FL) is a distributed learning approach that trains neural networks across multiple devices while keeping their local data private. However, FL often faces challenges due to data heterogeneity, leading to inconsistent local optima among clients. These inconsistencies can cause unfavorable convergence behavior and generalization performance degradation. Existing studies mainly describe this issue through \textit{convergence analysis}, focusing on how well a model fits training data, or through \textit{algorithmic stability}, which examines the generalization gap. However, neither approach precisely captures the generalization performance of FL algorithms, especially for neural networks. In this paper, we introduce the first generalization dynamics analysis framework in federated optimization, highlighting the trade-offs between model stability and optimization. Through this framework, we show how the generalization of FL algorithms is affected by the interplay of algorithmic stability and optimization. This framework applies to standard federated optimization and its advanced versions, like server momentum. We find that fast convergence from large local steps or accelerated momentum enlarges stability but obtains better generalization performance. Our insights into these trade-offs can guide the practice of future algorithms for better generalization.

Title: An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models

Authors: Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, Liang Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16308
Pdf URL: https://arxiv.org/pdf/2411.16308
Copy Paste: [[2411.16308]] An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models(https://arxiv.org/abs/2411.16308)
Keywords: robust, diffusion, segmentation
Abstract: Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic \textbf{Seg}mentation \textbf{Net}work based on a \textbf{C}onditional-Noise Framework (CNF) of D\textbf{D}PMs, named \textbf{CDSegNet}. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.

Title: Functionality understanding and segmentation in 3D scenes

Authors: Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, Fabio Poiesi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16310
Pdf URL: https://arxiv.org/pdf/2411.16310
Copy Paste: [[2411.16310]] Functionality understanding and segmentation in 3D scenes(https://arxiv.org/abs/2411.16310)
Keywords: segmentation
Abstract: Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like 'turn on the ceiling light', an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches. Code will be released publicly.

Title: Monocular Lane Detection Based on Deep Learning: A Survey

Authors: Xin He, Haiyun Guo, Kuan Zhu, Bingke Zhu, Xu Zhao, Jianwu Fang, Jinqiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16316
Pdf URL: https://arxiv.org/pdf/2411.16316
Copy Paste: [[2411.16316]] Monocular Lane Detection Based on Deep Learning: A Survey(https://arxiv.org/abs/2411.16316)
Keywords: fair
Abstract: Lane detection plays an important role in autonomous driving perception system. As deep learning algorithms gain popularity, monocular lane detection methods based on deep learning have demonstrated superior performance and emerged as a key research direction in autonomous driving perception. The core design of these algorithmic frameworks can be summarized as follows: (1) Task paradigm, focusing on lane instance-level discrimination; (2) Lane modeling, representing lanes as a set of learnable parameters in the neural network; (3) Global context supplementation, enhancing the detection of obscured lanes; (4) Perspective effect elimination, providing 3D lanes usable for downstream applications. From these perspectives, this paper presents a comprehensive overview of existing methods, encompassing both the increasingly mature 2D lane detection approaches and the developing 3D lane detection works. For a relatively fair comparison, in addition to comparing the performance of mainstream methods on different benchmarks, their inference speed is also investigated under a unified setting. Moreover, we present some extended works on lane detection, including multi-task perception, video lane detection, online high-definition (HD) map construction, and lane topology reasoning, to offer readers a comprehensive roadmap for the evolution of lane detection. Finally, we point out some potential future research directions in this field. We exhaustively collect the papers and codes of existing works at this https URL and will keep tracing the research.

Title: One Diffusion to Generate Them All

Authors: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16318
Pdf URL: https://arxiv.org/pdf/2411.16318
Copy Paste: [[2411.16318]] One Diffusion to Generate Them All(https://arxiv.org/abs/2411.16318)
Keywords: diffusion, segmentation
Abstract: We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction such as text-to-image, multiview generation, ID preservation, depth estimation and camera pose estimation despite relatively small training dataset. Our code and checkpoint are freely available at this https URL

Title: CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

Authors: Leon Sick, Dominik Engel, Sebastian Hartwig, Pedro Hermosilla, Timo Ropinski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16319
Pdf URL: https://arxiv.org/pdf/2411.16319
Copy Paste: [[2411.16319]] CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation(https://arxiv.org/abs/2411.16319)
Keywords: segmentation
Abstract: Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.

Title: Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Authors: Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2411.16337
Pdf URL: https://arxiv.org/pdf/2411.16337
Copy Paste: [[2411.16337]] Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring(https://arxiv.org/abs/2411.16337)
Keywords: generative, large language model
Abstract: The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.

Title: Preference Optimization for Reasoning with Pseudo Feedback

Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16345
Pdf URL: https://arxiv.org/pdf/2411.16345
Copy Paste: [[2411.16345]] Preference Optimization for Reasoning with Pseudo Feedback(https://arxiv.org/abs/2411.16345)
Keywords: large language model
Abstract: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reason problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multi-test-case. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.

Title: Towards Foundation Models for Critical Care Time Series

Authors: Manuel Burger, Fedor Sergeev, Malte Londschien, Daphné Chopard, Hugo Yèche, Eike Gerdes, Polina Leshetkina, Alexander Morgenroth, Zeynep Babür, Jasmina Bogojeska, Martin Faltys, Rita Kuznetsova, Gunnar Rätsch
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16346
Pdf URL: https://arxiv.org/pdf/2411.16346
Copy Paste: [[2411.16346]] Towards Foundation Models for Critical Care Time Series(https://arxiv.org/abs/2411.16346)
Keywords: robust, large language model
Abstract: Notable progress has been made in generalist medical large language models across various healthcare areas. However, large-scale modeling of in-hospital time series data - such as vital signs, lab results, and treatments in critical care - remains underexplored. Existing datasets are relatively small, but combining them can enhance patient diversity and improve model robustness. To effectively utilize these combined datasets for large-scale modeling, it is essential to address the distribution shifts caused by varying treatment policies, necessitating the harmonization of treatment variables across the different datasets. This work aims to establish a foundation for training large-scale multi-variate time series models on critical care data and to provide a benchmark for machine learning models in transfer learning across hospitals to study and address distribution shift challenges. We introduce a harmonized dataset for sequence modeling and transfer learning research, representing the first large-scale collection to include core treatment variables. Future plans involve expanding this dataset to support further advancements in transfer learning and the development of scalable, generalizable models for critical healthcare applications.

Title: Machine learning for cerebral blood vessels' malformations

Authors: Irem Topal, Alexander Cherevko, Yuri Bugay, Maxim Shishlenin, Jean Barbier, Deniz Eroglu, Édgar Roldán, Roman Belousov
Subjects: cs.LG, cond-mat.stat-mech, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.16349
Pdf URL: https://arxiv.org/pdf/2411.16349
Copy Paste: [[2411.16349]] Machine learning for cerebral blood vessels' malformations(https://arxiv.org/abs/2411.16349)
Keywords: robust
Abstract: Cerebral aneurysms and arteriovenous malformations are life-threatening hemodynamic pathologies of the brain. While surgical intervention is often essential to prevent fatal outcomes, it carries significant risks both during the procedure and in the postoperative period, making the management of these conditions highly challenging. Parameters of cerebral blood flow, routinely monitored during medical interventions, could potentially be utilized in machine learning-assisted protocols for risk assessment and therapeutic prognosis. To this end, we developed a linear oscillatory model of blood velocity and pressure for clinical data acquired from neurosurgical operations. Using the method of Sparse Identification of Nonlinear Dynamics (SINDy), the parameters of our model can be reconstructed online within milliseconds from a short time series of the hemodynamic variables. The identified parameter values enable automated classification of the blood-flow pathologies by means of logistic regression, achieving an accuracy of 73 %. Our results demonstrate the potential of this model for both diagnostic and prognostic applications, providing a robust and interpretable framework for assessing cerebral blood vessel conditions.

Title: A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

Authors: M.M.A. Valiuddin, R.J.G. van Sloun, C.G.A. Viviers, P.H.N. de With, F. van der Sommen
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16370
Pdf URL: https://arxiv.org/pdf/2411.16370
Copy Paste: [[2411.16370]] A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation(https://arxiv.org/abs/2411.16370)
Keywords: segmentation
Abstract: Advancements in image segmentation play an integral role within the greater scope of Deep Learning-based computer vision. Furthermore, their widespread applicability in critical real-world tasks has given rise to challenges related to the reliability of such algorithms. Hence, uncertainty quantification has been extensively studied within this context, enabling expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision making. Due to the rapid adoption of Convolutional Neural Network (CNN)-based segmentation models in high-stake applications, a substantial body of research has been published on this very topic, causing its swift expansion into a distinct field. This work provides a comprehensive overview of probabilistic segmentation by discussing fundamental concepts in uncertainty that govern advancements in the field as well as the application to various tasks. We identify that quantifying aleatoric and epistemic uncertainty approximates Bayesian inference w.r.t. to either latent variables or model parameters, respectively. Moreover, literature on both uncertainties trace back to four key applications; (1) to quantify statistical inconsistencies in the annotation process due ambiguous images, (2) correlating prediction error with uncertainty, (3) expanding the model hypothesis space for better generalization, and (4) active learning. Then, a discussion follows that includes an overview of utilized datasets for each of the applications and comparison of the available methods. We also highlight challenges related to architectures, uncertainty-based active learning, standardization and benchmarking, and recommendations for future work such as methods based on single forward passes and models that appropriately leverage volumetric data.

Title: Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16375
Pdf URL: https://arxiv.org/pdf/2411.16375
Copy Paste: [[2411.16375]] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing(https://arxiv.org/abs/2411.16375)
Keywords: diffusion
Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at this https URL

Title: FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

Authors: Cheng-Wei Lin, Wan-Hsuan Hsieh, Kai-Xin Guan, Chan-Jan Hsu, Chia-Chen Kuo, Chuan-Lin Lai, Chung-Wei Chung, Ming-Jen Wang, Da-Shan Shiu
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2411.16387
Pdf URL: https://arxiv.org/pdf/2411.16387
Copy Paste: [[2411.16387]] FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web(https://arxiv.org/abs/2411.16387)
Keywords: large language model
Abstract: The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts in the curation of such a dataset for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon this foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We came up with multiple stages of meticulously designed filters to cater to the linguistic difference between English and Traditional Chinese, to ensure comprehensiveness and quality. We determined effectiveness from querying dataset samples with three main objectives. Our code and datasets are publicly available.

Title: Human-Calibrated Automated Testing and Validation of Generative Language Models

Authors: Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16391
Pdf URL: https://arxiv.org/pdf/2411.16391
Copy Paste: [[2411.16391]] Human-Calibrated Automated Testing and Validation of Generative Language Models(https://arxiv.org/abs/2411.16391)
Keywords: robust, generative
Abstract: This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.

Title: A Survey of Blockchain-Based Privacy Applications: An Analysis of Consent Management and Self-Sovereign Identity Approaches

Authors: Rodrigo Dutra Garcia, Gowri Ramachandran, Kealan Dunnett, Raja Jurdak, Caetano Ranieri, Bhaskar Krishnamachari, Jo Ueyama
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2411.16404
Pdf URL: https://arxiv.org/pdf/2411.16404
Copy Paste: [[2411.16404]] A Survey of Blockchain-Based Privacy Applications: An Analysis of Consent Management and Self-Sovereign Identity Approaches(https://arxiv.org/abs/2411.16404)
Keywords: privacy, protect
Abstract: Modern distributed applications in healthcare, supply chain, and the Internet of Things handle a large amount of data in a diverse application setting with multiple stakeholders. Such applications leverage advanced artificial intelligence (AI) and machine learning algorithms to automate business processes. The proliferation of modern AI technologies increases the data demand. However, real-world networks often include private and sensitive information of businesses, users, and other organizations. Emerging data-protection regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) introduce policies around collecting, storing, and managing digital data. While Blockchain technology offers transparency, auditability, and immutability for multi-stakeholder applications, it lacks inherent support for privacy. Typically, privacy support is added to a blockchain-based application by incorporating cryptographic schemes, consent mechanisms, and self-sovereign identity. This article surveys the literature on blockchain-based privacy-preserving systems and identifies the tools for protecting privacy. Besides, consent mechanisms and identity management in the context of blockchain-based systems are also analyzed. The article concludes by highlighting the list of open challenges and further research opportunities.

Title: Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN

Authors: Elona Shatri, Kalikidhar Palavala, George Fazekas
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2411.16405
Pdf URL: https://arxiv.org/pdf/2411.16405
Copy Paste: [[2411.16405]] Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN(https://arxiv.org/abs/2411.16405)
Keywords: generative
Abstract: The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.

Title: A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models

Authors: Manuel Schwonberg, Claus Werner, Hanno Gottschalk, Carsten Meyer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16407
Pdf URL: https://arxiv.org/pdf/2411.16407
Copy Paste: [[2411.16407]] A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models(https://arxiv.org/abs/2411.16407)
Keywords: segmentation
Abstract: Despite the recent progress in deep learning based computer vision, domain shifts are still one of the major challenges. Semantic segmentation for autonomous driving faces a wide range of domain shifts, e.g. caused by changing weather conditions, new geolocations and the frequent use of synthetic data in model training. Unsupervised domain adaptation (UDA) methods have emerged which adapt a model to a new target domain by only using unlabeled data of that domain. The variety of UDA methods is large but all of them use ImageNet pre-trained models. Recently, vision-language models have demonstrated strong generalization capabilities which may facilitate domain adaptation. We show that simply replacing the encoder of existing UDA methods like DACS by a vision-language pre-trained encoder can result in significant performance improvements of up to 10.0% mIoU on the GTA5-to-Cityscapes domain shift. For the generalization performance to unseen domains, the newly employed vision-language pre-trained encoder provides a gain of up to 13.7% mIoU across three unseen datasets. However, we find that not all UDA methods can be easily paired with the new encoder and that the UDA performance does not always likewise transfer into generalization performance. Finally, we perform our experiments on an adverse weather condition domain shift to further verify our findings on a pure real-to-real domain shift.

Title: Unsupervised Event Outlier Detection in Continuous Time

Authors: Somjit Nath, Yik Chau Lui, Siqi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16427
Pdf URL: https://arxiv.org/pdf/2411.16427
Copy Paste: [[2411.16427]] Unsupervised Event Outlier Detection in Continuous Time(https://arxiv.org/abs/2411.16427)
Keywords: generative
Abstract: Event sequence data record the occurrences of events in continuous time. Event sequence forecasting based on temporal point processes (TPPs) has been extensively studied, but outlier or anomaly detection, especially without any supervision from humans, is still underexplored. In this work, we develop, to the best our knowledge, the first unsupervised outlier detection approach to detecting abnormal events. Our novel unsupervised outlier detection framework is based on ideas from generative adversarial networks (GANs) and reinforcement learning (RL). We train a 'generator' that corrects outliers in the data with a 'discriminator' that learns to discriminate the corrected data from the real data, which may contain outliers. A key insight is that if the generator made a mistake in the correction, it would generate anomalies that are different from the anomalies in the real data, so it serves as data augmentation for the discriminator learning. Different from typical GAN-based outlier detection approaches, our method employs the generator to detect outliers in an online manner. The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.

Title: Finding Structure in Language Models

Authors: Jaap Jumelet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16433
Pdf URL: https://arxiv.org/pdf/2411.16433
Copy Paste: [[2411.16433]] Finding Structure in Language Models(https://arxiv.org/abs/2411.16433)
Keywords: interpretability
Abstract: When we speak, write or listen, we continuously make predictions based on our knowledge of a language's grammar. Remarkably, children acquire this grammatical knowledge within just a few years, enabling them to understand and generalise to novel constructions that have never been uttered before. Language models are powerful tools that create representations of language by incrementally predicting the next word in a sentence, and they have had a tremendous societal impact in recent years. The central research question of this thesis is whether these models possess a deep understanding of grammatical structure similar to that of humans. This question lies at the intersection of natural language processing, linguistics, and interpretability. To address it, we will develop novel interpretability techniques that enhance our understanding of the complex nature of large-scale language models. We approach our research question from three directions. First, we explore the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics for uncovering grammatical structure in human language processing. Next, we examine various linguistic phenomena, such as adjective order and negative polarity items, and connect a model's comprehension of these phenomena to the data distribution on which it was trained. Finally, we introduce a controlled testbed for studying hierarchical structure in language models using various synthetic languages of increasing complexity and examine the role of feature interactions in modelling this structure. Our findings offer a detailed account of the grammatical knowledge embedded in language model representations and provide several directions for investigating fundamental linguistic questions using computational methods.

Title: Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack

Authors: Xide Xu, Muhammad Atif Butt, Sandesh Kamath, Bogdan Raducanu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16437
Pdf URL: https://arxiv.org/pdf/2411.16437
Copy Paste: [[2411.16437]] Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack(https://arxiv.org/abs/2411.16437)
Keywords: privacy, protect, attack, diffusion
Abstract: The growing demand for customized visual content has led to the rise of personalized text-to-image (T2I) diffusion models. Despite their remarkable potential, they pose significant privacy risk when misused for malicious purposes. In this paper, we propose a novel and efficient adversarial attack method, Concept Protection by Selective Attention Manipulation (CoPSAM) which targets only the cross-attention layers of a T2I diffusion model. For this purpose, we carefully construct an imperceptible noise to be added to clean samples to get their adversarial counterparts. This is obtained during the fine-tuning process by maximizing the discrepancy between the corresponding cross-attention maps of the user-specific token and the class-specific token, respectively. Experimental validation on a subset of CelebA-HQ face images dataset demonstrates that our approach outperforms existing methods. Besides this, our method presents two important advantages derived from the qualitative evaluation: (i) we obtain better protection results for lower noise levels than our competitors; and (ii) we protect the content from unauthorized use thereby protecting the individual's identity from potential misuse.

Title: AnonyNoise: Anonymizing Event Data with Smart Noise to Outsmart Re-Identification and Preserve Privacy

Authors: Katharina Bendig, René Schuster, Nicole Thiemer, Karen Joisten, Didier Stricker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16440
Pdf URL: https://arxiv.org/pdf/2411.16440
Copy Paste: [[2411.16440]] AnonyNoise: Anonymizing Event Data with Smart Noise to Outsmart Re-Identification and Preserve Privacy(https://arxiv.org/abs/2411.16440)
Keywords: privacy, attack, robust
Abstract: The increasing capabilities of deep neural networks for re-identification, combined with the rise in public surveillance in recent years, pose a substantial threat to individual privacy. Event cameras were initially considered as a promising solution since their output is sparse and therefore difficult for humans to interpret. However, recent advances in deep learning proof that neural networks are able to reconstruct high-quality grayscale images and re-identify individuals using data from event cameras. In our paper, we contribute a crucial ethical discussion on data privacy and present the first event anonymization pipeline to prevent re-identification not only by humans but also by neural networks. Our method effectively introduces learnable data-dependent noise to cover personally identifiable information in raw event data, reducing attackers' re-identification capabilities by up to 60%, while maintaining substantial information for the performing of downstream tasks. Moreover, our anonymization generalizes well on unseen data and is robust against image reconstruction and inversion attacks. Code: this https URL

Title: TIFeD: a Tiny Integer-based Federated learning algorithm with Direct feedback alignment

Authors: Luca Colombo, Alessandro Falcetta, Manuel Roveri
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16442
Pdf URL: https://arxiv.org/pdf/2411.16442
Copy Paste: [[2411.16442]] TIFeD: a Tiny Integer-based Federated learning algorithm with Direct feedback alignment(https://arxiv.org/abs/2411.16442)
Keywords: federate
Abstract: Training machine and deep learning models directly on extremely resource-constrained devices is the next challenge in the field of tiny machine learning. The related literature in this field is very limited, since most of the solutions focus only on on-device inference or model adaptation through online learning, leaving the training to be carried out on external Cloud services. An interesting technological perspective is to exploit Federated Learning (FL), which allows multiple devices to collaboratively train a shared model in a distributed way. However, the main drawback of state-of-the-art FL algorithms is that they are not suitable for running on tiny devices. For the first time in the literature, in this paper we introduce TIFeD, a Tiny Integer-based Federated learning algorithm with Direct Feedback Alignment (DFA) entirely implemented by using an integer-only arithmetic and being specifically designed to operate on devices with limited resources in terms of memory, computation and energy. Besides the traditional full-network operating modality, in which each device of the FL setting trains the entire neural network on its own local data, we propose an innovative single-layer TIFeD implementation, which enables each device to train only a portion of the neural network model and opens the door to a new way of distributing the learning procedure across multiple devices. The experimental results show the feasibility and effectiveness of the proposed solution. The proposed TIFeD algorithm, with its full-network and single-layer implementations, is made available to the scientific community as a public repository.

Title: VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation

Authors: Jiawei Wang, Zhiming Cui, Changjian Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16446
Pdf URL: https://arxiv.org/pdf/2411.16446
Copy Paste: [[2411.16446]] VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation(https://arxiv.org/abs/2411.16446)
Keywords: transformer
Abstract: This paper presents VQ-SGen, a novel algorithm for high-quality sketch generation. Recent approaches have often framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in the first stage, we decouple each stroke's shape and location information to ensure the VQ representation prioritizes stroke shape learning. In the second stage, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as conditional generation and semantic-aware stroke editing. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques, underscoring its effectiveness. The code and model will be made publicly available upon publication.

Title: Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval

Authors: Xiaocong Yang, Jiacheng Lin, Ziqi Wang, Chengxiang Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16454
Pdf URL: https://arxiv.org/pdf/2411.16454
Copy Paste: [[2411.16454]] Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval(https://arxiv.org/abs/2411.16454)
Keywords: large language model
Abstract: Large language models (LLMs) are known to struggle with complicated reasoning tasks such as math word problems (MWPs). In this paper, we present how analogy from similarly structured questions can improve LLMs' problem-solving capabilities for MWPs. Specifically, we rely on the retrieval of problems with similar computational graphs to the given question to serve as exemplars in the prompt, providing the correct reasoning path for the generation model to refer to. Empirical results across six math word problem datasets demonstrate the effectiveness of our proposed method, which achieves a significant improvement of up to 6.7 percent on average in absolute value, compared to baseline methods. These results highlight our method's potential in addressing the reasoning challenges in current LLMs.

Title: On the Reconstruction of Training Data from Group Invariant Networks

Authors: Ran Elbaz, Gilad Yehudai, Meirav Galun, Haggai Maron
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16458
Pdf URL: https://arxiv.org/pdf/2411.16458
Copy Paste: [[2411.16458]] On the Reconstruction of Training Data from Group Invariant Networks(https://arxiv.org/abs/2411.16458)
Keywords: privacy, explainability
Abstract: Reconstructing training data from trained neural networks is an active area of research with significant implications for privacy and explainability. Recent advances have demonstrated the feasibility of this process for several data types. However, reconstructing data from group-invariant neural networks poses distinct challenges that remain largely unexplored. This paper addresses this gap by first formulating the problem and discussing some of its basic properties. We then provide an experimental evaluation demonstrating that conventional reconstruction techniques are inadequate in this scenario. Specifically, we observe that the resulting data reconstructions gravitate toward symmetric inputs on which the group acts trivially, leading to poor-quality results. Finally, we propose two novel methods aiming to improve reconstruction in this setup and present promising preliminary experimental results. Our work sheds light on the complexities of reconstructing data from group invariant neural networks and offers potential avenues for future research in this domain.

Title: No Identity, no problem: Motion through detection for people tracking

Authors: Martin Engilberge, F. Wilke Grosche, Pascal Fua
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16466
Pdf URL: https://arxiv.org/pdf/2411.16466
Copy Paste: [[2411.16466]] No Identity, no problem: Motion through detection for people tracking(https://arxiv.org/abs/2411.16466)
Keywords: robust
Abstract: Tracking-by-detection has become the de facto standard approach to people tracking. To increase robustness, some approaches incorporate re-identification using appearance models and regressing motion offset, which requires costly identity annotations. In this paper, we propose exploiting motion clues while providing supervision only for the detections, which is much easier to do. Our algorithm predicts detection heatmaps at two different times, along with a 2D motion estimate between the two images. It then warps one heatmap using the motion estimate and enforces consistency with the other one. This provides the required supervisory signal on the motion without the need for any motion annotations. In this manner, we couple the information obtained from different images during training and increase accuracy, especially in crowded scenes and when using low frame-rate sequences. We show that our approach delivers state-of-the-art results for single- and multi-view multi-target tracking on the MOT17 and WILDTRACK datasets.

Title: Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

Authors: Yutong Wang, Jiajie Teng, Jiajiong Cao, Yuming Li, Chenguang Ma, Hongteng Xu, Dixin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16468
Pdf URL: https://arxiv.org/pdf/2411.16468
Copy Paste: [[2411.16468]] Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency(https://arxiv.org/abs/2411.16468)
Keywords: transformer
Abstract: As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from $i)$ long processing time and $ii)$ inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage \Rmnum{1}, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage \Rmnum{2}, we learn two transformers to lookup code from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at \url{this https URL}.

Title: Distributed Online Optimization with Stochastic Agent Availability

Authors: Juliette Achddou, Nicolò Cesa-Bianchi, Hao Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16477
Pdf URL: https://arxiv.org/pdf/2411.16477
Copy Paste: [[2411.16477]] Distributed Online Optimization with Stochastic Agent Availability(https://arxiv.org/abs/2411.16477)
Keywords: federate
Abstract: Motivated by practical federated learning settings where clients may not be always available, we investigate a variant of distributed online optimization where agents are active with a known probability $p$ at each time step, and communication between neighboring agents can only take place if they are both active. We introduce a distributed variant of the FTRL algorithm and analyze its network regret, defined through the average of the instantaneous regret of the active agents. Our analysis shows that, for any connected communication graph $G$ over $N$ agents, the expected network regret of our FTRL variant after $T$ steps is at most of order $(\kappa/p^2)\min\big\{\sqrt{N},N^{1/4}/\sqrt{p}\big\}\sqrt{T}$, where $\kappa$ is the condition number of the Laplacian of $G$. We then show that similar regret bounds also hold with high probability. Moreover, we show that our notion of regret (average-case over the agents) is essentially equivalent to the standard notion of regret (worst-case over agents), implying that our bounds are not significantly improvable when $p=1$. Our theoretical results are supported by experiments on synthetic datasets.

Title: Distributed, communication-efficient, and differentially private estimation of KL divergence

Authors: Mary Scott, Sayan Biswas, Graham Cormode, Carsten Maple
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2411.16478
Pdf URL: https://arxiv.org/pdf/2411.16478
Copy Paste: [[2411.16478]] Distributed, communication-efficient, and differentially private estimation of KL divergence(https://arxiv.org/abs/2411.16478)
Keywords: privacy, federate
Abstract: A key task in managing distributed, sensitive data is to measure the extent to which a distribution changes. Understanding this drift can effectively support a variety of federated learning and analytics tasks. However, in many practical settings sharing such information can be undesirable (e.g., for privacy concerns) or infeasible (e.g., for high communication costs). In this work, we describe novel algorithmic approaches for estimating the KL divergence of data across federated models of computation, under differential privacy. We analyze their theoretical properties and present an empirical study of their performance. We explore parameter settings that optimize the accuracy of the algorithm catering to each of the settings; these provide sub-variations that are applicable to real-world tasks, addressing different context- and application-specific trust level requirements. Our experimental results confirm that our private estimators achieve accuracy comparable to a baseline algorithm without differential privacy guarantees.

Title: Deformable Mamba for Wide Field of View Segmentation

Authors: Jie Hu, Junwei Zheng, Jiale Wei, Jiaming Zhang, Rainer Stiefelhagen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16481
Pdf URL: https://arxiv.org/pdf/2411.16481
Copy Paste: [[2411.16481]] Deformable Mamba for Wide Field of View Segmentation(https://arxiv.org/abs/2411.16481)
Keywords: segmentation
Abstract: Wide-FoV cameras, like fisheye and panoramic setups, are essential for broader perception but introduce significant distortions in 180° and 360° images, complicating dense prediction tasks. For instance, existing MAMBA models lacking distortion-aware capacity cannot perform well in panoramic semantic segmentation. To address this problem, this work presents Deformable Mamba, a unified framework specifically designed to address imaging distortions within the context of panoramic and fisheye semantic segmentation. At the core is a decoder constructed with a series of Deformable Mamba Fusion (DMF) blocks, making the whole framework more deformable, efficient, and accurate, when handling extreme distortions. Extensive evaluations across five datasets demonstrate that our method consistently improves segmentation accuracy compared to the previous state-of-the-art methods tailored for specific FoVs. Notably, Deformable Mamba achieves a +2.5% performance improvement on the 360° Stanford2D3D dataset, and shows better results across FoVs from 60° to 360°.

Title: AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning

Authors: Amy Xin, Jinxin Liu, Zijun Yao, Zhicheng Li, Shulin Cao, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16495
Pdf URL: https://arxiv.org/pdf/2411.16495
Copy Paste: [[2411.16495]] AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning(https://arxiv.org/abs/2411.16495)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have led to significant improvements in various natural language processing tasks, but it is still challenging for LLMs to perform knowledge-intensive complex question answering due to LLMs' inefficacy in reasoning planning and the hallucination problem. A typical solution is to employ retrieval-augmented generation (RAG) coupled with chain-of-thought (CoT) reasoning, which decomposes complex questions into chain-like sub-questions and applies iterative RAG at each sub-question. However, prior works exhibit sub-optimal reasoning planning and overlook dynamic knowledge retrieval from heterogeneous sources. In this paper, we propose AtomR, a novel heterogeneous knowledge reasoning framework that conducts multi-source reasoning at the atomic level. Drawing inspiration from the graph modeling of knowledge, AtomR leverages large language models (LLMs) to decompose complex questions into combinations of three atomic knowledge operators, significantly enhancing the reasoning process at both the planning and execution stages. We also introduce BlendQA, a novel evaluation benchmark tailored to assess complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines across three single-source and two multi-source reasoning benchmarks, with notable performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.

Title: Multi-Resolution Generative Modeling of Human Motion from Limited Data

Authors: David Eduardo Moreno-Villamarín, Anna Hilsmann, Peter Eisert
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16498
Pdf URL: https://arxiv.org/pdf/2411.16498
Copy Paste: [[2411.16498]] Multi-Resolution Generative Modeling of Human Motion from Limited Data(https://arxiv.org/abs/2411.16498)
Keywords: generative
Abstract: We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model's ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.

Title: Interpreting Language Reward Models via Contrastive Explanations

Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16502
Pdf URL: https://arxiv.org/pdf/2411.16502
Copy Paste: [[2411.16502]] Interpreting Language Reward Models via Contrastive Explanations(https://arxiv.org/abs/2411.16502)
Keywords: large language model
Abstract: Reward models (RMs) are a crucial component in the alignment of large language models' (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.

Title: Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Authors: Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, Yao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16503
Pdf URL: https://arxiv.org/pdf/2411.16503
Copy Paste: [[2411.16503]] Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis(https://arxiv.org/abs/2411.16503)
Keywords: diffusion
Abstract: Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A latest approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models. The code is available at this https URL.

Title: All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

Authors: Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Khan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16508
Pdf URL: https://arxiv.org/pdf/2411.16508
Copy Paste: [[2411.16508]] All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages(https://arxiv.org/abs/2411.16508)
Keywords: robust
Abstract: Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.

Title: Guarding the Gate: ConceptGuard Battles Concept-Level Backdoors in Concept Bottleneck Models

Authors: Songning Lai, Yu Huang, Jiayu Yang, Gaoxiang Huang, Wenshuo Chen, Yutao Yue
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16512
Pdf URL: https://arxiv.org/pdf/2411.16512
Copy Paste: [[2411.16512]] Guarding the Gate: ConceptGuard Battles Concept-Level Backdoors in Concept Bottleneck Models(https://arxiv.org/abs/2411.16512)
Keywords: secure, security, protect, defense, attack, robust, interpretability
Abstract: The increasing complexity of AI models, especially in deep learning, has raised concerns about transparency and accountability, particularly in high-stakes applications like medical diagnostics, where opaque models can undermine trust. Explainable Artificial Intelligence (XAI) aims to address these issues by providing clear, interpretable models. Among XAI techniques, Concept Bottleneck Models (CBMs) enhance transparency by using high-level semantic concepts. However, CBMs are vulnerable to concept-level backdoor attacks, which inject hidden triggers into these concepts, leading to undetectable anomalous behavior. To address this critical security gap, we introduce ConceptGuard, a novel defense framework specifically designed to protect CBMs from concept-level backdoor attacks. ConceptGuard employs a multi-stage approach, including concept clustering based on text distance measurements and a voting mechanism among classifiers trained on different concept subgroups, to isolate and mitigate potential triggers. Our contributions are threefold: (i) we present ConceptGuard as the first defense mechanism tailored for concept-level backdoor attacks in CBMs; (ii) we provide theoretical guarantees that ConceptGuard can effectively defend against such attacks within a certain trigger size threshold, ensuring robustness; and (iii) we demonstrate that ConceptGuard maintains the high performance and interpretability of CBMs, crucial for trustworthiness. Through comprehensive experiments and theoretical proofs, we show that ConceptGuard significantly enhances the security and trustworthiness of CBMs, paving the way for their secure deployment in critical applications.

Title: Curator Attack: When Blackbox Differential Privacy Auditing Loses Its Power

Authors: Shiming Wang, Liyao Xiang, Bowei Cheng, Zhe Ji, Tianran Sun, Xinbing Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.16516
Pdf URL: https://arxiv.org/pdf/2411.16516
Copy Paste: [[2411.16516]] Curator Attack: When Blackbox Differential Privacy Auditing Loses Its Power(https://arxiv.org/abs/2411.16516)
Keywords: privacy, attack
Abstract: A surge in data-driven applications enhances everyday life but also raises serious concerns about private information leakage. Hence many privacy auditing tools are emerging for checking if the data sanitization performed meets the privacy standard of the data owner. Blackbox auditing for differential privacy is particularly gaining popularity for its effectiveness and applicability to a wide range of scenarios. Yet, we identified that blackbox auditing is essentially flawed with its setting: small probabilities or densities are ignored due to inaccurate observation. Our argument is based on a solid false positive analysis from a hypothesis testing perspective, which is missed out by prior blackbox auditing tools. This oversight greatly reduces the reliability of these tools, as it allows malicious or incapable data curators to pass the auditing with an overstated privacy guarantee, posing significant risks to data owners. We demonstrate the practical existence of such threats in classical differential privacy mechanisms against four representative blackbox auditors with experimental validations. Our findings aim to reveal the limitations of blackbox auditing tools, empower the data owner with the awareness of risks in using these tools, and encourage the development of more reliable differential privacy auditing methods.

Title: Poster: From Fort to Foe: The Threat of RCE in RPKI

Authors: Oliver Jacobsen, Haya Schulmann, Niklas Vogel, Michael Waidner
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.16518
Pdf URL: https://arxiv.org/pdf/2411.16518
Copy Paste: [[2411.16518]] Poster: From Fort to Foe: The Threat of RCE in RPKI(https://arxiv.org/abs/2411.16518)
Keywords: attack
Abstract: In this work, we present a novel severe buffer-overflow vulnerability in the RPKI validator Fort, that allows an attacker to achieve Remote Code Execution (RCE) on the machine running the software. We discuss the unique impact of this RCE on networks that use RPKI, illustrating that RCE vulnerabilities are especially severe in the context of RPKI. The design of RPKI makes RCE easy to exploit on a large scale, allows compromise of RPKI validation integrity, and enables a powerful vector for additional attacks on other critical components of the network, like the border routers. We analyze the vulnerability exposing to this RCE and identify indications that the discovered vulnerability could constitute an intentional backdoor to compromise systems running the software over a benign coding mistake. We disclosed the vulnerability, which has been assigned a CVE rated 9.8 critical (CVE-2024-45237).

Title: LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation

Authors: Steven Song, Anirudh Subramanyam, Irene Madejski, Robert L. Grossman
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16523
Pdf URL: https://arxiv.org/pdf/2411.16523
Copy Paste: [[2411.16523]] LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation(https://arxiv.org/abs/2411.16523)
Keywords: generative, large language model
Abstract: In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.

Title: Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

Authors: Jerry Yao-Chieh Hu, Wei-Po Wang, Ammar Gilani, Chenyang Li, Zhao Song, Han Liu
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16525
Pdf URL: https://arxiv.org/pdf/2411.16525
Copy Paste: [[2411.16525]] Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency(https://arxiv.org/abs/2411.16525)
Keywords: transformer
Abstract: We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are prompt tuning on \textit{single-head} transformers with only a \textit{single} self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers are universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the \textit{soft-prompt-induced} keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.

Title: Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Authors: Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16527
Pdf URL: https://arxiv.org/pdf/2411.16527
Copy Paste: [[2411.16527]] Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings(https://arxiv.org/abs/2411.16527)
Keywords: large language model
Abstract: Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI), however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuition and use case for exposing and visualizing bias.

Title: Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training

Authors: Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16549
Pdf URL: https://arxiv.org/pdf/2411.16549
Copy Paste: [[2411.16549]] Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training(https://arxiv.org/abs/2411.16549)
Keywords: transformer
Abstract: We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.

Title: Representation Collapsing Problems in Vector Quantization

Authors: Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16550
Pdf URL: https://arxiv.org/pdf/2411.16550
Copy Paste: [[2411.16550]] Representation Collapsing Problems in Vector Quantization(https://arxiv.org/abs/2411.16550)
Keywords: diffusion, generative, large language model
Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. By leveraging both synthetic and real datasets, we identify the severity of each type of collapses and triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in tokens collapse and embeddings collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.

Title: Generating Out-Of-Distribution Scenarios Using Language Models

Authors: Erfan Aasi, Phat Nguyen, Shiva Sreeram, Guy Rosman, Sertac Karaman, Daniela Rus
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16554
Pdf URL: https://arxiv.org/pdf/2411.16554
Copy Paste: [[2411.16554]] Generating Out-Of-Distribution Scenarios Using Language Models(https://arxiv.org/abs/2411.16554)
Keywords: robust, large language model
Abstract: The deployment of autonomous vehicles controlled by machine learning techniques requires extensive testing in diverse real-world environments, robust handling of edge cases and out-of-distribution scenarios, and comprehensive safety validation to ensure that these systems can navigate safely and effectively under unpredictable conditions. Addressing Out-Of-Distribution (OOD) driving scenarios is essential for enhancing safety, as OOD scenarios help validate the reliability of the models within the vehicle's autonomy stack. However, generating OOD scenarios is challenging due to their long-tailed distribution and rarity in urban driving dataset. Recently, Large Language Models (LLMs) have shown promise in autonomous driving, particularly for their zero-shot generalization and common-sense reasoning capabilities. In this paper, we leverage these LLM strengths to introduce a framework for generating diverse OOD driving scenarios. Our approach uses LLMs to construct a branching tree, where each branch represents a unique OOD scenario. These scenarios are then simulated in the CARLA simulator using an automated framework that aligns scene augmentation with the corresponding textual descriptions. We evaluate our framework through extensive simulations, and assess its performance via a diversity metric that measures the richness of the scenarios. Additionally, we introduce a new "OOD-ness" metric, which quantifies how much the generated scenarios deviate from typical urban driving conditions. Furthermore, we explore the capacity of modern Vision-Language Models (VLMs) to interpret and safely navigate through the simulated OOD scenarios. Our findings offer valuable insights into the reliability of language models in addressing OOD scenarios within the context of urban driving.

Title: Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches

Authors: Yinqiu Feng, Aoran Shen, Jiacheng Hu, Yingbin Liang, Shiru Wang, Junliang Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16567
Pdf URL: https://arxiv.org/pdf/2411.16567
Copy Paste: [[2411.16567]] Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches(https://arxiv.org/abs/2411.16567)
Keywords: generative
Abstract: This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning in a framework designed to tackle the challenges posed by small-sample data. Recognizing the critical limitations of traditional machine learning models that require large datasets-especially in fields such as drug discovery, target recognition, and malicious traffic detection-this study proposes a novel strategy that leverages Generative Adversarial Networks (GANs) and advanced optimization techniques to improve model performance with limited data. Specifically, the paper addresses the noise and bias issues introduced by data augmentation methods, contrasting them with model-based approaches, such as fine-tuning and metric learning, which rely heavily on related datasets. By combining Markov Chain Monte Carlo (MCMC) sampling and discriminative model ensemble strategies within a GAN framework, the proposed model adjusts generative and discriminative distributions to simulate a broader range of relevant data. Furthermore, it employs MHLoss and a reparameterized GAN ensemble to enhance stability and accelerate convergence, ultimately leading to improved classification performance on small-sample images and structured datasets. Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning, offering a practical solution that bridges data scarcity with high-performing model adaptability and generalization.

Title: J-CaPA : Joint Channel and Pyramid Attention Improves Medical Image Segmentation

Authors: Marzia Binta Nizam, Marian Zlateva, James Davis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16568
Pdf URL: https://arxiv.org/pdf/2411.16568
Copy Paste: [[2411.16568]] J-CaPA : Joint Channel and Pyramid Attention Improves Medical Image Segmentation(https://arxiv.org/abs/2411.16568)
Keywords: extraction, transformer, segmentation
Abstract: Medical image segmentation is crucial for diagnosis and treatment planning. Traditional CNN-based models, like U-Net, have shown promising results but struggle to capture long-range dependencies and global context. To address these limitations, we propose a transformer-based architecture that jointly applies Channel Attention and Pyramid Attention mechanisms to improve multi-scale feature extraction and enhance segmentation performance for medical images. Increasing model complexity requires more training data, and we further improve model generalization with CutMix data augmentation. Our approach is evaluated on the Synapse multi-organ segmentation dataset, achieving a 6.9% improvement in Mean Dice score and a 39.9% improvement in Hausdorff Distance (HD95) over an implementation without our enhancements. Our proposed model demonstrates improved segmentation accuracy for complex anatomical structures, outperforming existing state-of-the-art methods.

Title: Rethinking Diffusion for Text-Driven Human Motion Generation

Authors: Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, Huaizu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16575
Pdf URL: https://arxiv.org/pdf/2411.16575
Copy Paste: [[2411.16575]] Rethinking Diffusion for Text-Driven Human Motion Generation(https://arxiv.org/abs/2411.16575)
Keywords: robust, fair, diffusion
Abstract: Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous space generation nature of diffusion-based methods makes them well-suited to address these limitations and with even potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model enabled to perform bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we also propose more robust evaluation methods to fairly assess different-based methods. Extensive experiments on benchmark human motion generation datasets demonstrate that our method excels previous methods and achieves state-of-the-art performances.

Title: Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Authors: Zhiheng Xi, Dingwen Yang, Jixuan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, Wei He, Boyang Hong, Shihan Do, Wenyu Zhan, Xiao Wang, Rui Zheng, Tao Ji, Xiaowei Shi, Yitao Zhai, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Zuxuan Wu, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16579
Pdf URL: https://arxiv.org/pdf/2411.16579
Copy Paste: [[2411.16579]] Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision(https://arxiv.org/abs/2411.16579)
Keywords: large language model
Abstract: Training large language models (LLMs) to spend more time thinking and reflection before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of $76,321$ responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor's self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at \href{this https URL}{this https URL}.

Title: Adversarial Attacks for Drift Detection

Authors: Fabian Hinder, Valerie Vaquet, Barbara Hammer
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16591
Pdf URL: https://arxiv.org/pdf/2411.16591
Copy Paste: [[2411.16591]] Adversarial Attacks for Drift Detection(https://arxiv.org/abs/2411.16591)
Keywords: attack, robust
Abstract: Concept drift refers to the change of data distributions over time. While drift poses a challenge for learning models, requiring their continual adaption, it is also relevant in system monitoring to detect malfunctions, system failures, and unexpected behavior. In the latter case, the robust and reliable detection of drifts is imperative. This work studies the shortcomings of commonly used drift detection schemes. We show how to construct data streams that are drifting without being detected. We refer to those as drift adversarials. In particular, we compute all possible adversairals for common detection schemes and underpin our theoretical findings with empirical evaluations.

Title: Unlocking The Potential of Adaptive Attacks on Diffusion-Based Purification

Authors: Andre Kassis, Urs Hengartner, Yaoliang Yu
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16598
Pdf URL: https://arxiv.org/pdf/2411.16598
Copy Paste: [[2411.16598]] Unlocking The Potential of Adaptive Attacks on Diffusion-Based Purification(https://arxiv.org/abs/2411.16598)
Keywords: protect, defense, attack, robust, watermark, diffusion
Abstract: Diffusion-based purification (DBP) is a defense against adversarial examples (AEs), amassing popularity for its ability to protect classifiers in an attack-oblivious manner and resistance to strong adversaries with access to the defense. Its robustness has been claimed to ensue from the reliance on diffusion models (DMs) that project the AEs onto the natural distribution. We revisit this claim, focusing on gradient-based strategies that back-propagate the loss gradients through the defense, commonly referred to as ``adaptive attacks". Analytically, we show that such an optimization method invalidates DBP's core foundations, effectively targeting the DM rather than the classifier and restricting the purified outputs to a distribution over malicious samples instead. Thus, we reassess the reported empirical robustness, uncovering implementation flaws in the gradient back-propagation techniques used thus far for DBP. We fix these issues, providing the first reliable gradient library for DBP and demonstrating how adaptive attacks drastically degrade its robustness. We then study a less efficient yet stricter majority-vote setting where the classifier evaluates multiple purified copies of the input to make its decision. Here, DBP's stochasticity enables it to remain partially robust against traditional norm-bounded AEs. We propose a novel adaptation of a recent optimization method against deepfake watermarking that crafts systemic malicious perturbations while ensuring imperceptibility. When integrated with the adaptive attack, it completely defeats DBP, even in the majority-vote setup. Our findings prove that DBP, in its current state, is not a viable defense against AEs.

Title: Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Authors: Ronghuan Wu, Wanchao Su, Jing Liao
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16602
Pdf URL: https://arxiv.org/pdf/2411.16602
Copy Paste: [[2411.16602]] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models(https://arxiv.org/abs/2411.16602)
Keywords: diffusion, large language model
Abstract: Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.

Title: Recent Trends in Linear Text Segmentation: a Survey

Authors: Iacopo Ghinassi, Lin Wang, Chris Newell, Matthew Purver
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16613
Pdf URL: https://arxiv.org/pdf/2411.16613
Copy Paste: [[2411.16613]] Recent Trends in Linear Text Segmentation: a Survey(https://arxiv.org/abs/2411.16613)
Keywords: segmentation
Abstract: Linear Text Segmentation is the task of automatically tagging text documents with topic shifts, i.e. the places in the text where the topics change. A well-established area of research in Natural Language Processing, drawing from well-understood concepts in linguistic and computational linguistic research, the field has recently seen a lot of interest as a result of the surge of text, video, and audio available on the web, which in turn require ways of summarising and categorizing the mole of content for which linear text segmentation is a fundamental step. In this survey, we provide an extensive overview of current advances in linear text segmentation, describing the state of the art in terms of resources and approaches for the task. Finally, we highlight the limitations of available resources and of the task itself, while indicating ways forward based on the most recent literature and under-explored research directions.

Title: GeoFormer: A Multi-Polygon Segmentation Transformer

Authors: Maxim Khomiakov, Michael Riis Andersen, Jes Frellsen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16616
Pdf URL: https://arxiv.org/pdf/2411.16616
Copy Paste: [[2411.16616]] GeoFormer: A Multi-Polygon Segmentation Transformer(https://arxiv.org/abs/2411.16616)
Keywords: robust, transformer, segmentation
Abstract: In remote sensing there exists a common need for learning scale invariant shapes of objects like buildings. Prior works relies on tweaking multiple loss functions to convert segmentation maps into the final scale invariant representation, necessitating arduous design and optimization. For this purpose we introduce the GeoFormer, a novel architecture which presents a remedy to the said challenges, learning to generate multipolygons end-to-end. By modeling keypoints as spatially dependent tokens in an auto-regressive manner, the GeoFormer outperforms existing works in delineating building objects from satellite imagery. We evaluate the robustness of the GeoFormer against former methods through a variety of parameter ablations and highlight the advantages of optimizing a single likelihood function. Our study presents the first successful application of auto-regressive transformer models for multi-polygon predictions in remote sensing, suggesting a promising methodological alternative for building vectorization.

Title: StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training

Authors: Kaustubh Ponkshe, Venkatapathy Subramanian, Natwar Modani, Ganesh Ramakrishnan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16618
Pdf URL: https://arxiv.org/pdf/2411.16618
Copy Paste: [[2411.16618]] StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training(https://arxiv.org/abs/2411.16618)
Keywords: transformer
Abstract: Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the exponential growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms, akin to full attention in being Turing-complete, have been theoretically established, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text through arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LM models, demonstrating their capacity to excel in more abstract tasks, such as document understanding.

Title: Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective

Authors: Jean Marie Tshimula, Xavier Ndona, D'Jeff K. Nkashama, Pierre-Martin Tardif, Froduald Kabanza, Marc Frappier, Shengrui Wang
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16642
Pdf URL: https://arxiv.org/pdf/2411.16642
Copy Paste: [[2411.16642]] Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective(https://arxiv.org/abs/2411.16642)
Keywords: security, protect, defense, extraction, large language model
Abstract: Jailbreak prompts pose a significant threat in AI and cybersecurity, as they are crafted to bypass ethical safeguards in large language models, potentially enabling misuse by cybercriminals. This paper analyzes jailbreak prompts from a cyber defense perspective, exploring techniques like prompt injection and context manipulation that allow harmful content generation, content filter evasion, and sensitive information extraction. We assess the impact of successful jailbreaks, from misinformation and automated social engineering to hazardous content creation, including bioweapons and explosives. To address these threats, we propose strategies involving advanced prompt analysis, dynamic safety protocols, and continuous model fine-tuning to strengthen AI resilience. Additionally, we highlight the need for collaboration among AI researchers, cybersecurity experts, and policymakers to set standards for protecting AI systems. Through case studies, we illustrate these cyber defense approaches, promoting responsible AI practices to maintain system integrity and public trust. \textbf{\color{red}Warning: This paper contains content which the reader may find offensive.}

Title: Exploring Discrete Flow Matching for 3D De Novo Molecule Generation

Authors: Ian Dunn, David R. Koes
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.16644
Pdf URL: https://arxiv.org/pdf/2411.16644
Copy Paste: [[2411.16644]] Exploring Discrete Flow Matching for 3D De Novo Molecule Generation(https://arxiv.org/abs/2411.16644)
Keywords: generative
Abstract: Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The seminal flow matching framework was developed only for continuous data. However, de novo molecular design tasks require generating discrete data such as atomic elements or sequences of amino acid residues. Several discrete flow matching methods have been proposed recently to address this gap. In this work we benchmark the performance of existing discrete flow matching methods for 3D de novo small molecule generation and provide explanations of their differing behavior. As a result we present FlowMol-CTMC, an open-source model that achieves state of the art performance for 3D de novo design with fewer learnable parameters than existing methods. Additionally, we propose the use of metrics that capture molecule quality beyond local chemical valency constraints and towards higher-order structural motifs. These metrics show that even though basic constraints are satisfied, the models tend to produce unusual and potentially problematic functional groups outside of the training data distribution. Code and trained models for reproducing this work are available at \url{this https URL}.

Title: Self-Generated Critiques Boost Reward Modeling for Language Models

Authors: Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16646
Pdf URL: https://arxiv.org/pdf/2411.16646
Copy Paste: [[2411.16646]] Self-Generated Critiques Boost Reward Modeling for Language Models(https://arxiv.org/abs/2411.16646)
Keywords: large language model
Abstract: Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.

Title: DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Authors: Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16657
Pdf URL: https://arxiv.org/pdf/2411.16657
Copy Paste: [[2411.16657]] DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation(https://arxiv.org/abs/2411.16657)
Keywords: robust, large language model
Abstract: Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.

Title: Diffusion Features for Zero-Shot 6DoF Object Pose Estimation

Authors: Bernd Von Gimborn, Philipp Ausserlechner, Markus Vincze, Stefan Thalhammer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16668
Pdf URL: https://arxiv.org/pdf/2411.16668
Copy Paste: [[2411.16668]] Diffusion Features for Zero-Shot 6DoF Object Pose Estimation(https://arxiv.org/abs/2411.16668)
Keywords: diffusion, transformer
Abstract: Zero-shot object pose estimation enables the retrieval of object poses from images without necessitating object-specific training. In recent approaches this is facilitated by vision foundation models (VFM), which are pre-trained models that are effectively general-purpose feature extractors. The characteristics exhibited by these VFMs vary depending on the training data, network architecture, and training paradigm. The prevailing choice in this field are self-supervised Vision Transformers (ViT). This study assesses the influence of Latent Diffusion Model (LDM) backbones on zero-shot pose estimation. In order to facilitate a comparison between the two families of models on a common ground we adopt and modify a recent approach. Therefore, a template-based multi-staged method for estimating poses in a zero-shot fashion using LDMs is presented. The efficacy of the proposed approach is empirically evaluated on three standard datasets for object-specific 6DoF pose estimation. The experiments demonstrate an Average Recall improvement of up to 27% over the ViT baseline. The source code is available at: this https URL.

Title: Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Authors: Sohee Yang, Nora Kassner, Elena Gribovskaya, Sebastian Riedel, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16679
Pdf URL: https://arxiv.org/pdf/2411.16679
Copy Paste: [[2411.16679]] Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?(https://arxiv.org/abs/2411.16679)
Keywords: large language model
Abstract: We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like "In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of". One major challenge in evaluating this ability is that LLMs may have developed shortcuts by encounters of the head entity "Scarlett Johansson" and the answer entity "United States" in the same training sequences or merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities co-appear in pretraining corpora. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities without exploiting shortcuts, but only for certain types of queries. For queries requiring latent recall of countries as the intermediate answer, the best models achieve 80% latent composability, but this drops to just 5% for the recall of years. Comparisons with Chain-of-Thought composability highlight a significant gap between the ability of models to reason latently versus explicitly. Analysis reveals that latent representations of the intermediate answer are constructed more often in queries with higher latent composability, and shows the emergence of latent multi-hop reasoning during pretraining.

Title: Quark: Real-time, High-resolution, and General Neural View Synthesis

Authors: John Flynn, Michael Broxton, Lukas Murmann, Lucy Chai, Matthew DuVall, Clément Godard, Kathryn Heal, Srinivas Kaza, Stephen Lombardi, Xuan Luo, Supreeth Achar, Kira Prabhu, Tiancheng Sun, Lynn Tsai, Ryan Overbeck
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16680
Pdf URL: https://arxiv.org/pdf/2411.16680
Copy Paste: [[2411.16680]] Quark: Real-time, High-resolution, and General Neural View Synthesis(https://arxiv.org/abs/2411.16680)
Keywords: transformer
Abstract: We present a novel neural algorithm for performing high-quality, high-resolution, real-time novel view synthesis. From a sparse set of input RGB images or videos streams, our network both reconstructs the 3D scene and renders novel views at 1080p resolution at 30fps on an NVIDIA A100. Our feed-forward network generalizes across a wide variety of datasets and scenes and produces state-of-the-art quality for a real-time method. Our quality approaches, and in some cases surpasses, the quality of some of the top offline methods. In order to achieve these results we use a novel combination of several key concepts, and tie them together into a cohesive and effective algorithm. We build on previous works that represent the scene using semi-transparent layers and use an iterative learned render-and-refine approach to improve those layers. Instead of flat layers, our method reconstructs layered depth maps (LDMs) that efficiently represent scenes with complex depth and occlusions. The iterative update steps are embedded in a multi-scale, UNet-style architecture to perform as much compute as possible at reduced resolution. Within each update step, to better aggregate the information from multiple input views, we use a specialized Transformer-based network component. This allows the majority of the per-input image processing to be performed in the input image space, as opposed to layer space, further increasing efficiency. Finally, due to the real-time nature of our reconstruction and rendering, we dynamically create and discard the internal 3D geometry for each frame, generating the LDM for each view. Taken together, this produces a novel and effective algorithm for view synthesis. Through extensive evaluation, we demonstrate that we achieve state-of-the-art quality at real-time rates. Project page: this https URL

Title: Factorized Visual Tokenization and Generation

Authors: Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16681
Pdf URL: https://arxiv.org/pdf/2411.16681
Copy Paste: [[2411.16681]] Factorized Visual Tokenization and Generation(https://arxiv.org/abs/2411.16681)
Keywords: transformer
Abstract: Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. this https URL

Title: Generative Omnimatte: Learning to Decompose Video into Layers

Authors: Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia-Bin Huang, Tali Dekel, Forrester Cole
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16683
Pdf URL: https://arxiv.org/pdf/2411.16683
Copy Paste: [[2411.16683]] Generative Omnimatte: Learning to Decompose Video into Layers(https://arxiv.org/abs/2411.16683)
Keywords: diffusion, generative
Abstract: Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.