2024-11-27

Title: Leveraging Conversational Generative AI for Anomaly Detection in Digital Substations

Authors: Aydin Zaboli, Seong Lok Choi, Junho Hong
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2411.16692
Pdf URL: https://arxiv.org/pdf/2411.16692
Copy Paste: [[2411.16692]] Leveraging Conversational Generative AI for Anomaly Detection in Digital Substations(https://arxiv.org/abs/2411.16692)
Keywords: security, generative
Abstract: This study addresses critical challenges of cybersecurity in digital substations by proposing an innovative task-oriented dialogue (ToD) system for anomaly detection (AD) in multicast messages, specifically, generic object oriented substation event (GOOSE) and sampled value (SV) datasets. Leveraging generative artificial intelligence (GenAI) technology, the proposed framework demonstrates superior error reduction, scalability, and adaptability compared with traditional human-in-the-loop (HITL) processes. Notably, this methodology offers significant advantages over machine learning (ML) techniques in terms of efficiency and implementation speed when confronting novel and/or unknown cyber threats, while also maintaining model complexity and precision. The research employs advanced performance metrics to conduct a comparative assessment between the proposed AD and HITL-based AD frameworks, utilizing a hardware-in-the-loop (HIL) testbed for generating and extracting features of IEC61850 communication messages. This approach presents a promising solution for enhancing the reliability of power system operations in the face of evolving cybersecurity challenges.

Title: Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework

Authors: Mengshuo Jia, Zeyu Cui, Gabriela Hug
Subjects: cs.CL, cs.AI, cs.MA, eess.SY
Abstract URL: https://arxiv.org/abs/2411.16707
Pdf URL: https://arxiv.org/pdf/2411.16707
Copy Paste: [[2411.16707]] Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework(https://arxiv.org/abs/2411.16707)
Keywords: large language model
Abstract: The integration of experimental technologies with large language models (LLMs) is transforming scientific research, positioning AI as a versatile research assistant rather than a mere problem-solving tool. In the field of power systems, however, managing simulations -- one of the essential experimental technologies -- remains a challenge for LLMs due to their limited domain-specific knowledge, restricted reasoning capabilities, and imprecise handling of simulation parameters. To address these limitations, we propose a feedback-driven, multi-agent framework that incorporates three proposed modules: an enhanced retrieval-augmented generation (RAG) module, an improved reasoning module, and a dynamic environmental acting module with an error-feedback mechanism. Validated on 69 diverse tasks from Daline and MATPOWER, this framework achieves success rates of 93.13% and 96.85%, respectively, significantly outperforming the latest LLMs (ChatGPT 4o and o1-preview), which achieved a 27.77% success rate on standard simulation tasks and 0% on complex tasks. Additionally, our framework also supports rapid, cost-effective task execution, completing each simulation in approximately 30 seconds at an average cost of 0.014 USD for tokens. Overall, this adaptable framework lays a foundation for developing intelligent LLM-based assistants for human researchers, facilitating power system research and beyond.

Title: SafeLight: Enhancing Security in Optical Convolutional Neural Network Accelerators

Authors: Salma Afifi, Ishan Thakkar, Sudeep Pasricha
Subjects: cs.CR, cs.AR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16712
Pdf URL: https://arxiv.org/pdf/2411.16712
Copy Paste: [[2411.16712]] SafeLight: Enhancing Security in Optical Convolutional Neural Network Accelerators(https://arxiv.org/abs/2411.16712)
Keywords: security, attack, robust
Abstract: The rapid proliferation of deep learning has revolutionized computing hardware, driving innovations to improve computationally expensive multiply-and-accumulate operations in deep neural networks. Among these innovations are integrated silicon-photonic systems that have emerged as energy-efficient platforms capable of achieving light speed computation and communication, positioning optical neural network (ONN) platforms as a transformative technology for accelerating deep learning models such as convolutional neural networks (CNNs). However, the increasing complexity of optical hardware introduces new vulnerabilities, notably the risk of hardware trojan (HT) attacks. Despite the growing interest in ONN platforms, little attention has been given to how HT-induced threats can compromise performance and security. This paper presents an in-depth analysis of the impact of such attacks on the performance of CNN models accelerated by ONN accelerators. Specifically, we show how HTs can compromise microring resonators (MRs) in a state-of-the-art non-coherent ONN accelerator and reduce classification accuracy across CNN models by up to 7.49% to 80.46% by just targeting 10% of MRs. We then propose techniques to enhance ONN accelerator robustness against these attacks and show how the best techniques can effectively recover the accuracy drops.

Title: Conditional Text-to-Image Generation with Reference Guidance

Authors: Taewook Kim, Ze Wang, Zhengyuan Yang, Jiang Wang, Lijuan Wang, Zicheng Liu, Qiang Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16713
Pdf URL: https://arxiv.org/pdf/2411.16713
Copy Paste: [[2411.16713]] Conditional Text-to-Image Generation with Reference Guidance(https://arxiv.org/abs/2411.16713)
Keywords: diffusion
Abstract: Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multi-lingual scene-text generation, and logo-image generation. Our expert plugins demonstrate superior results than the existing methods on all tasks, each containing only 28.55M trainable parameters.

Title: TPIE: Topology-Preserved Image Editing With Text Instructions

Authors: Nivetha Jayakumar, Srivardhan Reddy Gadila, Tonmoy Hossain, Yangfeng Ji, Miaomiao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16714
Pdf URL: https://arxiv.org/pdf/2411.16714
Copy Paste: [[2411.16714]] TPIE: Topology-Preserved Image Editing With Text Instructions(https://arxiv.org/abs/2411.16714)
Keywords: diffusion, generative
Abstract: Preserving topological structures is important in real-world applications, particularly in sensitive domains such as healthcare and medicine, where the correctness of human anatomy is critical. However, most existing image editing models focus on manipulating intensity and texture features, often overlooking object geometry within images. To address this issue, this paper introduces a novel method, Topology-Preserved Image Editing with text instructions (TPIE), that for the first time ensures the topology and geometry remaining intact in edited images through text-guided generative diffusion models. More specifically, our method treats newly generated samples as deformable variations of a given input template, allowing for controllable and structure-preserving edits. Our proposed TPIE framework consists of two key modules: (i) an autoencoder-based registration network that learns latent representations of object transformations, parameterized by velocity fields, from pairwise training images; and (ii) a novel latent conditional geometric diffusion (LCDG) model efficiently capturing the data distribution of learned transformation features conditioned on custom-defined text instructions. We validate TPIE on a diverse set of 2D and 3D images and compare them with state-of-the-art image editing approaches. Experimental results show that our method outperforms other baselines in generating more realistic images with well-preserved topology. Our code will be made publicly available on Github.

Title: Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients

Authors: Xiaoling Hu, Oula Puonti, Juan Eugenio Iglesias, Bruce Fischl, Yael Balbastre
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16719
Pdf URL: https://arxiv.org/pdf/2411.16719
Copy Paste: [[2411.16719]] Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients(https://arxiv.org/abs/2411.16719)
Keywords: segmentation
Abstract: Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. Randomization allows networks to see a virtually infinite range of intensities and artifacts during training, thereby minimizing overfitting to appearance and maximizing generalization to unseen data. While powerful, this approach relies on the accurate tuning of a large set of hyper-parameters governing the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. Unlike methods that impose constraints to align synthetic data with real data (e.g., contrastive or adversarial techniques), which risk misaligning the image and its label map, we tune an augmentation engine such that a segmentation network trained on synthetic data has optimal accuracy when applied to real data. This approach allows the training procedure to benefit from real labeled examples, without ever using these real examples to train the segmentation network, which avoids biasing the network towards the properties of the training set. Specifically, we develop both parametric and nonparametric strategies to augment the synthetic images, enhancing the segmentation network's performance. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of this learning strategy. Code is available at: this https URL.

Title: Importance-based Token Merging for Diffusion Models

Authors: Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16720
Pdf URL: https://arxiv.org/pdf/2411.16720
Copy Paste: [[2411.16720]] Importance-based Token Merging for Diffusion Models(https://arxiv.org/abs/2411.16720)
Keywords: diffusion
Abstract: Diffusion models excel at high-quality image and video generation. However, a major drawback is their high latency. A simple yet powerful way to speed them up is by merging similar tokens for faster computation, though this can result in some quality loss. In this paper, we demonstrate that preserving important tokens during merging significantly improves sample quality. Notably, the importance of each token can be reliably determined using the classifier-free guidance magnitude, as this measure is strongly correlated with the conditioning input and corresponds to output fidelity. Since classifier-free guidance incurs no additional computational cost or requires extra modules, our method can be easily integrated into most diffusion-based frameworks. Experiments show that our approach significantly outperforms the baseline across various applications, including text-to-image synthesis, multi-view image generation, and video generation.

Title: Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Authors: Han Wang, Gang Wang, Huan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16721
Pdf URL: https://arxiv.org/pdf/2411.16721
Copy Paste: [[2411.16721]] Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks(https://arxiv.org/abs/2411.16721)
Keywords: defense, attack
Abstract: Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful response and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform the adaptive steering method that involves the projection between the steering vectors and calibrated activation, resulting in little performance drops on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against both unseen attacks at design time (i.e., structured-based attacks) and adversarial images from diverse distributions.

Title: $\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models

Authors: Dahye Kim, Xavier Thomas, Deepti Ghadiyaram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16725
Pdf URL: https://arxiv.org/pdf/2411.16725
Copy Paste: [[2411.16725]] $\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models(https://arxiv.org/abs/2411.16725)
Keywords: interpretability, diffusion
Abstract: We study $\textit{how}$ rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using light-weight classifiers on off-the-shelf diffusion models' features. On $4$ datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impacts visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening interpretability of black-box diffusion models. Code and visualizations available at: this https URL

Title: EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Authors: Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, Qingfeng Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16726
Pdf URL: https://arxiv.org/pdf/2411.16726
Copy Paste: [[2411.16726]] EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion(https://arxiv.org/abs/2411.16726)
Keywords: diffusion
Abstract: Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.

Title: "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

Authors: Libo Wang
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16730
Pdf URL: https://arxiv.org/pdf/2411.16730
Copy Paste: [[2411.16730]] "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks(https://arxiv.org/abs/2411.16730)
Keywords: attack, large language model
Abstract: As the application of large language models continues to expand in various fields, it poses higher challenges to the effectiveness of identifying harmful content generation and guardrail mechanisms. This research aims to evaluate the effectiveness of guardrails in the face of multi-step jailbreak prompt-generated verbal attacks, through black-box testing of seemingly ethical prompt simulations. The experimental subjects were selected GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5 and Claude 3.5 Sonnet. The researcher used the same multi-step prompt to simulate moral attacks by designing a scenario of "enterprise middle managers competing for promotion" and observed the model's response at each step. During the experiment, the guardrails of the above model were all bypassed in this experiment and the content of verbal attacks was generated. The data results show that Claude 3.5 Sonnet performs better than other models in terms of its tendency to identify jailbreak prompts. The researcher hopes to use this to remind developers and future research that guardrails not only inappropriately play the role of content filters, but should also have a preventive function. In order to ensure the objectivity and generalizability of the experiment, the researcher has uploaded the experimental process, black box test code, and enhanced guardrail code to GitHub to promote cooperation in the development community: this https URL.

Title: Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge

Authors: Joohyun Lee, Minji Roh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16732
Pdf URL: https://arxiv.org/pdf/2411.16732
Copy Paste: [[2411.16732]] Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge(https://arxiv.org/abs/2411.16732)
Keywords: large language model
Abstract: As Large Language Models (LLMs) increasingly address domain-specific problems, their application in the financial sector has expanded rapidly. Tasks that are both highly valuable and time-consuming, such as analyzing financial statements, disclosures, and related documents, are now being effectively tackled using LLMs. This paper details the development of a high-performance, finance-specific Retrieval-Augmented Generation (RAG) system for the ACM-ICAIF '24 FinanceRAG competition. We optimized performance through ablation studies on query expansion and corpus refinement during the pre-retrieval phase. To enhance retrieval accuracy, we employed multiple reranker models. Notably, we introduced an efficient method for managing long context sizes during the generation phase, significantly improving response quality without sacrificing performance. We ultimately achieve 2nd place in the FinanceRAG Challenge. Our key contributions include: (1) pre-retrieval ablation analysis, (2) an enhanced retrieval algorithm, and (3) a novel approach for long-context management. This work demonstrates the potential of LLMs in effectively processing and analyzing complex financial data to generate accurate and valuable insights. The source code and further details are available at this https URL.

Title: Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method

Authors: Pan Yin, Kaiyu Li, Xiangyong Cao, Jing Yao, Lei Liu, Xueru Bai, Feng Zhou, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16733
Pdf URL: https://arxiv.org/pdf/2411.16733
Copy Paste: [[2411.16733]] Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method(https://arxiv.org/abs/2411.16733)
Keywords: extraction
Abstract: Recently, road graph extraction has garnered increasing attention due to its crucial role in autonomous driving, navigation, etc. However, accurately and efficiently extracting road graphs remains a persistent challenge, primarily due to the severe scarcity of labeled data. To address this limitation, we collect a global-scale satellite road graph extraction dataset, i.e. Global-Scale dataset. Specifically, the Global-Scale dataset is $\sim20 \times$ larger than the largest existing public road extraction dataset and spans over 13,800 $km^2$ globally. Additionally, we develop a novel road graph extraction model, i.e. SAM-Road++, which adopts a node-guided resampling method to alleviate the mismatch issue between training and inference in SAM-Road, a pioneering state-of-the-art road graph extraction model. Furthermore, we propose a simple yet effective ``extended-line'' strategy in SAM-Road++ to mitigate the occlusion issue on the road. Extensive experiments demonstrate the validity of the collected Global-Scale dataset and the proposed SAM-Road++ method, particularly highlighting its superior predictive power in unseen regions. The dataset and code are available at \url{this https URL}.

Title: ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Authors: Haochen Zhao, Xiangru Tang, Ziran Yang, Xiao Han, Xuanzhi Feng, Yueqing Fan, Senhao Cheng, Di Jin, Yilun Zhao, Arman Cohan, Mark Gerstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16736
Pdf URL: https://arxiv.org/pdf/2411.16736
Copy Paste: [[2411.16736]] ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain(https://arxiv.org/abs/2411.16736)
Keywords: robust, large language model
Abstract: The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at this https URL. Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models.

Title: Federated Learning in Chemical Engineering: A Tutorial on a Framework for Privacy-Preserving Collaboration Across Distributed Data Sources

Authors: Siddhant Dutta, Iago Leal de Freitas, Pedro Maciel Xavier, Claudio Miceli de Farias, David Esteban Bernal Neira
Subjects: cs.LG, cs.DC, cs.NE
Abstract URL: https://arxiv.org/abs/2411.16737
Pdf URL: https://arxiv.org/pdf/2411.16737
Copy Paste: [[2411.16737]] Federated Learning in Chemical Engineering: A Tutorial on a Framework for Privacy-Preserving Collaboration Across Distributed Data Sources(https://arxiv.org/abs/2411.16737)
Keywords: privacy, protect, federate
Abstract: Federated Learning (FL) is a decentralized machine learning approach that has gained attention for its potential to enable collaborative model training across clients while protecting data privacy, making it an attractive solution for the chemical industry. This work aims to provide the chemical engineering community with an accessible introduction to the discipline. Supported by a hands-on tutorial and a comprehensive collection of examples, it explores the application of FL in tasks such as manufacturing optimization, multimodal data integration, and drug discovery while addressing the unique challenges of protecting proprietary information and managing distributed datasets. The tutorial was built using key frameworks such as $\texttt{Flower}$ and $\texttt{TensorFlow Federated}$ and was designed to provide chemical engineers with the right tools to adopt FL in their specific needs. We compare the performance of FL against centralized learning across three different datasets relevant to chemical engineering applications, demonstrating that FL will often maintain or improve classification performance, particularly for complex and heterogeneous data. We conclude with an outlook on the open challenges in federated learning to be tackled and current approaches designed to remediate and improve this framework.

Title: Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16738
Pdf URL: https://arxiv.org/pdf/2411.16738
Copy Paste: [[2411.16738]] Classifier-Free Guidance inside the Attraction Basin May Cause Memorization(https://arxiv.org/abs/2411.16738)
Keywords: privacy, diffusion
Abstract: Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs from which classifier-free guidance is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, \emph{opposite guidance}, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.

Title: LoBAM: LoRA-Based Backdoor Attack on Model Merging

Authors: Ming Yin, Jingyang Zhang, Jingwei Sun, Minghong Fang, Hai Li, Yiran Chen
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16746
Pdf URL: https://arxiv.org/pdf/2411.16746
Copy Paste: [[2411.16746]] LoBAM: LoRA-Based Backdoor Attack on Model Merging(https://arxiv.org/abs/2411.16746)
Keywords: attack, steal
Abstract: Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains. This scheme, in the meantime, may open up backdoor attack opportunities where one single malicious model can jeopardize the integrity of the merged model. Existing works try to demonstrate the risk of such attacks by assuming substantial computational resources, focusing on cases where the attacker can fully fine-tune the pre-trained model. Such an assumption, however, may not be feasible given the increasing size of machine learning models. In practice where resources are limited and the attacker can only employ techniques like Low-Rank Adaptation (LoRA) to produce the malicious model, it remains unclear whether the attack can still work and pose threats. In this work, we first identify that the attack efficacy is significantly diminished when using LoRA for fine-tuning. Then, we propose LoBAM, a method that yields high attack success rate with minimal training resources. The key idea of LoBAM is to amplify the malicious weights in an intelligent way that effectively enhances the attack efficacy. We demonstrate that our design can lead to improved attack success rate through both theoretical proof and extensive empirical experiments across various model merging scenarios. Moreover, we show that our method has strong stealthiness and is difficult to detect.

Title: FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction

Authors: Junwei You, Rui Gan, Weizhe Tang, Zilin Huang, Jiaxi Liu, Zhuoyu Jiang, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu, Sikai Chen, Bin Ran
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2411.16747
Pdf URL: https://arxiv.org/pdf/2411.16747
Copy Paste: [[2411.16747]] FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction(https://arxiv.org/abs/2411.16747)
Keywords: robust, diffusion, transformer, generative
Abstract: Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS). Although deep learning-based approaches - especially those utilizing transformer-based and generative models - have markedly improved prediction accuracy by capturing complex, non-linear patterns in vehicle dynamics and traffic interactions, they frequently overlook detailed car-following behaviors and the inter-vehicle interactions critical for real-world driving applications, particularly in fully autonomous or mixed traffic scenarios. To address the issue, this study introduces a scaled noise conditional diffusion model for car-following trajectory prediction, which integrates detailed inter-vehicular interactions and car-following dynamics into a generative framework, improving both the accuracy and plausibility of predicted trajectories. The model utilizes a novel pipeline to capture historical vehicle dynamics by scaling noise with encoded historical features within the diffusion process. Particularly, it employs a cross-attention-based transformer architecture to model intricate inter-vehicle dependencies, effectively guiding the denoising process and enhancing prediction accuracy. Experimental results on diverse real-world driving scenarios demonstrate the state-of-the-art performance and robustness of the proposed method.

Title: LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Authors: Haojie Zhang, Zhihao Liang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li, Jianhua Tao, Yaling Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16748
Pdf URL: https://arxiv.org/pdf/2411.16748
Copy Paste: [[2411.16748]] LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis(https://arxiv.org/abs/2411.16748)
Keywords: diffusion, transformer
Abstract: Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.

Title: AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

Authors: You Li, Fan Ma, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16749
Pdf URL: https://arxiv.org/pdf/2411.16749
Copy Paste: [[2411.16749]] AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks(https://arxiv.org/abs/2411.16749)
Keywords: diffusion, large language model, segmentation
Abstract: Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, the synthetic framework is usually designed with meticulous human effort for each task due to various requirements on image layout, content, and annotation formats, restricting the application of synthetic data on more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality synthetic images that are controllable and based on the generated layouts. In addition, user specific reference images, and style images can be incorporated into the generation to task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.

Title: PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation

Authors: Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2411.16750
Pdf URL: https://arxiv.org/pdf/2411.16750
Copy Paste: [[2411.16750]] PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation(https://arxiv.org/abs/2411.16750)
Keywords: robust, diffusion
Abstract: This paper explores the potential of leveraging language priors learned by text-to-image diffusion models to address ambiguity and visual nuisance in monocular depth estimation. Particularly, traditional monocular depth estimation suffers from inherent ambiguity due to the absence of stereo or multi-view depth cues, and nuisance due to lack of robustness of vision. We argue that language prior in diffusion models can enhance monocular depth estimation by leveraging the geometric prior aligned with the language description, which is learned during text-to-image pre-training. To generate images that reflect the text properly, the model must comprehend the size and shape of specified objects, their spatial relationship, and the scale of the scene. Thus, we propose PriorDiffusion, using a pre-trained text-to-image diffusion model that takes both image and text description that aligned with the scene to infer affine-invariant depth through a denoising process. We also show that language priors can guide the model's attention to specific regions and help it perceive the 3D scene in alignment with user intent. Simultaneously, it acts as a constraint to accelerate the convergence of the diffusion trajectory, since learning 3D properties from a condensed, low-dimensional language feature is more efficient compared with learning from a redundant, high-dimensional image feature. By training on HyperSim and Virtual KITTI, we achieve state-of-the-art zero-shot performance and a faster convergence speed, compared with other diffusion-based depth estimators, across NYUv2, KITTI, ETH3D, and ScanNet.

Title: Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

Authors: You Li, Fan Ma, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16752
Pdf URL: https://arxiv.org/pdf/2411.16752
Copy Paste: [[2411.16752]] Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy(https://arxiv.org/abs/2411.16752)
Keywords: robust, large language model
Abstract: The Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions. Current methods focus on projecting the query image into the text feature space, subsequently combining them with features of query texts for retrieval. However, retrieving images only with the text features cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description, enhancing query representation in the retrieval process. We first leverage the large language model's generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performances. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@K of 70.07 at K=10. Additionally, we achieved an improvement in Recall@10 on the FashionIQ dataset, rising from 45.11 to 45.74, and improved the baseline performance in CIRCO with a mAPK@10 score, increasing from 32.24 to 34.26.

Title: Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)

Authors: Nasrin Imanpour, Shashwat Bajpai, Subhankar Ghosh, Sainath Reddy Sankepally, Abhilekh Borah, Hasnat Md Abdullah, Nishoak Kosaraju, Shreyas Dixit, Ashhar Aziz, Shwetangshu Biswas, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16754
Pdf URL: https://arxiv.org/pdf/2411.16754
Copy Paste: [[2411.16754]] Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)(https://arxiv.org/abs/2411.16754)
Keywords: diffusion, generative
Abstract: The proliferation of AI techniques for image generation, coupled with their increasing accessibility, has raised significant concerns about the potential misuse of these images to spread misinformation. Recent AI-generated image detection (AGID) methods include CNNDetection, NPR, DM Image Detection, Fake Image Detection, DIRE, LASTED, GAN Image Detection, AIDE, SSP, DRCT, RINE, OCC-CLIP, De-Fake, and Deep Fake Detection. However, we argue that the current state-of-the-art AGID techniques are inadequate for effectively detecting contemporary AI-generated images and advocate for a comprehensive reevaluation of these methods. We introduce the Visual Counter Turing Test (VCT^2), a benchmark comprising ~130K images generated by contemporary text-to-image models (Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and Midjourney 6). VCT^2 includes two sets of prompts sourced from tweets by the New York Times Twitter account and captions from the MS COCO dataset. We also evaluate the performance of the aforementioned AGID techniques on the VCT$^2$ benchmark, highlighting their ineffectiveness in detecting AI-generated images. As image-generative AI models continue to evolve, the need for a quantifiable framework to evaluate these models becomes increasingly critical. To meet this need, we propose the Visual AI Index (V_AI), which assesses generated images from various visual perspectives, including texture complexity and object coherence, setting a new standard for evaluating image-generative AI models. To foster research in this domain, we make our this https URL and this https URL datasets publicly available.

Title: LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions

Authors: Faridoun Mehri (1), Mahdieh Soleymani Baghshah (1), Mohammad Taher Pilehvar (2) ((1) Sharif University of Technology, (2) Cardiff University)
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16760
Pdf URL: https://arxiv.org/pdf/2411.16760
Copy Paste: [[2411.16760]] LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions(https://arxiv.org/abs/2411.16760)
Keywords: transformer, segmentation
Abstract: Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad -- a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods -- including Transformer-specific approaches -- across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models -- two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available at this https URL.

Title: Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning

Authors: Ji Hyeok Jung, Eun Tae Kim, Seo Yeon Kim, Joo Ho Lee, Bumsoo Kim, Buru Chang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16761
Pdf URL: https://arxiv.org/pdf/2411.16761
Copy Paste: [[2411.16761]] Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning(https://arxiv.org/abs/2411.16761)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications. However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs' orientation understanding with the user's perspective, based on a consistent annotation standard derived from the user's egocentric viewpoint. We first generate egocentric instruction data that leverages MLLMs' ability to recognize object details and applies prior knowledge for orientation understanding. Using this data, we perform instruction tuning to enhance the model's capability for accurate orientation interpretation. In addition, we introduce EgoOrientBench, a benchmark that evaluates MLLMs' orientation understanding across three tasks using images collected from diverse domains. Experimental results on this benchmark show that egocentric instruction tuning significantly improves orientation understanding without compromising overall MLLM performance. The instruction data and benchmark dataset are available on our project page at this https URL.

Title: Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference

Authors: Depeng Chen, Hao Chen, Hulin Jin, Jie Cui, Hong Zhong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16763
Pdf URL: https://arxiv.org/pdf/2411.16763
Copy Paste: [[2411.16763]] Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference(https://arxiv.org/abs/2411.16763)
Keywords: privacy, protect, attack, robust, steal, membership infer
Abstract: Membership inference attacks (MIAs) are critical tools for assessing privacy risks and ensuring compliance with regulations like the General Data Protection Regulation (GDPR). However, their potential for auditing unauthorized use of data remains under explored. To bridge this gap, we propose a novel clean-label backdoor-based approach for MIAs, designed specifically for robust and stealthy data auditing. Unlike conventional methods that rely on detectable poisoned samples with altered labels, our approach retains natural labels, enhancing stealthiness even at low poisoning rates. Our approach employs an optimal trigger generated by a shadow model that mimics the target model's behavior. This design minimizes the feature-space distance between triggered samples and the source class while preserving the original data labels. The result is a powerful and undetectable auditing mechanism that overcomes limitations of existing approaches, such as label inconsistencies and visual artifacts in poisoned samples. The proposed method enables robust data auditing through black-box access, achieving high attack success rates across diverse datasets and model architectures. Additionally, it addresses challenges related to trigger stealthiness and poisoning durability, establishing itself as a practical and effective solution for data auditing. Comprehensive experiments validate the efficacy and generalizability of our approach, outperforming several baseline methods in both stealth and attack success metrics.

Title: SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

Authors: Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, Alexander H. Liu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16765
Pdf URL: https://arxiv.org/pdf/2411.16765
Copy Paste: [[2411.16765]] SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction(https://arxiv.org/abs/2411.16765)
Keywords: transformer
Abstract: Sign language processing has traditionally relied on task-specific models,limiting the potential for transfer learning across tasks. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised transformer encoder that learns strong representations from approximately 1,000 hours of American Sign Language (ASL) video content. Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction for multi-stream visual sign language input, learning to predict multiple targets for corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple benchmarks. On sign language translation, it outperforms prior methods trained on publicly available data on the How2Sign (+0.7 BLEU), OpenASL (+10.0 BLEU), and FLEURS-ASL (+0.3 BLEU) benchmarks. Similarly for isolated sign language recognition, SHuBERT's accuracy surpasses that of specialized models on ASL-Citizen (+5\%) and SEM-LEX (+20.6\%), while coming close to them on WLASL2000 (-3\%). Ablation studies confirm the contribution of each component of the approach.

Title: Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background

Authors: Youngjae Cho, Gwangyeol Kim, Sirojbek Safarov, Seongdeok Bang, Jaewoo Park
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16767
Pdf URL: https://arxiv.org/pdf/2411.16767
Copy Paste: [[2411.16767]] Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background(https://arxiv.org/abs/2411.16767)
Keywords: generative
Abstract: In anomaly detection, the scarcity of anomalous data compared to normal data poses a challenge in effectively utilizing deep neural network representations to identify anomalous features. From a data-centric perspective, generative models can solve this data imbalance issue by synthesizing anomaly datasets. Although previous research tried to enhance the controllability and quality of generating defects, they do not consider the relation between background and defect. Since the defect depends on the object's background (i.e., the normal part of an object), training only the defect area cannot utilize the background information, and even generation can be biased depending on the mask information. In addition, controlling logical anomalies should consider the dependency between background and defect areas (e.g., orange colored defect on a orange juice bottle). In this paper, our paper proposes modeling a relationship between the background and defect, where background affects denoising defects; however, the reverse is not. We introduce the regularizing term to disentangle denoising background from defects. From the disentanglement loss, we rethink defect generation with DDIM Inversion, where we generate the defect on the target normal image. Additionally, we theoretically prove that our methodology can generate a defect on the target normal image with an invariant background. We demonstrate our synthetic data is realistic and effective in several experiments.

Title: In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models

Authors: Zhi-Yi Chin, Kuan-Chen Mu, Mario Fritz, Pin-Yu Chen, Wei-Chen Chiu
Subjects: cs.LG, cs.CL, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16769
Pdf URL: https://arxiv.org/pdf/2411.16769
Copy Paste: [[2411.16769]] In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models(https://arxiv.org/abs/2411.16769)
Keywords: attack, robust, diffusion, large language model
Abstract: Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. While various safety mechanisms have been developed, the field lacks systematic tools for evaluating their effectiveness against real-world misuse scenarios. In this work, we propose ICER, a novel red-teaming framework that leverages Large Language Models (LLMs) and a bandit optimization-based algorithm to generate interpretable and semantic meaningful problematic prompts by learning from past successful red-teaming attempts. Our ICER efficiently probes safety mechanisms across different T2I models without requiring internal access or additional training, making it broadly applicable to deployed systems. Through extensive experiments, we demonstrate that ICER significantly outperforms existing prompt attack methods in identifying model vulnerabilities while maintaining high semantic similarity with intended content. By uncovering that successful jailbreaking instances can systematically facilitate the discovery of new vulnerabilities, our work provides crucial insights for developing more robust safety mechanisms in T2I systems.

Title: VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

Authors: Wey Yeh Choong, Yangyang Guo, Mohan Kankanhalli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16771
Pdf URL: https://arxiv.org/pdf/2411.16771
Copy Paste: [[2411.16771]] VidHal: Benchmarking Temporal Hallucinations in Vision LLMs(https://arxiv.org/abs/2411.16771)
Keywords: large language model
Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucination. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs regarding hallucination generation. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) extensive development of advanced VLLMs to alleviate this problem.

Title: MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Authors: Feifei Shao, Ping Liu, Zhao Wang, Yawei Luo, Hongwei Wang, Jun Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16773
Pdf URL: https://arxiv.org/pdf/2411.16773
Copy Paste: [[2411.16773]] MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing(https://arxiv.org/abs/2411.16773)
Keywords: segmentation
Abstract: Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable $4.1\%$ improvement in the part segmentation task and delivers consistent gains across various PCP applications.

Title: SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

Authors: Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, Sandeep Chinchali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16776
Pdf URL: https://arxiv.org/pdf/2411.16776
Copy Paste: [[2411.16776]] SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models(https://arxiv.org/abs/2411.16776)
Keywords: robust, diffusion, segmentation
Abstract: In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as "Clear and Day" weather, leading to decreased performance in under-represented conditions like "Rainy and Night". To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet-a DM that guides data generation conditioned on semantic maps-along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model.

Title: GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Authors: Bo Liu, Ke Zou, Liming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu, Huazhu Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16778
Pdf URL: https://arxiv.org/pdf/2411.16778
Copy Paste: [[2411.16778]] GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis(https://arxiv.org/abs/2411.16778)
Keywords: explainability
Abstract: Medical Visual Question Answering (VQA) is an essential technology that integrates computer vision and natural language processing to automatically respond to clinical inquiries about medical images. However, current medical VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, which impedes their ability to satisfy the comprehension needs of patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements encountered in clinical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) A multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) Four distinct question types, open-ended, closed-ended, single-choice, and multiple-choice, that better reflect diverse clinical needs. We evaluated 10 representative large vision language models on GEMeX and found that they underperformed, highlighting the dataset's complexity. However, after fine-tuning a baseline model using the training set, we observed a significant performance improvement, demonstrating the dataset's effectiveness. The project is available at this http URL.

Title: NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model

Authors: Jinpeng Liu, Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Ying Shan, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16779
Pdf URL: https://arxiv.org/pdf/2411.16779
Copy Paste: [[2411.16779]] NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model(https://arxiv.org/abs/2411.16779)
Keywords: diffusion, transformer, generative
Abstract: We introduce NovelGS, a diffusion model for Gaussian Splatting (GS) given sparse-view images. Recent works leverage feed-forward networks to generate pixel-aligned Gaussians, which could be fast rendered. Unfortunately, the method was unable to produce satisfactory results for areas not covered by the input images due to the formulation of these methods. In contrast, we leverage the novel view denoising through a transformer-based network to generate 3D Gaussians. Specifically, by incorporating both conditional views and noisy target views, the network predicts pixel-aligned Gaussians for each view. During training, the rendered target and some additional views of the Gaussians are supervised. During inference, the target views are iteratively rendered and denoised from pure noise. Our approach demonstrates state-of-the-art performance in addressing the multi-view image reconstruction challenge. Due to generative modeling of unseen regions, NovelGS effectively reconstructs 3D objects with consistent and sharp textures. Experimental results on publicly available datasets indicate that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively. We also demonstrate the potential of NovelGS in generative tasks, such as text-to-3D and image-to-3D, by integrating it with existing multiview diffusion models. We will make the code publicly accessible.

Title: UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Authors: Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16781
Pdf URL: https://arxiv.org/pdf/2411.16781
Copy Paste: [[2411.16781]] UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing(https://arxiv.org/abs/2411.16781)
Keywords: large language model
Abstract: Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.

Title: Scaling Laws for Black box Adversarial Attacks

Authors: Chuan Liu, Huanran Chen, Yichi Zhang, Yinpeng Dong, Jun Zhu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16782
Pdf URL: https://arxiv.org/pdf/2411.16782
Copy Paste: [[2411.16782]] Scaling Laws for Black box Adversarial Attacks(https://arxiv.org/abs/2411.16782)
Keywords: attack, interpretability, large language model
Abstract: A longstanding problem of deep learning models is their vulnerability to adversarial examples, which are often generated by applying imperceptible perturbations to natural examples. Adversarial examples exhibit cross-model transferability, enabling to attack black-box models with limited information about their architectures and parameters. Model ensembling is an effective strategy to improve the transferability by attacking multiple surrogate models simultaneously. However, as prior studies usually adopt few models in the ensemble, there remains an open question of whether scaling the number of models can further improve black-box attacks. Inspired by the findings in large foundation models, we investigate the scaling laws of black-box adversarial attacks in this work. By analyzing the relationship between the number of surrogate models and transferability of adversarial examples, we conclude with clear scaling laws, emphasizing the potential of using more surrogate models to enhance adversarial transferability. Extensive experiments verify the claims on standard image classifiers, multimodal large language models, and even proprietary models like GPT-4o, demonstrating consistent scaling effects and impressive attack success rates with more surrogate models. Further studies by visualization indicate that scaled attacks bring better interpretability in semantics, indicating that the common features of models are captured.

Title: CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis

Authors: Aravindan Sundaram, Ujjayan Pal, Abhimanyu Chauhan, Aishwarya Agarwal, Srikrishna Karanam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16783
Pdf URL: https://arxiv.org/pdf/2411.16783
Copy Paste: [[2411.16783]] CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis(https://arxiv.org/abs/2411.16783)
Keywords: diffusion
Abstract: Despite recent advancements in text-to-image models, achieving semantically accurate images in text-to-image diffusion models is a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated segment in the self-attention map despite despite having a high-response cross-attention, and (b) attention interference, where the generated image has mixed-up properties of multiple subjects because of a conflicting overlap between cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art.

Title: TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction

Authors: Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16788
Pdf URL: https://arxiv.org/pdf/2411.16788
Copy Paste: [[2411.16788]] TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction(https://arxiv.org/abs/2411.16788)
Keywords: robust, diffusion
Abstract: We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Given no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large-language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures model focus on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only gives a robust model but also can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on four standard DG benchmark datasets and substantially outperform the current state-ofthe-art (12% improvement on average) while also demonstrating that our predictions can be visually interpreted

Title: Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

Authors: Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16789
Pdf URL: https://arxiv.org/pdf/2411.16789
Copy Paste: [[2411.16789]] Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation(https://arxiv.org/abs/2411.16789)
Keywords: large language model
Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.

Title: Learning Predictive Checklists with Probabilistic Logic Programming

Authors: Yukti Makhija, Edward De Brouwer, Rahul G. Krishnan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16790
Pdf URL: https://arxiv.org/pdf/2411.16790
Copy Paste: [[2411.16790]] Learning Predictive Checklists with Probabilistic Logic Programming(https://arxiv.org/abs/2411.16790)
Keywords: interpretability
Abstract: Checklists have been widely recognized as effective tools for completing complex tasks in a systematic manner. Although originally intended for use in procedural tasks, their interpretability and ease of use have led to their adoption for predictive tasks as well, including in clinical settings. However, designing checklists can be challenging, often requiring expert knowledge and manual rule design based on available data. Recent work has attempted to address this issue by using machine learning to automatically generate predictive checklists from data, although these approaches have been limited to Boolean data. We propose a novel method for learning predictive checklists from diverse data modalities, such as images and time series. Our approach relies on probabilistic logic programming, a learning paradigm that enables matching the discrete nature of checklist with continuous-valued data. We propose a regularization technique to tradeoff between the information captured in discrete concepts of continuous data and permit a tunable level of interpretability for the learned checklist concepts. We demonstrate that our method outperforms various explainable machine learning techniques on prediction tasks involving image sequences, time series, and clinical notes.

Title: What can LLM tell us about cities?

Authors: Zhuoheng Li, Yaochen Wang, Zhixue Song, Yuqi Huang, Rui Bao, Guanjie Zheng, Zhenhui Jessie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16791
Pdf URL: https://arxiv.org/pdf/2411.16791
Copy Paste: [[2411.16791]] What can LLM tell us about cities?(https://arxiv.org/abs/2411.16791)
Keywords: large language model
Abstract: This study explores the capabilities of large language models (LLMs) in providing knowledge about cities and regions on a global scale. We employ two methods: directly querying the LLM for target variable values and extracting explicit and implicit features from the LLM correlated with the target variable. Our experiments reveal that LLMs embed a broad but varying degree of knowledge across global cities, with ML models trained on LLM-derived features consistently leading to improved predictive accuracy. Additionally, we observe that LLMs demonstrate a certain level of knowledge across global cities on all continents, but it is evident when they lack knowledge, as they tend to generate generic or random outputs for unfamiliar tasks. These findings suggest that LLMs can offer new opportunities for data-driven decision-making in the study of cities.

Title: From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task

Authors: Bohao Chen, Yanchao Zhang, Yanan Lv, Hua Han, Xi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16792
Pdf URL: https://arxiv.org/pdf/2411.16792
Copy Paste: [[2411.16792]] From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task(https://arxiv.org/abs/2411.16792)
Keywords: robust, diffusion
Abstract: Diffusion models have recently emerged as a powerful technique in image generation, especially for image super-resolution tasks. While 2D diffusion models significantly enhance the resolution of individual images, existing diffusion-based methods for 3D volume super-resolution often struggle with structure discontinuities in axial direction and high sampling costs. In this work, we present a novel approach that leverages the 2D diffusion model and lateral continuity within the volume to enhance 3D volume electron microscopy (vEM) super-resolution. We first simulate lateral degradation with slices in the XY plane and train a 2D diffusion model to learn how to restore the degraded slices. The model is then applied slice-by-slice in the lateral direction of low-resolution volume, recovering slices while preserving inherent lateral continuity. Following this, a high-frequency-aware 3D super-resolution network is trained on the recovery lateral slice sequences to learn spatial feature transformation across slices. Finally, the network is applied to infer high-resolution volumes in the axial direction, enabling 3D super-resolution. We validate our approach through comprehensive evaluations, including image similarity assessments, resolution analysis, and performance on downstream tasks. Our results on two publicly available focused ion beam scanning electron microscopy (FIB-SEM) datasets demonstrate the robustness and practical applicability of our framework for 3D volume super-resolution.

Title: Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery

Authors: Bhuvan Sachdeva, Naren Akash, Tajamul Ashraf, Simon Muller, Thomas Schultz, Maximilian W. M. Wintergerst, Niharika Singri Prasad, Kaushik Murali, Mohit Jain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16794
Pdf URL: https://arxiv.org/pdf/2411.16794
Copy Paste: [[2411.16794]] Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery(https://arxiv.org/abs/2411.16794)
Keywords: segmentation
Abstract: Cataract surgery is the most common surgical procedure globally, with a disproportionately higher burden in developing countries. While automated surgical video analysis has been explored in general surgery, its application to ophthalmic procedures remains limited. Existing works primarily focus on Phaco cataract surgery, an expensive technique not accessible in regions where cataract treatment is most needed. In contrast, Manual Small-Incision Cataract Surgery (MSICS) is the preferred low-cost, faster alternative in high-volume settings and for challenging cases. However, no dataset exists for MSICS. To address this gap, we introduce Cataract-MSICS, the first comprehensive dataset containing 53 surgical videos annotated for 18 surgical phases and 3,527 frames with 13 surgical tools at the pixel level. We benchmark this dataset on state-of-the-art models and present ToolSeg, a novel framework that enhances tool segmentation by introducing a phase-conditional decoder and a simple yet effective semi-supervised setup leveraging pseudo-labels from foundation models. Our approach significantly improves segmentation performance, achieving a $23.77\%$ to $38.10\%$ increase in mean Dice scores, with a notable boost for tools that are less prevalent and small. Furthermore, we demonstrate that ToolSeg generalizes to other surgical settings, showcasing its effectiveness on the CaDIS dataset.

Title: Towards Efficient Model-Heterogeneity Federated Learning for Large Models

Authors: Ruofan Jia, Weiying Xie, Jie Lei, Haonan Qin, Jitao Ma, Leyuan Fang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16796
Pdf URL: https://arxiv.org/pdf/2411.16796
Copy Paste: [[2411.16796]] Towards Efficient Model-Heterogeneity Federated Learning for Large Models(https://arxiv.org/abs/2411.16796)
Keywords: federate
Abstract: As demand grows for complex tasks and high-performance applications in edge computing, the deployment of large models in federated learning has become increasingly urgent, given their superior representational power and generalization capabilities. However, the resource constraints and heterogeneity among clients present significant challenges to this deployment. To tackle these challenges, we introduce HeteroTune, an innovative fine-tuning framework tailored for model-heterogeneity federated learning (MHFL). In particular, we propose a novel parameter-efficient fine-tuning (PEFT) structure, called FedAdapter, which employs a multi-branch cross-model aggregator to enable efficient knowledge aggregation across diverse models. Benefiting from the lightweight FedAdapter, our approach significantly reduces both the computational and communication overhead. Finally, our approach is simple yet effective, making it applicable to a wide range of large model fine-tuning tasks. Extensive experiments on computer vision (CV) and natural language processing (NLP) tasks demonstrate that our method achieves state-of-the-art results, seamlessly integrating efficiency and performance.

Title: Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

Authors: Alireza Amiri-Margavi, Iman Jebellat, Ehsan Jebellat, Seyed Pouyan Mousavi Davoudi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16797
Pdf URL: https://arxiv.org/pdf/2411.16797
Copy Paste: [[2411.16797]] Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models(https://arxiv.org/abs/2411.16797)
Keywords: large language model
Abstract: We explore the collaborative dynamics of an innovative language model interaction system involving advanced models such as GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash. These models generate and answer complex, PhD-level statistical questions without exact ground-truth answers. Our study investigates how inter-model consensus enhances the reliability and precision of responses. By employing statistical methods such as chi-square tests, Fleiss' Kappa, and confidence interval analysis, we evaluate consensus rates and inter-rater agreement to quantify the reliability of collaborative outputs. Key results reveal that Claude and GPT-4 exhibit the highest reliability and consistency, as evidenced by their narrower confidence intervals and higher alignment with question-generating models. Conversely, Gemini and LLaMA show more significant variability in their consensus rates, as reflected in wider confidence intervals and lower reliability percentages. These findings demonstrate that collaborative interactions among large language models (LLMs) significantly improve response reliability, offering novel insights into autonomous, cooperative reasoning and validation in AI systems.

Title: Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image

Authors: Jiajing Lin, Zhenzhong Wang, Shu Jiang, Yongjie Hou, Min Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16800
Pdf URL: https://arxiv.org/pdf/2411.16800
Copy Paste: [[2411.16800]] Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image(https://arxiv.org/abs/2411.16800)
Keywords: robust, diffusion
Abstract: The task of 4D content generation involves creating dynamic 3D models that evolve over time in response to specific input conditions, such as images. Existing methods rely heavily on pre-trained video diffusion models to guide 4D content dynamics, but these approaches often fail to capture essential physical principles, as video diffusion models lack a robust understanding of real-world physics. Moreover, these models face challenges in providing fine-grained control over dynamics and exhibit high computational costs. In this work, we propose Phys4DGen, a novel, high-efficiency framework that generates physics-compliant 4D content from a single image with enhanced control capabilities. Our approach uniquely integrates physical simulations into the 4D generation pipeline, ensuring adherence to fundamental physical laws. Inspired by the human ability to infer physical properties visually, we introduce a Physical Perception Module (PPM) that discerns the material properties and structural components of the 3D object from the input image, facilitating accurate downstream simulations. Phys4DGen significantly accelerates the 4D generation process by eliminating iterative optimization steps in the dynamics modeling phase. It allows users to intuitively control the movement speed and direction of generated 4D content by adjusting external forces, achieving finely tunable, physically plausible animations. Extensive evaluations show that Phys4DGen outperforms existing methods in both inference speed and physical realism, producing high-quality, controllable 4D content.

Title: Controllable Human Image Generation with Personalized Multi-Garments

Authors: Yisol Choi, Sangkyung Kwak, Sihyun Yu, Hyungwon Choi, Jinwoo Shin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16801
Pdf URL: https://arxiv.org/pdf/2411.16801
Copy Paste: [[2411.16801]] Controllable Human Image Generation with Personalized Multi-Garments(https://arxiv.org/abs/2411.16801)
Keywords: diffusion
Abstract: We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in human image and extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide-applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.

Title: Abnormality-Driven Representation Learning for Radiology Imaging

Authors: Marta Ligero, Tim Lenz, Georg Wölflein, Omar S.M. El Nahhas, Daniel Truhn, Jakob Nikolas Kather
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16803
Pdf URL: https://arxiv.org/pdf/2411.16803
Copy Paste: [[2411.16803]] Abnormality-Driven Representation Learning for Radiology Imaging(https://arxiv.org/abs/2411.16803)
Keywords: transformer
Abstract: To date, the most common approach for radiology deep learning pipelines is the use of end-to-end 3D networks based on models pre-trained on other tasks, followed by fine-tuning on the task at hand. In contrast, adjacent medical fields such as pathology, which focus on 2D images, have effectively adopted task-agnostic foundational models based on self-supervised learning (SSL), combined with weakly-supervised deep learning (DL). However, the field of radiology still lacks task-agnostic representation models due to the computational and data demands of 3D imaging and the anatomical complexity inherent to radiology scans. To address this gap, we propose CLEAR, a framework for radiology images that uses extracted embeddings from 2D slices along with attention-based aggregation for efficiently predicting clinical endpoints. As part of this framework, we introduce lesion-enhanced contrastive learning (LeCL), a novel approach to obtain visual representations driven by abnormalities in 2D axial slices across different locations of the CT scans. Specifically, we trained single-domain contrastive learning approaches using three different architectures: Vision Transformers, Vision State Space Models and Gated Convolutional Neural Networks. We evaluate our approach across three clinical tasks: tumor lesion location, lung disease detection, and patient staging, benchmarking against four state-of-the-art foundation models, including BiomedCLIP. Our findings demonstrate that CLEAR using representations learned through LeCL, outperforms existing foundation models, while being substantially more compute- and data-efficient.

Title: Blockchain Meets LLMs: A Living Survey on Bidirectional Integration

Authors: Jianghao Gong, Peiqi Yan, Yue Zhang, Hongli An, Logan Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16809
Pdf URL: https://arxiv.org/pdf/2411.16809
Copy Paste: [[2411.16809]] Blockchain Meets LLMs: A Living Survey on Bidirectional Integration(https://arxiv.org/abs/2411.16809)
Keywords: security, privacy, explainability, large language model
Abstract: In the domain of large language models, considerable advancements have been attained in multimodal large language models and explainability research, propelled by the continuous technological progress and innovation. Nonetheless, security and privacy concerns continue to pose as prominent challenges in this field. The emergence of blockchain technology, marked by its decentralized nature, tamper-proof attributes, distributed storage functionality, and traceability, has provided novel approaches for resolving these issues. Both of these technologies independently hold vast potential for development; yet, their combination uncovers substantial cross-disciplinary opportunities and growth prospects. The current research tendencies are increasingly concentrating on the integration of blockchain with large language models, with the aim of compensating for their respective limitations through this fusion and promoting further technological evolution. In this study, we evaluate the advantages and developmental constraints of the two technologies, and explore the possibility and development potential of their combination. This paper primarily investigates the technical convergence in two directions: Firstly, the application of large language models to blockchain, where we identify six major development directions and explore solutions to the shortcomings of blockchain technology and their application scenarios; Secondly, the application of blockchain technology to large language models, leveraging the characteristics of blockchain to remedy the deficiencies of large language models and exploring its application potential in multiple fields.

Title: Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation

Authors: Shengeng Tang, Jiayi He, Lechao Cheng, Jingjing Wu, Dan Guo, Richang Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16810
Pdf URL: https://arxiv.org/pdf/2411.16810
Copy Paste: [[2411.16810]] Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation(https://arxiv.org/abs/2411.16810)
Keywords: diffusion
Abstract: Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.

Title: Fine-Tuning LLMs with Noisy Data for Political Argument Generation

Authors: Svetlana Churina, Kokil Jaidka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16813
Pdf URL: https://arxiv.org/pdf/2411.16813
Copy Paste: [[2411.16813]] Fine-Tuning LLMs with Noisy Data for Political Argument Generation(https://arxiv.org/abs/2411.16813)
Keywords: attack
Abstract: The incivility in social media discourse complicates the deployment of automated text generation models for politically sensitive content. Fine-tuning and prompting strategies are critical, but underexplored, solutions to mitigate toxicity in such contexts. This study investigates the fine-tuning and prompting effects on GPT-3.5 Turbo using subsets of the CLAPTON dataset of political discussion posts, comprising Twitter and Reddit data labeled for their justification, reciprocity and incivility. Fine-tuned models on Reddit data scored highest on discussion quality, while combined noisy data led to persistent toxicity. Prompting strategies reduced specific toxic traits, such as personal attacks, but had limited broader impact. The findings emphasize that high-quality data and well-crafted prompts are essential to reduce incivility and improve rhetorical quality in automated political discourse generation.

Title: XAI and Android Malware Models

Authors: Maithili Kulkarni, Mark Stamp
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16817
Pdf URL: https://arxiv.org/pdf/2411.16817
Copy Paste: [[2411.16817]] XAI and Android Malware Models(https://arxiv.org/abs/2411.16817)
Keywords: security, attack
Abstract: Android malware detection based on machine learning (ML) and deep learning (DL) models is widely used for mobile device security. Such models offer benefits in terms of detection accuracy and efficiency, but it is often difficult to understand how such learning models make decisions. As a result, these popular malware detection strategies are generally treated as black boxes, which can result in a lack of trust in the decisions made, as well as making adversarial attacks more difficult to detect. The field of eXplainable Artificial Intelligence (XAI) attempts to shed light on such black box models. In this paper, we apply XAI techniques to ML and DL models that have been trained on a challenging Android malware classification problem. Specifically, the classic ML models considered are Support Vector Machines (SVM), Random Forest, and $k$-Nearest Neighbors ($k$-NN), while the DL models we consider are Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). The state-of-the-art XAI techniques that we apply to these trained models are Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive exPlanations (SHAP), PDP plots, ELI5, and Class Activation Mapping (CAM). We obtain global and local explanation results, and we discuss the utility of XAI techniques in this problem domain. We also provide a literature review of XAI work related to Android malware.

Title: Enhancing In-Hospital Mortality Prediction Using Multi-Representational Learning with LLM-Generated Expert Summaries

Authors: Harshavardhan Battula, Jiacheng Liu, Jaideep Srivastava
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16818
Pdf URL: https://arxiv.org/pdf/2411.16818
Copy Paste: [[2411.16818]] Enhancing In-Hospital Mortality Prediction Using Multi-Representational Learning with LLM-Generated Expert Summaries(https://arxiv.org/abs/2411.16818)
Keywords: interpretability, large language model
Abstract: In-hospital mortality (IHM) prediction for ICU patients is critical for timely interventions and efficient resource allocation. While structured physiological data provides quantitative insights, clinical notes offer unstructured, context-rich narratives. This study integrates these modalities with Large Language Model (LLM)-generated expert summaries to improve IHM prediction accuracy. Using the MIMIC-III database, we analyzed time-series physiological data and clinical notes from the first 48 hours of ICU admission. Clinical notes were concatenated chronologically for each patient and transformed into expert summaries using Med42-v2 70B. A multi-representational learning framework was developed to integrate these data sources, leveraging LLMs to enhance textual data while mitigating direct reliance on LLM predictions, which can introduce challenges in uncertainty quantification and interpretability. The proposed model achieved an AUPRC of 0.6156 (+36.41%) and an AUROC of 0.8955 (+7.64%) compared to a time-series-only baseline. Expert summaries outperformed clinical notes or time-series data alone, demonstrating the value of LLM-generated knowledge. Performance gains were consistent across demographic groups, with notable improvements in underrepresented populations, underscoring the framework's equitable application potential. By integrating LLM-generated summaries with structured and unstructured data, the framework captures complementary patient information, significantly improving predictive performance. This approach showcases the potential of LLMs to augment critical care prediction models, emphasizing the need for domain-specific validation and advanced integration strategies for broader clinical adoption.

Title: Pathways on the Image Manifold: Image Editing via Video Generation

Authors: Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16819
Pdf URL: https://arxiv.org/pdf/2411.16819
Copy Paste: [[2411.16819]] Pathways on the Image Manifold: Image Editing via Video Generation(https://arxiv.org/abs/2411.16819)
Keywords: diffusion
Abstract: Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.

Title: DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow

Authors: Ken Deng, Yuanchen Guo, Jingxiang Sun, Zixin Zou, Yangguang Li, Xin Cai, Yanpei Cao, Yebin Liu, Ding Liang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16820
Pdf URL: https://arxiv.org/pdf/2411.16820
Copy Paste: [[2411.16820]] DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow(https://arxiv.org/abs/2411.16820)
Keywords: generative
Abstract: Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.

Title: Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Authors: Yaqi Zhao, Yuanyang Yin, Lin Li, Mingan Lin, Victor Shea-Jay Huang, Siwei Chen, Weipeng Chen, Baoqun Yin, Zenan Zhou, Wentao Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16824
Pdf URL: https://arxiv.org/pdf/2411.16824
Copy Paste: [[2411.16824]] Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge(https://arxiv.org/abs/2411.16824)
Keywords: large language model
Abstract: Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with LLM's cognitive framework, leading to a mismatch where visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data-images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits LVLM's capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM's embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.

Title: Decision Making under the Exponential Family: Distributionally Robust Optimisation with Bayesian Ambiguity Sets

Authors: Charita Dellaporta, Patrick O'Hara, Theodoros Damoulas
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2411.16829
Pdf URL: https://arxiv.org/pdf/2411.16829
Copy Paste: [[2411.16829]] Decision Making under the Exponential Family: Distributionally Robust Optimisation with Bayesian Ambiguity Sets(https://arxiv.org/abs/2411.16829)
Keywords: robust
Abstract: Decision making under uncertainty is challenging as the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs on the model's parameters. However, minimising the expected risk under these beliefs can lead to suboptimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against model uncertainty by optimising the worst-case risk over a posterior-informed ambiguity set. We provide two such sets, based on posterior expectations (DRO-BAS(PE)) or posterior predictives (DRO-BAS(PP)) and prove that both admit, under conditions, strong dual formulations leading to efficient single-stage stochastic programs which are solved with a sample average approximation. For DRO-BAS(PE) this covers all conjugate exponential family members while for DRO-BAS(PP) this is shown under conditions on the predictive's moment generating function. Our DRO-BAS formulations Pareto dominate existing Bayesian DRO on the Newsvendor problem and achieve faster solve times with comparable robustness on the Portfolio problem.

Title: Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

Authors: Hanhui Wang, Yihua Zhang, Ruizheng Bai, Yue Zhao, Sijia Liu, Zhengzhong Tu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16832
Pdf URL: https://arxiv.org/pdf/2411.16832
Copy Paste: [[2411.16832]] Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing(https://arxiv.org/abs/2411.16832)
Keywords: security, privacy, protect, defense, robust, biometric, diffusion, generative
Abstract: Recent advancements in diffusion models have made generative image editing more accessible, enabling creative edits but raising ethical concerns, particularly regarding malicious edits to human portraits that threaten privacy and identity security. Existing protection methods primarily rely on adversarial perturbations to nullify edits but often fail against diverse editing requests. We propose FaceLock, a novel approach to portrait protection that optimizes adversarial perturbations to destroy or significantly alter biometric information, rendering edited outputs biometrically unrecognizable. FaceLock integrates facial recognition and visual perception into perturbation optimization to provide robust protection against various editing attempts. We also highlight flaws in commonly used evaluation metrics and reveal how they can be manipulated, emphasizing the need for reliable assessments of protection. Experiments show FaceLock outperforms baselines in defending against malicious edits and is robust against purification techniques. Ablation studies confirm its stability and broad applicability across diffusion-based editing algorithms. Our work advances biometric defense and sets the foundation for privacy-preserving practices in image editing. The code is available at: this https URL.

Title: Open Vocabulary Monocular 3D Object Detection

Authors: Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16833
Pdf URL: https://arxiv.org/pdf/2411.16833
Copy Paste: [[2411.16833]] Open Vocabulary Monocular 3D Object Detection(https://arxiv.org/abs/2411.16833)
Keywords: robust
Abstract: In this work, we pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image without limiting detection to a predefined set of categories. We formalize this problem, establish baseline methods, and introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space. Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories. Additionally, we propose a target-aware evaluation protocol to address inconsistencies in existing datasets, improving the reliability of model performance assessment. Extensive experiments on the Omni3D dataset demonstrate the effectiveness of the proposed method in zero-shot 3D detection for novel object categories, validating its robust generalization capabilities. Our method and evaluation protocols contribute towards the development of open-vocabulary object detection models that can effectively operate in real-world, category-diverse environments.

Title: SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Authors: Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, XIngang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16856
Pdf URL: https://arxiv.org/pdf/2411.16856
Copy Paste: [[2411.16856]] SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE(https://arxiv.org/abs/2411.16856)
Keywords: large language model
Abstract: Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.

Title: Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Authors: Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2411.16863
Pdf URL: https://arxiv.org/pdf/2411.16863
Copy Paste: [[2411.16863]] Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering(https://arxiv.org/abs/2411.16863)
Keywords: large language model
Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at this https URL.

Title: PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence

Authors: Zequn Chen, Jiezhi Yang, Heng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16877
Pdf URL: https://arxiv.org/pdf/2411.16877
Copy Paste: [[2411.16877]] PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence(https://arxiv.org/abs/2411.16877)
Keywords: robust
Abstract: We present PreF3R, Pose-Free Feed-forward 3D Reconstruction from an image sequence of variable length. Unlike previous approaches, PreF3R removes the need for camera calibration and reconstructs the 3D Gaussian field within a canonical coordinate frame directly from a sequence of unposed images, enabling efficient novel-view rendering. We leverage DUSt3R's ability for pair-wise 3D structure reconstruction, and extend it to sequential multi-view input via a spatial memory network, eliminating the need for optimization-based global alignment. Additionally, PreF3R incorporates a dense Gaussian parameter prediction head, which enables subsequent novel-view synthesis with differentiable rasterization. This allows supervising our model with the combination of photometric loss and pointmap regression loss, enhancing both photorealism and structural accuracy. Given a sequence of ordered images, PreF3R incrementally reconstructs the 3D Gaussian field at 20 FPS, therefore enabling real-time novel-view rendering. Empirical experiments demonstrate that PreF3R is an effective solution for the challenging task of pose-free feed-forward novel-view synthesis, while also exhibiting robust generalization to unseen scenes.

Title: Explainable AI Approach using Near Misses Analysis

Authors: Eran Kaufman, Avivit levy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16895
Pdf URL: https://arxiv.org/pdf/2411.16895
Copy Paste: [[2411.16895]] Explainable AI Approach using Near Misses Analysis(https://arxiv.org/abs/2411.16895)
Keywords: robust, explainability
Abstract: This paper introduces a novel XAI approach based on near-misses analysis (NMA). This approach reveals a hierarchy of logical 'concepts' inferred from the latent decision-making process of a Neural Network (NN) without delving into its explicit structure. We examined our proposed XAI approach on different network architectures that vary in size and shape (e.g., ResNet, VGG, EfficientNet, MobileNet) on several datasets (ImageNet and CIFAR100). The results demonstrate its usability to reflect NNs latent process of concepts generation. We generated a new metric for explainability. Moreover, our experiments suggest that efficient architectures, which achieve a similar accuracy level with much less neurons may still pay the price of explainability and robustness in terms of concepts generation. We, thus, pave a promising new path for XAI research to follow.

Title: Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Authors: Andong Deng, Zhongpai Gao, Anwesa Choudhuri, Benjamin Planche, Meng Zheng, Bin Wang, Terrence Chen, Chen Chen, Ziyan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16932
Pdf URL: https://arxiv.org/pdf/2411.16932
Copy Paste: [[2411.16932]] Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding(https://arxiv.org/abs/2411.16932)
Keywords: large language model
Abstract: Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional information across image sequences, clip sequences, and long videos. Experiments demonstrate the effectiveness of our method, achieving a 27.6% improvement in F1 score and 44.8% in CIDEr on the YouCook2 benchmark and a 14.7% increase in recall on the Charades-STA benchmark compared to the baseline.

Title: A SAM-guided and Match-based Semi-Supervised Segmentation Framework for Medical Imaging

Authors: Guoping Xu, Xiaoxue Qian, Hua Chieh Shao, Jax Luo, Weiguo Lu, You Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16949
Pdf URL: https://arxiv.org/pdf/2411.16949
Copy Paste: [[2411.16949]] A SAM-guided and Match-based Semi-Supervised Segmentation Framework for Medical Imaging(https://arxiv.org/abs/2411.16949)
Keywords: segmentation
Abstract: This study introduces SAMatch, a SAM-guided Match-based framework for semi-supervised medical image segmentation, aimed at improving pseudo label quality in data-scarce scenarios. While Match-based frameworks are effective, they struggle with low-quality pseudo labels due to the absence of ground truth. SAM, pre-trained on a large dataset, generalizes well across diverse tasks and assists in generating high-confidence prompts, which are then used to refine pseudo labels via fine-tuned SAM. SAMatch is trained end-to-end, allowing for dynamic interaction between the models. Experiments on the ACDC cardiac MRI, BUSI breast ultrasound, and MRLiver datasets show SAMatch achieving state-of-the-art results, with Dice scores of 89.36%, 77.76%, and 80.04%, respectively, using minimal labeled data. SAMatch effectively addresses challenges in semi-supervised segmentation, offering a powerful tool for segmentation in data-limited environments. Code and data are available at this https URL.

Title: Probing the limitations of multimodal language models for chemistry and materials research

Authors: Nawaf Alampara, Mara Schilling-Wilhelmi, Martiño Ríos-García, Indrajeet Mandal, Pranav Khetarpal, Hargun Singh Grover, N. M. Anoop Krishnan, Kevin Maik Jablonka
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2411.16955
Pdf URL: https://arxiv.org/pdf/2411.16955
Copy Paste: [[2411.16955]] Probing the limitations of multimodal language models for chemistry and materials research(https://arxiv.org/abs/2411.16955)
Keywords: extraction
Abstract: Recent advancements in artificial intelligence have sparked interest in scientific assistants that could support researchers across the full spectrum of scientific workflows, from literature review to experimental design and data analysis. A key capability for such systems is the ability to process and reason about scientific information in both visual and textual forms - from interpreting spectroscopic data to understanding laboratory setups. Here, we introduce MaCBench, a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. Through a systematic evaluation of leading models, we find that while these systems show promising capabilities in basic perception tasks - achieving near-perfect performance in equipment identification and standardized data extraction - they exhibit fundamental limitations in spatial reasoning, cross-modal information synthesis, and multi-step logical inference. Our insights have important implications beyond chemistry and materials science, suggesting that developing reliable multimodal AI scientific assistants may require advances in curating suitable training data and approaches to training those models.

Title: MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning

Authors: Yuming Feng, Zhiyang Dou, Ling-Hao Chen, Yuan Liu, Tianyu Li, Jingbo Wang, Zeyu Cao, Wenping Wang, Taku Komura, Lingjie Liu
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2411.16964
Pdf URL: https://arxiv.org/pdf/2411.16964
Copy Paste: [[2411.16964]] MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning(https://arxiv.org/abs/2411.16964)
Keywords: diffusion
Abstract: Modeling temporal characteristics and the non-stationary dynamics of body movement plays a significant role in predicting human future motions. However, it is challenging to capture these features due to the subtle transitions involved in the complex human motions. This paper introduces MotionWavelet, a human motion prediction framework that utilizes Wavelet Transformation and studies human motion patterns in the spatial-frequency domain. In MotionWavelet, a Wavelet Diffusion Model (WDM) learns a Wavelet Manifold by applying Wavelet Transformation on the motion data therefore encoding the intricate spatial and temporal motion patterns. Once the Wavelet Manifold is built, WDM trains a diffusion model to generate human motions from Wavelet latent vectors. In addition to the WDM, MotionWavelet also presents a Wavelet Space Shaping Guidance mechanism to refine the denoising process to improve conformity with the manifold structure. WDM also develops Temporal Attention-Based Guidance to enhance prediction accuracy. Extensive experiments validate the effectiveness of MotionWavelet, demonstrating improved prediction accuracy and enhanced generalization across various benchmarks. Our code and models will be released upon acceptance.

Title: ZoomLDM: Latent Diffusion Model for multi-scale image generation

Authors: Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis, Prateek Prasanna, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16969
Pdf URL: https://arxiv.org/pdf/2411.16969
Copy Paste: [[2411.16969]] ZoomLDM: Latent Diffusion Model for multi-scale image generation(https://arxiv.org/abs/2411.16969)
Keywords: diffusion, generative
Abstract: Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to $4096 \times 4096$ pixels and $4\times$ super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments. We provide high-resolution examples of the generated images on our website this https URL.

Title: SEMU-Net: A Segmentation-based Corrector for Fabrication Process Variations of Nanophotonics with Microscopic Images

Authors: Rambod Azimi, Yijian Kong, Dusan Gostimirovic, James J. Clark, Odile Liboiron-Ladouceur
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.16973
Pdf URL: https://arxiv.org/pdf/2411.16973
Copy Paste: [[2411.16973]] SEMU-Net: A Segmentation-based Corrector for Fabrication Process Variations of Nanophotonics with Microscopic Images(https://arxiv.org/abs/2411.16973)
Keywords: segmentation
Abstract: Integrated silicon photonic devices, which manipulate light to transmit and process information on a silicon-on-insulator chip, are highly sensitive to structural variations. Minor deviations during nanofabrication-the precise process of building structures at the nanometer scale-such as over- or under-etching, corner rounding, and unintended defects, can significantly impact performance. To address these challenges, we introduce SEMU-Net, a comprehensive set of methods that automatically segments scanning electron microscope images (SEM) and uses them to train two deep neural network models based on U-Net and its variants. The predictor model anticipates fabrication-induced variations, while the corrector model adjusts the design to address these issues, ensuring that the final fabricated structures closely align with the intended specifications. Experimental results show that the segmentation U-Net reaches an average IoU score of 99.30%, while the corrector attention U-Net in a tandem architecture achieves an average IoU score of 98.67%.

Title: ExpTest: Automating Learning Rate Searching and Tuning with Insights from Linearized Neural Networks

Authors: Zan Chaudhry, Naoko Mizuno
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16975
Pdf URL: https://arxiv.org/pdf/2411.16975
Copy Paste: [[2411.16975]] ExpTest: Automating Learning Rate Searching and Tuning with Insights from Linearized Neural Networks(https://arxiv.org/abs/2411.16975)
Keywords: robust
Abstract: Hyperparameter tuning remains a significant challenge for the training of deep neural networks (DNNs), requiring manual and/or time-intensive grid searches, increasing resource costs and presenting a barrier to the democratization of machine learning. The global initial learning rate for DNN training is particularly important. Several techniques have been proposed for automated learning rate tuning during training; however, they still require manual searching for the global initial learning rate. Though methods exist that do not require this initial selection, they suffer from poor performance. Here, we present ExpTest, a sophisticated method for initial learning rate searching and subsequent learning rate tuning for the training of DNNs. ExpTest draws on insights from linearized neural networks and the form of the loss curve, which we treat as a real-time signal upon which we perform hypothesis testing. We mathematically justify ExpTest and provide empirical support. ExpTest requires minimal overhead, is robust to hyperparameter choice, and achieves state-of-the-art performance on a variety of tasks and architectures, without initial learning rate selection or learning rate scheduling.

Title: EvoChain: a Recovery Approach for Permissioned Blockchain Applications

Authors: Francisco Faria, Samih Eisa, David R. Matos, Miguel L. Pardal
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2411.16976
Pdf URL: https://arxiv.org/pdf/2411.16976
Copy Paste: [[2411.16976]] EvoChain: a Recovery Approach for Permissioned Blockchain Applications(https://arxiv.org/abs/2411.16976)
Keywords: security
Abstract: Blockchain technology supports decentralized, consensus-driven data storage and processing, ensuring integrity and auditability. It is increasingly adopted for use cases with multiple stakeholders with shared ownership scenarios like digital identity and supply chain management. However, real-world deployments face challenges with mistakes and intrusions. This article presents EvoChain, a chaincode framework extension introducing controlled mutability for data redaction and recovery under time-limited or specific conditions. This mechanism allows corrections during a grace period before immutability takes effect. We validated our approach using WineTracker, a Hyperledger Fabric-based supply chain application. It enables some users to cancel unwanted operations while preserving the blockchain security and maintaining data consistency. Performance evaluations showed minimal overhead with functional benefits.

Title: Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis)

Authors: Tim Hartill
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16985
Pdf URL: https://arxiv.org/pdf/2411.16985
Copy Paste: [[2411.16985]] Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis)(https://arxiv.org/abs/2411.16985)
Keywords: large language model
Abstract: Pretrained large Language Models (LLMs) are able to answer questions that are unlikely to have been encountered during training. However a diversity of potential applications exist in the broad domain of reasoning systems and considerations such as latency, cost, available compute resource and internet connectivity are relevant in determining an appropriate approach. We consider the setting where some local compute capacity is available at inference time but internet connectivity is not. Similar to a general-purpose LLM, we assume that our much smaller Reasoning Models may be asked arbitrary questions from unknown distributions, so we focus on evaluation in an unseen setting. We train our models to answer diverse questions by instilling an ability to reason over a retrieved context. We acquire context from two knowledge sources; a Wikipedia corpus queried using a multi-hop dense retrieval system with novel extensions, and from rationales generated from a larger Language Model optimised to run in a lower resource environment. Our main contributions: We propose novel methods to show that our model is capable of answering contextualised questions without memorisation. We establish a comprehensive set of baseline results on unseen evaluation datasets. We show that the addition of novel retrieval-augmented training datasets (RATD) to the training regime of the Reasoning Model significantly improves results. We demonstrate further significant improvement through the application of methods for combining knowledge from two sources. The first method (RR) involves training a novel Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We use the scores to derive combined contexts. We also show that utilising the RATD datasets enables our model to become proficient at utilising combined noisy contexts.

Title: Decentralized Storage And Self-Sovereign Identity For Document-Based Claims

Authors: Bruno Gomes, Samih Eisa, David R. Matos, Miguel L. Pardal
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2411.16987
Pdf URL: https://arxiv.org/pdf/2411.16987
Copy Paste: [[2411.16987]] Decentralized Storage And Self-Sovereign Identity For Document-Based Claims(https://arxiv.org/abs/2411.16987)
Keywords: secure, privacy
Abstract: Users increasingly rely on identity providers for accessing online services and resources. However, centralized identity systems often compromise user privacy due to online activity tracking or data breaches. At the same time, many online services require digital copies of physical documents for validation in claims processes, such as providing proof of residence for opening a bank account or verifying medical images for health insurance claims. With centralized solutions, privacy depends entirely on the trusted party, but there are emerging decentralized approaches that offer greater transparency. This article introduces SoverClaim, a decentralized application prototype that empowers users to control their identity and also allows them to present digital documents with privacy. SoverClaim leverages Hyperledger Indy, a blockchain for issuing and presenting self-sovereign digital identities with transparent audit logs, and Storj, a decentralized peer-to-peer service, for secure and decentralized document storage and subsequent deletion. The prototype demonstrates the seamless integration of self-sovereign identities and document-based claims, achieving response times of under 750 ms, making it suitable for timely human interactions.

Title: CMAViT: Integrating Climate, Managment, and Remote Sensing Data for Crop Yield Estimation with Multimodel Vision Transformers

Authors: Hamid Kamangir, Brent. S. Sams, Nick Dokoozlian, Luis Sanchez, J. Mason. Earles
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16989
Pdf URL: https://arxiv.org/pdf/2411.16989
Copy Paste: [[2411.16989]] CMAViT: Integrating Climate, Managment, and Remote Sensing Data for Crop Yield Estimation with Multimodel Vision Transformers(https://arxiv.org/abs/2411.16989)
Keywords: transformer
Abstract: Crop yield prediction is essential for agricultural planning but remains challenging due to the complex interactions between weather, climate, and management practices. To address these challenges, we introduce a deep learning-based multi-model called Climate-Management Aware Vision Transformer (CMAViT), designed for pixel-level vineyard yield predictions. CMAViT integrates both spatial and temporal data by leveraging remote sensing imagery and short-term meteorological data, capturing the effects of growing season variations. Additionally, it incorporates management practices, which are represented in text form, using a cross-attention encoder to model their interaction with time-series data. This innovative multi-modal transformer tested on a large dataset from 2016-2019 covering 2,200 hectares and eight grape cultivars including more than 5 million vines, outperforms traditional models like UNet-ConvLSTM, excelling in spatial variability capture and yield prediction, particularly for extreme values in vineyards. CMAViT achieved an R2 of 0.84 and a MAPE of 8.22% on an unseen test dataset. Masking specific modalities lowered performance: excluding management practices, climate data, and both reduced R2 to 0.73, 0.70, and 0.72, respectively, and raised MAPE to 11.92%, 12.66%, and 12.39%, highlighting each modality's importance for accurate yield prediction. Code is available at this https URL.

Title: Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

Authors: Yao Fu, Yin Yu, Xiaotian Han, Runchao Li, Xianxuan Long, Haotian Yu, Pan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16991
Pdf URL: https://arxiv.org/pdf/2411.16991
Copy Paste: [[2411.16991]] Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models(https://arxiv.org/abs/2411.16991)
Keywords: large language model
Abstract: Knowledge distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprints. However, the availability of complex teacher models is a prerequisite for running most KD pipelines. Thus, the traditional KD procedure can be unachievable or budget-unfriendly, particularly when relying on commercial LLMs like GPT4. In this regard, Self-distillation (SelfD) emerges as an advisable alternative, enabling student models to learn without teachers' guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications, assuming the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous minibatch (DynSDPB), which realizes current iterations' distillation from the last ones' generated logits. Additionally, to address prediction inaccuracies during the early iterations, we dynamically adjust the distillation influence and temperature values to enhance the adaptability of fine-tuning. Furthermore, DynSDPB is a novel fine-tuning policy that facilitates the seamless integration of existing self-correction and self-training techniques for small language models (SLMs) because they all require updating SLMs' parameters. We demonstrate the superior performance of DynSDPB on both encoder-only LMs (e.g., BERT model families) and decoder-only LMs (e.g., LLaMA model families), validating its effectiveness across natural language understanding (NLU) and natural language generation (NLG) benchmarks.

Title: Tree Transformers are an Ineffective Model of Syntactic Constituency

Authors: Michael Ginn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16993
Pdf URL: https://arxiv.org/pdf/2411.16993
Copy Paste: [[2411.16993]] Tree Transformers are an Ineffective Model of Syntactic Constituency(https://arxiv.org/abs/2411.16993)
Keywords: transformer
Abstract: Linguists have long held that a key aspect of natural language syntax is the recursive organization of language units into constituent structures, and research has suggested that current state-of-the-art language models lack an inherent bias towards this feature. A number of alternative models have been proposed to provide inductive biases towards constituency, including the Tree Transformer, which utilizes a modified attention mechanism to organize tokens into constituents. We investigate Tree Transformers to study whether they utilize meaningful and/or useful constituent structures. We pretrain a large Tree Transformer on language modeling in order to investigate the learned constituent tree representations of sentences, finding little evidence for meaningful structures. Next, we evaluate Tree Transformers with similar transformer models on error detection tasks requiring constituent structure. We find that while the Tree Transformer models may slightly outperform at these tasks, there is little evidence to suggest a meaningful improvement. In general, we conclude that there is little evidence to support Tree Transformer as an effective model of syntactic constituency.

Title: Curvature Informed Furthest Point Sampling

Authors: Shubham Bhardwaj, Ashwin Vinod, Soumojit Bhattacharya, Aryan Koganti, Aditya Sai Ellendula, Balakrishna Reddy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16995
Pdf URL: https://arxiv.org/pdf/2411.16995
Copy Paste: [[2411.16995]] Curvature Informed Furthest Point Sampling(https://arxiv.org/abs/2411.16995)
Keywords: robust, segmentation
Abstract: Point cloud representation has gained traction due to its efficient memory usage and simplicity in acquisition, manipulation, and storage. However, as point cloud sizes increase, effective down-sampling becomes essential to address the computational requirements of downstream tasks. Classical approaches, such as furthest point sampling (FPS), perform well on benchmarks but rely on heuristics and overlook geometric features, like curvature, during down-sampling. In this paper, We introduce a reinforcement learning-based sampling algorithm that enhances FPS by integrating curvature information. Our approach ranks points by combining FPS-derived soft ranks with curvature scores computed by a deep neural network, allowing us to replace a proportion of low-curvature points in the FPS set with high-curvature points from the unselected set. Existing differentiable sampling techniques often suffer from training instability, hindering their integration into end-to-end learning frameworks. By contrast, our method achieves stable end-to-end learning, consistently outperforming baseline models across multiple downstream geometry processing tasks. We provide comprehensive ablation studies, with both qualitative and quantitative insights into the effect of each feature on performance. Our algorithm establishes state-of-the-art results for classification, segmentation and shape completion, showcasing its robustness and adaptability.

Title: Can a Single Tree Outperform an Entire Forest?

Authors: Qiangqiang Mao, Yankai Cao
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17003
Pdf URL: https://arxiv.org/pdf/2411.17003
Copy Paste: [[2411.17003]] Can a Single Tree Outperform an Entire Forest?(https://arxiv.org/abs/2411.17003)
Keywords: interpretability
Abstract: The prevailing mindset is that a single decision tree underperforms classic random forests in testing accuracy, despite its advantages in interpretability and lightweight structure. This study challenges such a mindset by significantly improving the testing accuracy of an oblique regression tree through our gradient-based entire tree optimization framework, making its performance comparable to the classic random forest. Our approach reformulates tree training as a differentiable unconstrained optimization task, employing a scaled sigmoid approximation strategy. To ameliorate numerical instability, we propose an algorithmic scheme that solves a sequence of increasingly accurate approximations. Additionally, a subtree polish strategy is implemented to reduce approximation errors accumulated across the tree. Extensive experiments on 16 datasets demonstrate that our optimized tree outperforms the classic random forest by an average of $2.03\%$ improvements in testing accuracy.

Title: HOPE: Homomorphic Order-Preserving Encryption for Outsourced Databases -- A Stateless Approach

Authors: Baiqiang Wang, Dongfang Zhao
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2411.17009
Pdf URL: https://arxiv.org/pdf/2411.17009
Copy Paste: [[2411.17009]] HOPE: Homomorphic Order-Preserving Encryption for Outsourced Databases -- A Stateless Approach(https://arxiv.org/abs/2411.17009)
Keywords: secure, security, attack
Abstract: Order-preserving encryption (OPE) is a fundamental cryptographic tool for enabling efficient range queries on encrypted data in outsourced databases. Despite its importance, existing OPE schemes face critical limitations that hinder their practicality. Stateful designs require clients to maintain plaintext-to-ciphertext mappings, imposing significant storage and management overhead. Stateless designs often rely on interactive protocols between the client and server, leading to high communication latency and limited scalability. These limitations make existing schemes unsuitable for real-world applications that demand simplicity, efficiency, and scalability. In this work, we present Homomorphic OPE (HOPE), a new OPE scheme that eliminates client-side storage and avoids additional client-server interaction during query execution. HOPE leverages the additive property of homomorphic encryption to introduce a novel comparison key mechanism, which transforms ciphertext comparison into a randomized difference computation. This mechanism ensures that only the sign of the comparison is preserved while fully masking the underlying plaintext values, enabling secure and efficient range queries without leaking additional information about the data. We provide a formal cryptographic analysis of HOPE, proving its security under the widely accepted IND-OCPA model. Our proofs rigorously demonstrate that the comparison key mechanism reveals no information beyond the order of the plaintexts and ensures resistance to both chosen-plaintext attacks and frequency analysis. To validate the practicality of HOPE, we conduct extensive experiments comparing it with state-of-the-art OPE schemes. The results demonstrate that HOPE achieves competitive query performance while addressing the key limitations of existing designs, making it a scalable and secure solution for outsourced database systems.

Title: TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

Authors: Zhenchen Wan, Yanwu Xu, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17017
Pdf URL: https://arxiv.org/pdf/2411.17017
Copy Paste: [[2411.17017]] TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On(https://arxiv.org/abs/2411.17017)
Keywords: robust, diffusion, transformer, generative, large language model
Abstract: Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and preserving fine-grained details, such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism to generate prompts by optimizing Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for VTO task.

Title: RED: Robust Environmental Design

Authors: Jinghan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17026
Pdf URL: https://arxiv.org/pdf/2411.17026
Copy Paste: [[2411.17026]] RED: Robust Environmental Design(https://arxiv.org/abs/2411.17026)
Keywords: attack, robust
Abstract: The classification of road signs by autonomous systems, especially those reliant on visual inputs, is highly susceptible to adversarial attacks. Traditional approaches to mitigating such vulnerabilities have focused on enhancing the robustness of classification models. In contrast, this paper adopts a fundamentally different strategy aimed at increasing robustness through the redesign of road signs themselves. We propose an attacker-agnostic learning scheme to automatically design road signs that are robust to a wide array of patch-based attacks. Empirical tests conducted in both digital and physical environments demonstrate that our approach significantly reduces vulnerability to patch attacks, outperforming existing techniques.

Title: Multimodal Alignment and Fusion: A Survey

Authors: Songtao Li, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17040
Pdf URL: https://arxiv.org/pdf/2411.17040
Copy Paste: [[2411.17040]] Multimodal Alignment and Fusion: A Survey(https://arxiv.org/abs/2411.17040)
Keywords: robust
Abstract: This survey offers a comprehensive review of recent advancements in multimodal alignment and fusion within machine learning, spurred by the growing diversity of data types such as text, images, audio, and video. Multimodal integration enables improved model accuracy and broader applicability by leveraging complementary information across different modalities, as well as facilitating knowledge transfer in situations with limited data. We systematically categorize and analyze existing alignment and fusion techniques, drawing insights from an extensive review of more than 200 relevant papers. Furthermore, this survey addresses the challenges of multimodal data integration - including alignment issues, noise resilience, and disparities in feature representation - while focusing on applications in domains like social media analysis, medical imaging, and emotion recognition. The insights provided are intended to guide future research towards optimizing multimodal learning systems to enhance their scalability, robustness, and generalizability across various applications.

Title: Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17041
Pdf URL: https://arxiv.org/pdf/2411.17041
Copy Paste: [[2411.17041]] Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models(https://arxiv.org/abs/2411.17041)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependency across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward model. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free$^2$Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.

Title: Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation

Authors: Minh-Tuan Tran, Trung Le, Xuan-May Le, Jianfei Cai, Mehrtash Harandi, Dinh Phung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17046
Pdf URL: https://arxiv.org/pdf/2411.17046
Copy Paste: [[2411.17046]] Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation(https://arxiv.org/abs/2411.17046)
Keywords: data-free
Abstract: Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data. While DFKD methods have achieved success on smaller datasets like CIFAR10 and CIFAR100, they encounter challenges on larger, high-resolution datasets such as ImageNet. A primary issue with previous approaches is their generation of synthetic images at high resolutions (e.g., $224 \times 224$) without leveraging information from real images, often resulting in noisy images that lack essential class-specific features in large datasets. Additionally, the computational cost of generating the extensive data needed for effective knowledge transfer can be prohibitive. In this paper, we introduce MUlti-reSolution data-freE (MUSE) to address these limitations. MUSE generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features. To further enhance model diversity, we propose multi-resolution generation and embedding diversity techniques that strengthen latent space representations, leading to significant performance improvements. Experimental results demonstrate that MUSE achieves state-of-the-art performance across both small- and large-scale datasets, with notable performance gains of up to two digits in nearly all ImageNet and subset experiments. Code is available at this https URL.

Title: PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Authors: Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, Deng Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17048
Pdf URL: https://arxiv.org/pdf/2411.17048
Copy Paste: [[2411.17048]] PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation(https://arxiv.org/abs/2411.17048)
Keywords: robust
Abstract: The current text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but it is still under-explored in identity-specific human video generation with customized ID images. The key challenge lies in maintaining high ID fidelity consistently while preserving the original motion dynamic and semantic following after the identity injection. Current video identity customization methods mainly rely on reconstructing given identity images on text-to-image models, which have a divergent distribution with the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed \textbf{PersonalVideo}, that applies direct supervision on videos synthesized by the T2V model to bridge the gap. Specifically, we introduce a learnable Isolated Identity Adapter to customize the specific identity non-intrusively, which does not comprise the original T2V model's abilities (e.g., motion dynamic and semantic following). With the non-reconstructive identity loss, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, gaining good robustness even with only a single reference image available. Extensive experiments demonstrate our method's superiority in delivering high identity faithfulness while preserving the inherent video generation qualities of the original T2V model, outshining prior approaches. Notably, our PersonalVideo seamlessly integrates with pre-trained SD components, such as ControlNet and style LoRA, requiring no extra tuning overhead.

Title: ThreatModeling-LLM: Automating Threat Modeling using Large Language Models for Banking System

Authors: Shuiqiao Yang, Tingmin Wu, Shigang Liu, David Nguyen, Seung Jang, Alsharif Abuadbba
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17058
Pdf URL: https://arxiv.org/pdf/2411.17058
Copy Paste: [[2411.17058]] ThreatModeling-LLM: Automating Threat Modeling using Large Language Models for Banking System(https://arxiv.org/abs/2411.17058)
Keywords: security, large language model
Abstract: Threat modeling is a crucial component of cybersecurity, particularly for industries such as banking, where the security of financial data is paramount. Traditional threat modeling approaches require expert intervention and manual effort, often leading to inefficiencies and human error. The advent of Large Language Models (LLMs) offers a promising avenue for automating these processes, enhancing both efficiency and efficacy. However, this transition is not straightforward due to three main challenges: (1) the lack of publicly available, domain-specific datasets, (2) the need for tailored models to handle complex banking system architectures, and (3) the requirement for real-time, adaptive mitigation strategies that align with compliance standards like NIST 800-53. In this paper, we introduce ThreatModeling-LLM, a novel and adaptable framework that automates threat modeling for banking systems using LLMs. ThreatModeling-LLM operates in three stages: 1) dataset creation, 2) prompt engineering and 3) model fine-tuning. We first generate a benchmark dataset using Microsoft Threat Modeling Tool (TMT). Then, we apply Chain of Thought (CoT) and Optimization by PROmpting (OPRO) on the pre-trained LLMs to optimize the initial prompt. Lastly, we fine-tune the LLM using Low-Rank Adaptation (LoRA) based on the benchmark dataset and the optimized prompt to improve the threat identification and mitigation generation capabilities of pre-trained LLMs.

Title: A generalised novel loss function for computational fluid dynamics

Authors: Zachary Cooper-Baldock, Paulo E. Santos, Russell S.A. Brinkworth, Karl Sammut
Subjects: cs.LG, cs.CV, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2411.17059
Pdf URL: https://arxiv.org/pdf/2411.17059
Copy Paste: [[2411.17059]] A generalised novel loss function for computational fluid dynamics(https://arxiv.org/abs/2411.17059)
Keywords: generative
Abstract: Computational fluid dynamics (CFD) simulations are crucial in automotive, aerospace, maritime and medical applications, but are limited by the complexity, cost and computational requirements of directly calculating the flow, often taking days of compute time. Machine-learning architectures, such as controlled generative adversarial networks (cGANs) hold significant potential in enhancing or replacing CFD investigations, due to cGANs ability to approximate the underlying data distribution of a dataset. Unlike traditional cGAN applications, where the entire image carries information, CFD data contains small regions of highly variant data, immersed in a large context of low variance that is of minimal importance. This renders most existing deep learning techniques that give equal importance to every portion of the data during training, inefficient. To mitigate this, a novel loss function is proposed called Gradient Mean Squared Error (GMSE) which automatically and dynamically identifies the regions of importance on a field-by-field basis, assigning appropriate weights according to the local variance. To assess the effectiveness of the proposed solution, three identical networks were trained; optimised with Mean Squared Error (MSE) loss, proposed GMSE loss and a dynamic variant of GMSE (DGMSE). The novel loss function resulted in faster loss convergence, correlating to reduced training time, whilst also displaying an 83.6% reduction in structural similarity error between the generated field and ground truth simulations, a 76.6% higher maximum rate of loss and an increased ability to fool a discriminator network. It is hoped that this loss function will enable accelerated machine learning within computational fluid dynamics.

Title: SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Authors: Guoan Xu, Jiaming Chen, Wenfeng Huang, Wenjing Jia, Guangwei Gao, Guo-Jun Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17061
Pdf URL: https://arxiv.org/pdf/2411.17061
Copy Paste: [[2411.17061]] SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation(https://arxiv.org/abs/2411.17061)
Keywords: transformer, segmentation
Abstract: The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, ViT backbones often overlook the specific needs of task decoders, revealing opportunities to design decoders tailored to efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation. Instead of relying on the simple conventional skip connections, we employ lateral connections between the encoder and decoder stages, using encoder features as Queries for the cross-attention modules. Additionally, we introduce a Cross-Layer Block that blends hierarchical feature maps from different encoder and decoder stages to create a unified representation for Keys and Values. To further boost computational efficiency, SCASeg compresses queries and keys into strip-like patterns to optimize memory usage and inference speed over the traditional vanilla cross-attention. Moreover, the Cross-Layer Block incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers. This approach facilitates effective feature interaction at different scales, improving the overall performance. Experiments show that the adaptable decoder of SCASeg produces competitive performance across different setups, surpassing leading segmentation architectures on all benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under varying computational limitations.

Title: Graph Structure Learning with Bi-level Optimization

Authors: Nan Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17062
Pdf URL: https://arxiv.org/pdf/2411.17062
Copy Paste: [[2411.17062]] Graph Structure Learning with Bi-level Optimization(https://arxiv.org/abs/2411.17062)
Keywords: robust, extraction
Abstract: Currently, most Graph Structure Learning (GSL) methods, as a means of learning graph structure, improve the robustness of GNN merely from a local view by considering the local information related to each edge and indiscriminately applying the mechanism across edges, which may suffer from the local structure heterogeneity of the graph (\ie the uneven distribution of inter-class connections over nodes). To overcome the cons, we extract the graph structure as a learnable parameter and jointly learn the structure and common parameters of GNN from the global view. Excitingly, the common parameters contain the global information for nodes features mapping, which is also crucial for structure optimization (\ie optimizing the structure relies on global mapping information). Mathematically, we apply a generic structure extractor to abstract the graph structure and transform GNNs in the form of learning structure and common parameters. Then, we model the learning process as a novel bi-level optimization, \ie \textit{Generic Structure Extraction with Bi-level Optimization for Graph Structure Learning (GSEBO)}, which optimizes GNN parameters in the upper level to obtain the global mapping information and graph structure is optimized in the lower level with the global information learned from the upper level. We instantiate the proposed GSEBO on classical GNNs and compare it with the state-of-the-art GSL methods. Extensive experiments validate the effectiveness of the proposed GSEBO on four real-world datasets.

Title: Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models

Authors: Colin Conwell, Rupert Tawiah-Quashie, Tomer Ullman
Subjects: cs.CV, cs.CL, cs.SC
Abstract URL: https://arxiv.org/abs/2411.17066
Pdf URL: https://arxiv.org/pdf/2411.17066
Copy Paste: [[2411.17066]] Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models(https://arxiv.org/abs/2411.17066)
Keywords: diffusion, generative
Abstract: Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these `logical probes', and find that none reliably produce human agreement scores greater than 50\%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a `grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams, directly quantifying, for example, the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates for 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence (`approximate numeracy') of prompts involving integers. We conclude by discussing the limitations inherent to `grounded' multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g. DALL-E 3), or under-specified syntactical constraints (e.g. `grounded diffusion'), and propose minimal modifications (inspired by development, based in imagery) that could help to bridge the lingering compositional gap between scale and structure. All data and code is available at this https URL

Title: Don't Command, Cultivate: An Exploratory Study of System-2 Alignment

Authors: Yuhang Wang, Jitao Sang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17075
Pdf URL: https://arxiv.org/pdf/2411.17075
Copy Paste: [[2411.17075]] Don't Command, Cultivate: An Exploratory Study of System-2 Alignment(https://arxiv.org/abs/2411.17075)
Keywords: attack, robust
Abstract: The o1 system card identifies the o1 models as the most robust within OpenAI, with their defining characteristic being the progression from rapid, intuitive thinking to slower, more deliberate reasoning. This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance; however, it still exhibits vulnerabilities, particularly against jailbreak attacks employing mathematical encoding. Through detailed case analysis, we identified specific patterns in the o1 model's responses. We also explored the alignment of System-2 safety in open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results show that some simple methods to encourage the model to carefully scrutinize user requests are beneficial for model safety. Additionally, we proposed a implementation plan for process supervision to enhance safety alignment. The implementation details and experimental results will be provided in future versions.

Title: Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts

Authors: Jinho Chang, Hyungjin Chung, Jong Chul Ye
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17077
Pdf URL: https://arxiv.org/pdf/2411.17077
Copy Paste: [[2411.17077]] Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts(https://arxiv.org/abs/2411.17077)
Keywords: diffusion
Abstract: As Classifier-Free Guidance (CFG) has proven effective in conditional diffusion model sampling for improved condition alignment, many applications use a negated CFG term to filter out unwanted features from samples. However, simply negating CFG guidance creates an inverted probability distribution, often distorting samples away from the marginal distribution. Inspired by recent advances in conditional diffusion models for inverse problems, here we present a novel method to enhance negative CFG guidance using contrastive loss. Specifically, our guidance term aligns or repels the denoising direction based on the given condition through contrastive loss, achieving a nearly identical guiding direction to traditional CFG for positive guidance while overcoming the limitations of existing negative guidance methods. Experimental results demonstrate that our approach effectively removes undesirable concepts while maintaining sample quality across diverse scenarios, from simple class conditions to complex and overlapping text prompts.

Title: {\Omega}SFormer: Dual-Modal {\Omega}-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction

Authors: Chang Li, Yu Wang, Ce Zhang, Yongjun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17088
Pdf URL: https://arxiv.org/pdf/2411.17088
Copy Paste: [[2411.17088]] {\Omega}SFormer: Dual-Modal {\Omega}-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction(https://arxiv.org/abs/2411.17088)
Keywords: extraction, transformer, segmentation
Abstract: Terraced field is a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal {\Omega}-like super-resolution Transformer network for intelligent TFVE, offering the following advantages: (1) reducing edge segmentation error from conventional multi-scale downsampling encoder, through fusing original high-resolution features with downsampling features at each step of encoder and leveraging a multi-head attention mechanism; (2) improving the accuracy of TFVE by proposing a {\Omega}-like network structure, which fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels by a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from semantic segmentation results. Moreover, a DMRVD for deep-learning-based TFVE was created for the first time, which covers nine study areas in four provinces of China, with a total coverage area of 22441 square kilometers. To assess the performance of {\Omega}SFormer, classic and SOTA networks were compared. The mIOU of {\Omega}SFormer has improved by 0.165, 0.297 and 0.128 respectively, when compared with best accuracy single-modal remotely sensed imagery, single-modal DEM and dual-modal result.

Title: Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Authors: Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2411.17089
Pdf URL: https://arxiv.org/pdf/2411.17089
Copy Paste: [[2411.17089]] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation(https://arxiv.org/abs/2411.17089)
Keywords: large language model
Abstract: Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations, enabling GPUs to perform only the incremental computation required for each new token. This approach significantly lowers the computational overhead for token generation. However, the memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. In this paper, we introduce an efficient CPU-GPU I/O-aware LLM inference method that avoids transferring the entire KV cache from CPU to GPU by recomputing partial KV cache from activations while concurrently transferring the remaining KV cache via PCIe bus. This approach overlaps GPU recomputation with data transfer to minimize idle GPU time and maximize inference performance. Our method is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that our method achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.

Title: LESS: Efficient Log Storage System Based on Learned Model and Minimum Attribute Tree

Authors: Zhiyang Cheng, Zizhen Zhu, Haoran Dang, Hai Wan, Xibin Zhao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.17091
Pdf URL: https://arxiv.org/pdf/2411.17091
Copy Paste: [[2411.17091]] LESS: Efficient Log Storage System Based on Learned Model and Minimum Attribute Tree(https://arxiv.org/abs/2411.17091)
Keywords: defense, attack
Abstract: In recent years, cyber attacks have become increasingly sophisticated and persistent. Detection and investigation based on the provenance graph can effectively mitigate cyber intrusion. However, in the long time span of defenses, the sheer size of the provenance graph will pose significant challenges to the storage systems. Faced with long-term storage tasks, existing methods are unable to simultaneously achieve lossless information, efficient compression, and fast query support. In this paper, we propose a novel provenance graph storage system, LESS, which consumes smaller storage space and supports faster storage and queries compared to current approaches. We innovatively partition the provenance graph into two distinct components, the graph structure and attribute, and store them separately. Based on their respective characteristics, we devise two appropriate storage schemes: the provenance graph structure storage method based on machine learning and the use of the minimal spanning tree to store the graph attributes. Compared with the state-of-the-art approach, LEONARD, LESS reduces 6.29 times in storage time, while also achieving a 5.24 times reduction in disk usage and an 18.3 times faster query speed while using only 11.5% of the memory on DARPA TC dataset.

Title: PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution

Authors: Libo Zhu, Jianze Li, Haotong Qin, Yulun Zhang, Yong Guo, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17106
Pdf URL: https://arxiv.org/pdf/2411.17106
Copy Paste: [[2411.17106]] PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution(https://arxiv.org/abs/2411.17106)
Keywords: diffusion
Abstract: Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even though the denoising step has been reduced to one, they require high computational costs and storage requirements, making it difficult for deployment on hardware devices. To address these issues, we propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR. First, we simplify OSD model to two core components, UNet and Variational Autoencoder (VAE) by removing the CLIPEncoder. Secondly, we propose Learnable Boundary Quantizer (LBQ) and Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR with 8-bit and 6-bit obtains comparable visual results with full-precision model. Moreover, our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be at this https URL.

Title: Learning from Noisy Labels via Conditional Distributionally Robust Optimization

Authors: Hui Guo, Grace Y. Yi, Boyu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17113
Pdf URL: https://arxiv.org/pdf/2411.17113
Copy Paste: [[2411.17113]] Learning from Noisy Labels via Conditional Distributionally Robust Optimization(https://arxiv.org/abs/2411.17113)
Keywords: robust
Abstract: While crowdsourcing has emerged as a practical solution for labeling large datasets, it presents a significant challenge in learning accurate models due to noisy labels from annotators with varying levels of expertise. Existing methods typically estimate the true label posterior, conditioned on the instance and noisy annotations, to infer true labels or adjust loss functions. These estimates, however, often overlook potential misspecification in the true label posterior, which can degrade model performances, especially in high-noise scenarios. To address this issue, we investigate learning from noisy annotations with an estimated true label posterior through the framework of conditional distributionally robust optimization (CDRO). We propose formulating the problem as minimizing the worst-case risk within a distance-based ambiguity set centered around a reference distribution. By examining the strong duality of the formulation, we derive upper bounds for the worst-case risk and develop an analytical solution for the dual robust risk for each data point. This leads to a novel robust pseudo-labeling algorithm that leverages the likelihood ratio test to construct a pseudo-empirical distribution, providing a robust reference probability distribution in CDRO. Moreover, to devise an efficient algorithm for CDRO, we derive a closed-form expression for the empirical robust risk and the optimal Lagrange multiplier of the dual problem, facilitating a principled balance between robustness and model fitting. Our experimental results on both synthetic and real-world datasets demonstrate the superiority of our method.

Title: Star Attention: Efficient LLM Inference over Long Sequences

Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17116
Pdf URL: https://arxiv.org/pdf/2411.17116
Copy Paste: [[2411.17116]] Star Attention: Efficient LLM Inference over Long Sequences(https://arxiv.org/abs/2411.17116)
Keywords: transformer, large language model
Abstract: Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.

Title: Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos

Authors: Nouar AlDahoul, Myles Joshua Toledo Tan, Harishwar Reddy Kasireddy, Yasir Zaki
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17123
Pdf URL: https://arxiv.org/pdf/2411.17123
Copy Paste: [[2411.17123]] Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos(https://arxiv.org/abs/2411.17123)
Keywords: large language model
Abstract: The widespread dissemination of hate speech, harassment, harmful and sexual content, and violence across websites and media platforms presents substantial challenges and provokes widespread concern among different sectors of society. Governments, educators, and parents are often at odds with media platforms about how to regulate, control, and limit the spread of such content. Technologies for detecting and censoring the media contents are a key solution to addressing these challenges. Techniques from natural language processing and computer vision have been used widely to automatically identify and filter out sensitive content such as offensive languages, violence, nudity, and addiction in both text, images, and videos, enabling platforms to enforce content policies at scale. However, existing methods still have limitations in achieving high detection accuracy with fewer false positives and false negatives. Therefore, more sophisticated algorithms for understanding the context of both text and image may open rooms for improvement in content censorship to build a more efficient censorship system. In this paper, we evaluate existing LLM-based content moderation solutions such as OpenAI moderation model and Llama-Guard3 and study their capabilities to detect sensitive contents. Additionally, we explore recent LLMs such as GPT, Gemini, and Llama in identifying inappropriate contents across media outlets. Various textual and visual datasets like X tweets, Amazon reviews, news articles, human photos, cartoons, sketches, and violence videos have been utilized for evaluation and comparison. The results demonstrate that LLMs outperform traditional techniques by achieving higher accuracy and lower false positive and false negative rates. This highlights the potential to integrate LLMs into websites, social media platforms, and video-sharing services for regulatory and content moderation purposes.

Title: DOGE: Towards Versatile Visual Document Grounding and Referring

Authors: Yinan Zhou, Yuxin Chen, Haokun Lin, Shuyu Yang, Li Zhu, Zhongang Qi, Chen Ma, Ying Shan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17125
Pdf URL: https://arxiv.org/pdf/2411.17125
Copy Paste: [[2411.17125]] DOGE: Towards Versatile Visual Document Grounding and Referring(https://arxiv.org/abs/2411.17125)
Keywords: large language model
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Eferring data engine (DOGE-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities; and instruction-tuning data to activate MLLM's grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGE-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluations for fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGE. This pioneering MLLM is capable of accurately referring and grounding texts at multiple granularities within document images. Our code, data, and model will be open-sourced for community development.

Title: From Machine Learning to Machine Unlearning: Complying with GDPR's Right to be Forgotten while Maintaining Business Value of Predictive Models

Authors: Yuncong Yang, Xiao Han, Yidong Chai, Reza Ebrahimi, Rouzbeh Behnia, Balaji Padmanabhan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17126
Pdf URL: https://arxiv.org/pdf/2411.17126
Copy Paste: [[2411.17126]] From Machine Learning to Machine Unlearning: Complying with GDPR's Right to be Forgotten while Maintaining Business Value of Predictive Models(https://arxiv.org/abs/2411.17126)
Keywords: privacy
Abstract: Recent privacy regulations (e.g., GDPR) grant data subjects the `Right to Be Forgotten' (RTBF) and mandate companies to fulfill data erasure requests from data subjects. However, companies encounter great challenges in complying with the RTBF regulations, particularly when asked to erase specific training data from their well-trained predictive models. While researchers have introduced machine unlearning methods aimed at fast data erasure, these approaches often overlook maintaining model performance (e.g., accuracy), which can lead to financial losses and non-compliance with RTBF obligations. This work develops a holistic machine learning-to-unlearning framework, called Ensemble-based iTerative Information Distillation (ETID), to achieve efficient data erasure while preserving the business value of predictive models. ETID incorporates a new ensemble learning method to build an accurate predictive model that can facilitate handling data erasure requests. ETID also introduces an innovative distillation-based unlearning method tailored to the constructed ensemble model to enable efficient and effective data erasure. Extensive experiments demonstrate that ETID outperforms various state-of-the-art methods and can deliver high-quality unlearned models with efficiency. We also highlight ETID's potential as a crucial tool for fostering a legitimate and thriving market for data and predictive services.

Title: Improving Resistance to Noisy Label Fitting by Reweighting Gradient in SAM

Authors: Hoang-Chau Luong, Thuc Nguyen-Quang, Minh-Triet Tran
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17132
Pdf URL: https://arxiv.org/pdf/2411.17132
Copy Paste: [[2411.17132]] Improving Resistance to Noisy Label Fitting by Reweighting Gradient in SAM(https://arxiv.org/abs/2411.17132)
Keywords: robust
Abstract: Noisy labels pose a substantial challenge in machine learning, often resulting in overfitting and poor generalization. Sharpness-Aware Minimization (SAM), as demonstrated in Foret et al. (2021), improves generalization over traditional Stochastic Gradient Descent (SGD) in classification tasks with noisy labels by implicitly slowing noisy learning. While SAM's ability to generalize in noisy environments has been studied in several simplified settings, its full potential in more realistic training settings remains underexplored. In this work, we analyze SAM's behavior at each iteration, identifying specific components of the gradient vector that contribute significantly to its robustness against noisy labels. Based on these insights, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), an effective variant that enhances SAM's ability to manage noisy fitting rate. Our experiments on CIFAR-10, CIFAR-100, and Mini-WebVision demonstrate that SANER consistently outperforms SAM, achieving up to an 8% increase on CIFAR-100 with 50% label noise.

Title: Crack Detection in Infrastructure Using Transfer Learning, Spatial Attention, and Genetic Algorithm Optimization

Authors: Feng Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17140
Pdf URL: https://arxiv.org/pdf/2411.17140
Copy Paste: [[2411.17140]] Crack Detection in Infrastructure Using Transfer Learning, Spatial Attention, and Genetic Algorithm Optimization(https://arxiv.org/abs/2411.17140)
Keywords: extraction
Abstract: Crack detection plays a pivotal role in the maintenance and safety of infrastructure, including roads, bridges, and buildings, as timely identification of structural damage can prevent accidents and reduce costly repairs. Traditionally, manual inspection has been the norm, but it is labor-intensive, subjective, and hazardous. This paper introduces an advanced approach for crack detection in infrastructure using deep learning, leveraging transfer learning, spatial attention mechanisms, and genetic algorithm(GA) optimization. To address the challenge of the inaccessability of large amount of data, we employ ResNet50 as a pre-trained model, utilizing its strong feature extraction capabilities while reducing the need for extensive training datasets. We enhance the model with a spatial attention layer as well as a customized neural network which architecture was fine-tuned using GA. A comprehensive case study demonstrates the effectiveness of the proposed Attention-ResNet50-GA model, achieving a precision of 0.9967 and an F1 score of 0.9983, outperforming conventional methods. The results highlight the model's ability to accurately detect cracks in various conditions, making it highly suitable for real-world applications where large annotated datasets are scarce.

Title: Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation

Authors: Xu Zheng, Haiwei Xue, Jialei Chen, Yibo Yan, Lutao Jiang, Yuanhuiyi Lyu, Kailun Yang, Linfeng Zhang, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17141
Pdf URL: https://arxiv.org/pdf/2411.17141
Copy Paste: [[2411.17141]] Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation(https://arxiv.org/abs/2411.17141)
Keywords: robust, segmentation
Abstract: Simultaneously using multimodal inputs from multiple sensors to train segmentors is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where multimodal segmentors over rely on certain modalities, causing performance drops when others are missing, common in real world applications. To this end, we develop the first framework for learning robust segmentor that can handle any combinations of visual modalities. Specifically, we first introduce a parallel multimodal learning strategy for learning a strong teacher. The cross-modal and unimodal distillation is then achieved in the multi scale representation space by transferring the feature level knowledge from multimodal to anymodal segmentors, aiming at addressing the unimodal bias and avoiding over-reliance on specific modalities. Moreover, a prediction level modality agnostic semantic distillation is proposed to achieve semantic knowledge transferring for segmentation. Extensive experiments on both synthetic and real-world multi-sensor benchmarks demonstrate that our method achieves superior performance.

Title: Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Authors: Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17150
Pdf URL: https://arxiv.org/pdf/2411.17150
Copy Paste: [[2411.17150]] Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2411.17150)
Keywords: segmentation
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet, a critical issue persists: lack of object-level context consideration when segmenting complex objects in the challenging environment of OVSS based on arbitrary query prompts. This oversight limits models' ability to group semantically consistent elements within object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model enhances intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.

Title: Enhancing Lane Segment Perception and Topology Reasoning with Crowdsourcing Trajectory Priors

Authors: Peijin Jia, Ziang Luo, Tuopu Wen, Mengmeng Yang, Kun Jiang, Le Cui, Diange Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17161
Pdf URL: https://arxiv.org/pdf/2411.17161
Copy Paste: [[2411.17161]] Enhancing Lane Segment Perception and Topology Reasoning with Crowdsourcing Trajectory Priors(https://arxiv.org/abs/2411.17161)
Keywords: robust
Abstract: In autonomous driving, recent advances in lane segment perception provide autonomous vehicles with a comprehensive understanding of driving scenarios. Moreover, incorporating prior information input into such perception model represents an effective approach to ensure the robustness and accuracy. However, utilizing diverse sources of prior information still faces three key challenges: the acquisition of high-quality prior information, alignment between prior and online perception, efficient integration. To address these issues, we investigate prior augmentation from a novel perspective of trajectory priors. In this paper, we initially extract crowdsourcing trajectory data from Argoverse2 motion forecasting dataset and encode trajectory data into rasterized heatmap and vectorized instance tokens, then we incorporate such prior information into the online mapping model through different ways. Besides, with the purpose of mitigating the misalignment between prior and online perception, we design a confidence-based fusion module that takes alignment into account during the fusion process. We conduct extensive experiments on OpenLane-V2 dataset. The results indicate that our method's performance significantly outperforms the current state-of-the-art methods.

Title: OSDFace: One-Step Diffusion Model for Face Restoration

Authors: Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17163
Pdf URL: https://arxiv.org/pdf/2411.17163
Copy Paste: [[2411.17163]] OSDFace: One-Step Diffusion Model for Face Restoration(https://arxiv.org/abs/2411.17163)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject's identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released at this https URL.

Title: MRIFE: A Mask-Recovering and Interactive-Feature-Enhancing Semantic Segmentation Network For Relic Landslide Detection

Authors: Juefei He, Yuexing Peng, Wei Li, Junchuan Yu, Daqing Ge, Wei Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17167
Pdf URL: https://arxiv.org/pdf/2411.17167
Copy Paste: [[2411.17167]] MRIFE: A Mask-Recovering and Interactive-Feature-Enhancing Semantic Segmentation Network For Relic Landslide Detection(https://arxiv.org/abs/2411.17167)
Keywords: extraction, segmentation
Abstract: Relic landslide, formed over a long period, possess the potential for reactivation, making them a hazardous geological phenomenon. While reliable relic landslide detection benefits the effective monitoring and prevention of landslide disaster, semantic segmentation using high-resolution remote sensing images for relic landslides faces many challenges, including the object visual blur problem, due to the changes of appearance caused by prolonged natural evolution and human activities, and the small-sized dataset problem, due to difficulty in recognizing and labelling the samples. To address these challenges, a semantic segmentation model, termed mask-recovering and interactive-feature-enhancing (MRIFE), is proposed for more efficient feature extraction and separation. Specifically, a contrastive learning and mask reconstruction method with locally significant feature enhancement is proposed to improve the ability to distinguish between the target and background and represent landslide semantic features. Meanwhile, a dual-branch interactive feature enhancement architecture is used to enrich the extracted features and address the issue of visual ambiguity. Self-distillation learning is introduced to leverage the feature diversity both within and between samples for contrastive learning, improving sample utilization, accelerating model convergence, and effectively addressing the problem of the small-sized dataset. The proposed MRIFE is evaluated on a real relic landslide dataset, and experimental results show that it greatly improves the performance of relic landslide detection. For the semantic segmentation task, compared to the baseline, the precision increases from 0.4226 to 0.5347, the mean intersection over union (IoU) increases from 0.6405 to 0.6680, the landslide IoU increases from 0.3381 to 0.3934, and the F1-score increases from 0.5054 to 0.5646.

Title: Learning Monotonic Attention in Transducer for Streaming Generation

Authors: Zhengrui Ma, Yang Feng, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17170
Pdf URL: https://arxiv.org/pdf/2411.17170
Copy Paste: [[2411.17170]] Learning Monotonic Attention in Transducer for Streaming Generation(https://arxiv.org/abs/2411.17170)
Keywords: robust
Abstract: Streaming generation models are increasingly utilized across various fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these contexts. In this research, we address this issue by tightly integrating Transducer's decoding with the history of input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention in training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution for Transducer-based frameworks to tackle more complex streaming generation tasks.

Title: ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Authors: Chengyou Jia, Changliang Xia, Zhuohang Dang, Weijia Wu, Hangwei Qian, Minnan Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17176
Pdf URL: https://arxiv.org/pdf/2411.17176
Copy Paste: [[2411.17176]] ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting(https://arxiv.org/abs/2411.17176)
Keywords: generative
Abstract: Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available in \url{this https URL}

Title: LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

Authors: Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17178
Pdf URL: https://arxiv.org/pdf/2411.17178
Copy Paste: [[2411.17178]] LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization(https://arxiv.org/abs/2411.17178)
Keywords: diffusion
Abstract: Visual Autoregressive (VAR) has emerged as a promising approach in image generation, offering competitive potential and performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we conducted analysis and identified significant redundancy in three dimensions of the VAR model: (1) the attention map, (2) the attention outputs when using classifier free guidance, and (3) the data precision. Correspondingly, we proposed efficient attention mechanism and low-bit quantization method to enhance the efficiency of VAR models while maintaining performance. With negligible performance lost (less than 0.056 FID increase), we could achieve 85.2% reduction in attention computation, 50% reduction in overall memory and 1.5x latency reduction. To ensure deployment feasibility, we developed efficient training-free compression techniques and analyze the deployment feasibility and efficiency gain of each technique.

Title: An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models

Authors: Yunzhe Hu, Difan Zou, Dong Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17182
Pdf URL: https://arxiv.org/pdf/2411.17182
Copy Paste: [[2411.17182]] An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models(https://arxiv.org/abs/2411.17182)
Keywords: transformer
Abstract: Deep neural networks have long been criticized for being black-box. To unveil the inner workings of modern neural architectures, a recent work \cite{yu2024white} proposed an information-theoretic objective function called Sparse Rate Reduction (SRR) and interpreted its unrolled optimization as a Transformer-like model called Coding Rate Reduction Transformer (CRATE). However, the focus of the study was primarily on the basic implementation, and whether this objective is optimized in practice and its causal relationship to generalization remain elusive. Going beyond this study, we derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization. Surprisingly, we find out that SRR has a positive correlation coefficient and outperforms other baseline measures, such as path-norm and sharpness-based ones. Furthermore, we show that generalization can be improved using SRR as regularization on benchmark image classification datasets. We hope this paper can shed light on leveraging SRR to design principled models and study their generalization ability.

Title: E-Trojans: Ransomware, Tracking, DoS, and Data Leaks on Battery-powered Embedded Systems

Authors: Marco Casagrande, Riccardo Cestaro, Eleonora Losiouk, Mauro Conti, Daniele Antonioli
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.17184
Pdf URL: https://arxiv.org/pdf/2411.17184
Copy Paste: [[2411.17184]] E-Trojans: Ransomware, Tracking, DoS, and Data Leaks on Battery-powered Embedded Systems(https://arxiv.org/abs/2411.17184)
Keywords: security, privacy, attack
Abstract: Battery-powered embedded systems (BESs) have become ubiquitous. Their internals include a battery management system (BMS), a radio interface, and a motor controller. Despite their associated risk, there is little research on BES internal attack surfaces. To fill this gap, we present the first security and privacy assessment of e-scooters internals. We cover Xiaomi M365 (2016) and ES3 (2023) e-scooters and their interactions with Mi Home (their companion app). We extensively RE their internals and uncover four critical design vulnerabilities, including a remote code execution issue with their BMS. Based on our RE findings, we develop E-Trojans, four novel attacks targeting BES internals. The attacks can be conducted remotely or in wireless proximity. They have a widespread real-world impact as they violate the Xiaomi e-scooter ecosystem safety, security, availability, and privacy. For instance, one attack allows the extortion of money from a victim via a BMS undervoltage battery ransomware. A second one enables user tracking by fingerprinting the BES internals. With extra RE efforts, the attacks can be ported to other BES featuring similar vulnerabilities. We implement our attacks and RE findings in E-Trojans, a modular and low-cost toolkit to test BES internals. Our toolkit binary patches BMS firmware by adding malicious capabilities. It also implements our undervoltage battery ransomware in an Android app with a working backend. We successfully test our four attacks on M365 and ES3, empirically confirming their effectiveness and practicality. We propose four practical countermeasures to fix our attacks and improve the Xiaomi e-scooter ecosystem security and privacy.

Title: PhysMotion: Physics-Grounded Dynamics From a Single Image

Authors: Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, Chenfanfu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17189
Pdf URL: https://arxiv.org/pdf/2411.17189
Copy Paste: [[2411.17189]] PhysMotion: Physics-Grounded Dynamics From a Single Image(https://arxiv.org/abs/2411.17189)
Keywords: diffusion, generative
Abstract: We introduce PhysMotion, a novel framework that leverages principled physics-based simulations to guide intermediate 3D representations generated from a single image and input conditions (e.g., applied force and torque), producing high-quality, physically plausible video generation. By utilizing continuum mechanics-based simulations as a prior knowledge, our approach addresses the limitations of traditional data-driven generative models and result in more consistent physically plausible motions. Our framework begins by reconstructing a feed-forward 3D Gaussian from a single image through geometry optimization. This representation is then time-stepped using a differentiable Material Point Method (MPM) with continuum mechanics-based elastoplasticity models, which provides a strong foundation for realistic dynamics, albeit at a coarse level of detail. To enhance the geometry, appearance and ensure spatiotemporal consistency, we refine the initial simulation using a text-to-image (T2I) diffusion model with cross-frame attention, resulting in a physically plausible video that retains intricate details comparable to the input image. We conduct comprehensive qualitative and quantitative evaluations to validate the efficacy of our method. Our project page is available at: \url{this https URL}.

Title: Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks

Authors: Ratnesh Kumar Joshi, Priyanshu Priya, Vishesh Desai, Saurav Dudhate, Siddhant Senapati, Asif Ekbal, Roshni Ramnani, Anutosh Maitra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17204
Pdf URL: https://arxiv.org/pdf/2411.17204
Copy Paste: [[2411.17204]] Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks(https://arxiv.org/abs/2411.17204)
Keywords: large language model
Abstract: Given the advancements in conversational artificial intelligence, the evaluation and assessment of Large Language Models (LLMs) play a crucial role in ensuring optimal performance across various conversational tasks. In this paper, we present a comprehensive study that thoroughly evaluates the capabilities and limitations of five prevalent LLMs: Llama, OPT, Falcon, Alpaca, and MPT. The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation. To conduct the evaluation, an extensive test setup is employed, utilizing multiple evaluation criteria that span from automatic to human evaluation. This includes using generic and task-specific metrics to gauge the LMs' performance accurately. From our evaluation, no single model emerges as universally optimal for all tasks. Instead, their performance varies significantly depending on the specific requirements of each task. While some models excel in certain tasks, they may demonstrate comparatively poorer performance in others. These findings emphasize the importance of considering task-specific requirements and characteristics when selecting the most suitable LM for conversational applications.

Title: LampMark: Proactive Deepfake Detection via Training-Free Landmark Perceptual Watermarks

Authors: Tianyi Wang, Mengxiao Huang, Harry Cheng, Xiao Zhang, Zhiqi Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17209
Pdf URL: https://arxiv.org/pdf/2411.17209
Copy Paste: [[2411.17209]] LampMark: Proactive Deepfake Detection via Training-Free Landmark Perceptual Watermarks(https://arxiv.org/abs/2411.17209)
Keywords: secure, privacy, protect, attack, robust, watermark
Abstract: Deepfake facial manipulation has garnered significant public attention due to its impacts on enhancing human experiences and posing privacy threats. Despite numerous passive algorithms that have been attempted to thwart malicious Deepfake attacks, they mostly struggle with the generalizability challenge when confronted with hyper-realistic synthetic facial images. To tackle the problem, this paper proposes a proactive Deepfake detection approach by introducing a novel training-free landmark perceptual watermark, LampMark for short. We first analyze the structure-sensitive characteristics of Deepfake manipulations and devise a secure and confidential transformation pipeline from the structural representations, i.e. facial landmarks, to binary landmark perceptual watermarks. Subsequently, we present an end-to-end watermarking framework that imperceptibly and robustly embeds and extracts watermarks concerning the images to be protected. Relying on promising watermark recovery accuracies, Deepfake detection is accomplished by assessing the consistency between the content-matched landmark perceptual watermark and the robustly recovered watermark of the suspect image. Experimental results demonstrate the superior performance of our approach in watermark recovery and Deepfake detection compared to state-of-the-art methods across in-dataset, cross-dataset, and cross-manipulation scenarios.

Title: Scaling nnU-Net for CBCT Segmentation

Authors: Fabian Isensee, Yannick Kirchhoff, Lars Kraemer, Maximilian Rokuss, Constantin Ulrich, Klaus H. Maier-Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17213
Pdf URL: https://arxiv.org/pdf/2411.17213
Copy Paste: [[2411.17213]] Scaling nnU-Net for CBCT Segmentation(https://arxiv.org/abs/2411.17213)
Keywords: fair, segmentation
Abstract: This paper presents our approach to scaling the nnU-Net framework for multi-structure segmentation on Cone Beam Computed Tomography (CBCT) images, specifically in the scope of the ToothFairy2 Challenge. We leveraged the nnU-Net ResEnc L model, introducing key modifications to patch size, network topology, and data augmentation strategies to address the unique challenges of dental CBCT imaging. Our method achieved a mean Dice coefficient of 0.9253 and HD95 of 18.472 on the test set, securing a mean rank of 4.6 and with it the first place in the ToothFairy2 challenge. The source code is publicly available, encouraging further research and development in the field.

Title: MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution

Authors: Chengxing Xie, Xiaoming Zhang, Kai Zhang, Linze Li, Yuqian Fu, Biao Gong, Tianrui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17214
Pdf URL: https://arxiv.org/pdf/2411.17214
Copy Paste: [[2411.17214]] MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution(https://arxiv.org/abs/2411.17214)
Keywords: extraction, transformer
Abstract: Recent advances in image super-resolution (SR) have significantly benefited from the incorporation of Transformer architectures. However, conventional techniques aimed at enlarging the self-attention window to capture broader contexts come with inherent drawbacks, especially the significantly increased computational demands. Moreover, the feature perception within a fixed-size window of existing models restricts the effective receptive fields and the intermediate feature diversity. This study demonstrates that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce Multi-Range Attention Transformer (MAT) tailored for SR tasks. MAT leverages the computational advantages inherent in dilation operation, in conjunction with self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Further coupled with local feature extraction, MAT adeptly capture dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations. We also introduce the MSConvStar module, which augments the model's ability for multi-range representation learning. Comprehensive experiments show that our MAT exhibits superior performance to existing state-of-the-art SR models with remarkable efficiency (~3.3 faster than SRFormer-light).

Title: Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning

Authors: Hui-Yue Yang, Hui Chen, Ao Wang, Kai Chen, Zijia Lin, Yongliang Tang, Pengcheng Gao, Yuming Quan, Jungong Han, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17217
Pdf URL: https://arxiv.org/pdf/2411.17217
Copy Paste: [[2411.17217]] Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning(https://arxiv.org/abs/2411.17217)
Keywords: segmentation
Abstract: Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perceptinon Tuning (SPT) method, aiming to enhance SAM's perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness. Models and codes will be available online.

Title: GraphSubDetector: Time Series Subsequence Anomaly Detection via Density-Aware Adaptive Graph Neural Network

Authors: Weiqi Chen, Zhiqiang Zhou, Qingsong Wen, Liang Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17218
Pdf URL: https://arxiv.org/pdf/2411.17218
Copy Paste: [[2411.17218]] GraphSubDetector: Time Series Subsequence Anomaly Detection via Density-Aware Adaptive Graph Neural Network(https://arxiv.org/abs/2411.17218)
Keywords: robust
Abstract: Time series subsequence anomaly detection is an important task in a large variety of real-world applications ranging from health monitoring to AIOps, and is challenging due to the following reasons: 1) how to effectively learn complex dynamics and dependencies in time series; 2) diverse and complicated anomalous subsequences as well as the inherent variance and noise of normal patterns; 3) how to determine the proper subsequence length for effective detection, which is a required parameter for many existing algorithms. In this paper, we present a novel approach to subsequence anomaly detection, namely GraphSubDetector. First, it adaptively learns the appropriate subsequence length with a length selection mechanism that highlights the characteristics of both normal and anomalous patterns. Second, we propose a density-aware adaptive graph neural network (DAGNN), which can generate further robust representations against variance of normal data for anomaly detection by message passing between subsequences. The experimental results demonstrate the effectiveness of the proposed algorithm, which achieves superior performance on multiple time series anomaly benchmark datasets compared to state-of-the-art algorithms.

Title: DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting

Authors: Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17223
Pdf URL: https://arxiv.org/pdf/2411.17223
Copy Paste: [[2411.17223]] DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting(https://arxiv.org/abs/2411.17223)
Keywords: diffusion, generative
Abstract: Subject-driven image inpainting has emerged as a popular task in image editing alongside recent advancements in diffusion models. Previous methods primarily focus on identity preservation but struggle to maintain the editability of inserted objects. In response, this paper introduces DreamMix, a diffusion-based generative model adept at inserting target objects into given scenes at user-specified locations while concurrently enabling arbitrary text-driven modifications to their attributes. In particular, we leverage advanced foundational inpainting models and introduce a disentangled local-global inpainting framework to balance precise local object insertion with effective global visual coherence. Additionally, we propose an Attribute Decoupling Mechanism (ADM) and a Textual Attribute Substitution (TAS) module to improve the diversity and discriminative capability of the text-based attribute guidance, respectively. Extensive experiments demonstrate that DreamMix effectively balances identity preservation and attribute editability across various application scenarios, including object insertion, attribute editing, and small object inpainting. Our code is publicly available at this https URL.

Title: MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers

Authors: Ruoxi Zhu, Zhengzhong Tu, Jiaming Liu, Alan C. Bovik, Yibo Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17226
Pdf URL: https://arxiv.org/pdf/2411.17226
Copy Paste: [[2411.17226]] MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers(https://arxiv.org/abs/2411.17226)
Keywords: transformer
Abstract: Restoring images captured under adverse weather conditions is a fundamental task for many computer vision applications. However, most existing weather restoration approaches are only capable of handling a specific type of degradation, which is often insufficient in real-world scenarios, such as rainy-snowy or rainy-hazy weather. Towards being able to address these situations, we propose a multi-weather Transformer, or MWFormer for short, which is a holistic vision Transformer that aims to solve multiple weather-induced degradations using a single, unified architecture. MWFormer uses hyper-networks and feature-wise linear modulation blocks to restore images degraded by various weather types using the same set of learned parameters. We first employ contrastive learning to train an auxiliary network that extracts content-independent, distortion-aware feature embeddings that efficiently represent predicted weather types, of which more than one may occur. Guided by these weather-informed predictions, the image restoration Transformer adaptively modulates its parameters to conduct both local and global feature processing, in response to multiple possible weather. Moreover, MWFormer allows for a novel way of tuning, during application, to either a single type of weather restoration or to hybrid weather restoration without any retraining, offering greater controllability than existing methods. Our experimental results on multi-weather restoration benchmarks show that MWFormer achieves significant performance improvements compared to existing state-of-the-art methods, without requiring much computational cost. Moreover, we demonstrate that our methodology of using hyper-networks can be integrated into various network architectures to further boost their performance. The code is available at: this https URL

Title: MLI-NeRF: Multi-Light Intrinsic-Aware Neural Radiance Fields

Authors: Yixiong Yang, Shilin Hu, Haoyu Wu, Ramon Baldrich, Dimitris Samaras, Maria Vanrell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17235
Pdf URL: https://arxiv.org/pdf/2411.17235
Copy Paste: [[2411.17235]] MLI-NeRF: Multi-Light Intrinsic-Aware Neural Radiance Fields(https://arxiv.org/abs/2411.17235)
Keywords: robust
Abstract: Current methods for extracting intrinsic image components, such as reflectance and shading, primarily rely on statistical priors. These methods focus mainly on simple synthetic scenes and isolated objects and struggle to perform well on challenging real-world data. To address this issue, we propose MLI-NeRF, which integrates \textbf{M}ultiple \textbf{L}ight information in \textbf{I}ntrinsic-aware \textbf{Ne}ural \textbf{R}adiance \textbf{F}ields. By leveraging scene information provided by different light source positions complementing the multi-view information, we generate pseudo-label images for reflectance and shading to guide intrinsic image decomposition without the need for ground truth data. Our method introduces straightforward supervision for intrinsic component separation and ensures robustness across diverse scene types. We validate our approach on both synthetic and real-world datasets, outperforming existing state-of-the-art methods. Additionally, we demonstrate its applicability to various image editing tasks. The code and data are publicly available.

Title: From Graph Diffusion to Graph Classification

Authors: Jia Jun Cheng Xian, Sadegh Mahdavi, Renjie Liao, Oliver Schulte
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17236
Pdf URL: https://arxiv.org/pdf/2411.17236
Copy Paste: [[2411.17236]] From Graph Diffusion to Graph Classification(https://arxiv.org/abs/2411.17236)
Keywords: diffusion, generative
Abstract: Generative models such as diffusion models have achieved remarkable success in state-of-the-art image and text tasks. Recently, score-based diffusion models have extended their success beyond image generation, showing competitive performance with discriminative methods in image {\em classification} tasks~\cite{zimmermann2021score}. However, their application to classification in the {\em graph} domain, which presents unique challenges such as complex topologies, remains underexplored. We show how graph diffusion models can be applied for graph classification. We find that to achieve competitive classification accuracy, score-based graph diffusion models should be trained with a novel training objective that is tailored to graph classification. In experiments with a sampling-based inference method, our discriminative training objective achieves state-of-the-art graph classification accuracy.

Title: Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment

Authors: Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17237
Pdf URL: https://arxiv.org/pdf/2411.17237
Copy Paste: [[2411.17237]] Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment(https://arxiv.org/abs/2411.17237)
Keywords: large language model
Abstract: The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates the model grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate the more fine-grained IQA application. Code: this https URL.

Title: Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Authors: Junyuan Deng, Wei Yin, Xiaoyang Guo, Qian Zhang, Xiaotao Hu, Weiqiang Ren, Xiaoxiao Long, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17240
Pdf URL: https://arxiv.org/pdf/2411.17240
Copy Paste: [[2411.17240]] Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration(https://arxiv.org/abs/2411.17240)
Keywords: diffusion
Abstract: In this paper, we present DM-Calib, a diffusion-based approach for estimating pin- hole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, re- sulting in poor generalization across diverse real-world images. Recent advance- ments in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence in- dicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Im- age, which losslessly encodes the numerical camera intrinsics and integrates seam- lessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Im- age conditioned on an input image. By fine-tuning a stable diffusion model to gen- erate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach signifi- cantly outperforms baselines and provides broad benefits to 3D vision tasks. Code is available at this https URL.

Title: DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model

Authors: JiHwan Moon, Jihoon Park, Jungeun Kim, Jongseong Bae, Hyeongwoo Jeon, Ha Young Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17248
Pdf URL: https://arxiv.org/pdf/2411.17248
Copy Paste: [[2411.17248]] DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model(https://arxiv.org/abs/2411.17248)
Keywords: diffusion
Abstract: Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverages a diffusion model, enabling diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of input video. To enhance visual conditioning, we design Guidance Fusion Module, which fully utilizes the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over previous gloss-free SLT methods and achieve state-of-the-art performance on two SLT datasets, thereby markedly improving translation quality.

Title: DGNN-YOLO: Dynamic Graph Neural Networks with YOLO11 for Small Object Detection and Tracking in Traffic Surveillance

Authors: Shahriar Soudeep, M. F. Mridha, Md Abrar Jahin, Nilanjan Dey
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17251
Pdf URL: https://arxiv.org/pdf/2411.17251
Copy Paste: [[2411.17251]] DGNN-YOLO: Dynamic Graph Neural Networks with YOLO11 for Small Object Detection and Tracking in Traffic Surveillance(https://arxiv.org/abs/2411.17251)
Keywords: robust, extraction
Abstract: Accurate detection and tracking of small objects such as pedestrians, cyclists, and motorbikes are critical for traffic surveillance systems, which are crucial in improving road safety and decision-making in intelligent transportation systems. However, traditional methods struggle with challenges such as occlusion, low resolution, and dynamic traffic conditions, necessitating innovative approaches to address these limitations. This paper introduces DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN) with YOLO11 to enhance small object detection and tracking in traffic surveillance systems. The framework leverages YOLO11's advanced spatial feature extraction capabilities for precise object detection and incorporates DGNN to model spatial-temporal relationships for robust real-time tracking dynamically. By constructing and updating graph structures, DGNN-YOLO effectively represents objects as nodes and their interactions as edges, ensuring adaptive and accurate tracking in complex and dynamic environments. Extensive experiments demonstrate that DGNN-YOLO consistently outperforms state-of-the-art methods in detecting and tracking small objects under diverse traffic conditions, achieving the highest precision (0.8382), recall (0.6875), and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability, particularly in challenging scenarios involving small and occluded objects. This work provides a scalable, real-time traffic surveillance and analysis solution, significantly contributing to intelligent transportation systems.

Title: APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents

Authors: Jun Yu Chen, Tao Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17255
Pdf URL: https://arxiv.org/pdf/2411.17255
Copy Paste: [[2411.17255]] APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents(https://arxiv.org/abs/2411.17255)
Keywords: diffusion, large language model
Abstract: We present APT, an advanced Large Language Model (LLM)-driven framework that enables autonomous agents to construct complex and creative structures within the Minecraft environment. Unlike previous approaches that primarily concentrate on skill-based open-world tasks or rely on image-based diffusion models for generating voxel-based structures, our method leverages the intrinsic spatial reasoning capabilities of LLMs. By employing chain-of-thought decomposition along with multimodal inputs, the framework generates detailed architectural layouts and blueprints that the agent can execute under zero-shot or few-shot learning scenarios. Our agent incorporates both memory and reflection modules to facilitate lifelong learning, adaptive refinement, and error correction throughout the building process. To rigorously evaluate the agent's performance in this emerging research area, we introduce a comprehensive benchmark consisting of diverse construction tasks designed to test creativity, spatial reasoning, adherence to in-game rules, and the effective integration of multimodal instructions. Experimental results using various GPT-based LLM backends and agent configurations demonstrate the agent's capacity to accurately interpret extensive instructions involving numerous items, their positions, and orientations. The agent successfully produces complex structures complete with internal functionalities such as Redstone-powered systems. A/B testing indicates that the inclusion of a memory module leads to a significant increase in performance, emphasizing its role in enabling continuous learning and the reuse of accumulated experience. Additionally, the agent's unexpected emergence of scaffolding behavior highlights the potential of future LLM-driven agents to utilize subroutine planning and leverage the emergence ability of LLMs to autonomously develop human-like problem-solving techniques.

Title: Disentangled Interpretable Representation for Efficient Long-term Time Series Forecasting

Authors: Yuang Zhao, Tianyu Li, Jiadong Chen, Shenrong Ye, Fuxin Jiang, Tieying Zhang, Xiaofeng Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17257
Pdf URL: https://arxiv.org/pdf/2411.17257
Copy Paste: [[2411.17257]] Disentangled Interpretable Representation for Efficient Long-term Time Series Forecasting(https://arxiv.org/abs/2411.17257)
Keywords: interpretability
Abstract: Industry 5.0 introduces new challenges for Long-term Time Series Forecasting (LTSF), characterized by high-dimensional, high-resolution data and high-stakes application scenarios. Against this backdrop, developing efficient and interpretable models for LTSF becomes a key challenge. Existing deep learning and linear models often suffer from excessive parameter complexity and lack intuitive interpretability. To address these issues, we propose DiPE-Linear, a Disentangled interpretable Parameter-Efficient Linear network. DiPE-Linear incorporates three temporal components: Static Frequential Attention (SFA), Static Temporal Attention (STA), and Independent Frequential Mapping (IFM). These components alternate between learning in the frequency and time domains to achieve disentangled interpretability. The decomposed model structure reduces parameter complexity from quadratic in fully connected networks (FCs) to linear and computational complexity from quadratic to log-linear. Additionally, a Low-Rank Weight Sharing policy enhances the model's ability to handle multivariate series. Despite operating within a subspace of FCs with limited expressive capacity, DiPE-Linear demonstrates comparable or superior performance to both FCs and nonlinear models across multiple open-source and real-world LTSF datasets, validating the effectiveness of its sophisticatedly designed structure. The combination of efficiency, accuracy, and interpretability makes DiPE-Linear a strong candidate for advancing LTSF in both research and real-world applications. The source code is available at this https URL.

Title: HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator

Authors: Fan Yang, Ru Zhen, Jianing Wang, Yanhao Zhang, Haoxiang Chen, Haonan Lu, Sicheng Zhao, Guiguang Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17261
Pdf URL: https://arxiv.org/pdf/2411.17261
Copy Paste: [[2411.17261]] HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator(https://arxiv.org/abs/2411.17261)
Keywords: interpretability, explainability, large language model
Abstract: AIGC images are prevalent across various fields, yet they frequently suffer from quality issues like artifacts and unnatural textures. Specialized models aim to predict defect region heatmaps but face two primary challenges: (1) lack of explainability, failing to provide reasons and analyses for subtle defects, and (2) inability to leverage common sense and logical reasoning, leading to poor generalization. Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details; and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable image Implausibility Evaluator. We introduce the CoT-Driven Explainable Trinity Evaluator, which integrates heatmaps, scores, and explanation outputs, using CoT to decompose complex tasks into subtasks of increasing difficulty and enhance interpretability. Our Adaptive Hierarchical Implausibility Mapper synergizes low-level image features with high-level mapper tokens from LLMs, enabling precise local-to-global hierarchical heatmap predictions through an uncertainty-based adaptive token approach. Moreover, we propose a new dataset: Expl-AIGI-Eval, designed to facilitate interpretable implausibility evaluation of AIGC images. Our method demonstrates state-of-the-art performance through extensive experiments.

Title: A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs

Authors: Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, Lu Sheng
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17265
Pdf URL: https://arxiv.org/pdf/2411.17265
Copy Paste: [[2411.17265]] A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs(https://arxiv.org/abs/2411.17265)
Keywords: robust, large language model
Abstract: Aligning the behaviors of Multimodal Large Language Models (MLLMs) with human preferences is crucial for developing robust and trustworthy AI systems. While recent attempts have employed human experts or powerful auxiliary AI systems to provide more accurate preference feedback, such as determining the preferable responses from MLLMs or directly rewriting hallucination-free responses, extensive resource overhead compromise the scalability of the feedback collection. In this work, we introduce Topic-level Preference Overwriting (TPO), a self-correctional approach that guide the model itself to mitigate its own hallucination at the topic level. Through a deconfounded strategy that replaces each topic within the response with the best or worst alternatives generated by the model itself, TPO creates more contrasting pairwise preference feedback, enhancing the feedback quality without human or proprietary model intervention. Notably, the experimental results demonstrate proposed TPO achieves state-of-the-art performance in trustworthiness, significantly reducing the object hallucinations by 92% and overall hallucinations by 38%. Code, model and data will be released.

Title: BadScan: An Architectural Backdoor Attack on Visual State Space Models

Authors: Om Suhas Deshmukh, Sankalp Nagaonkar, Achyut Mani Tripathi, Ashish Mishra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17283
Pdf URL: https://arxiv.org/pdf/2411.17283
Copy Paste: [[2411.17283]] BadScan: An Architectural Backdoor Attack on Visual State Space Models(https://arxiv.org/abs/2411.17283)
Keywords: attack, robust, transformer
Abstract: The newly introduced Visual State Space Model (VMamba), which employs \textit{State Space Mechanisms} (SSM) to interpret images as sequences of patches, has shown exceptional performance compared to Vision Transformers (ViT) across various computer vision tasks. However, recent studies have highlighted that deep models are susceptible to adversarial attacks. One common approach is to embed a trigger in the training data to retrain the model, causing it to misclassify data samples into a target class, a phenomenon known as a backdoor attack. In this paper, we first evaluate the robustness of the VMamba model against existing backdoor attacks. Based on this evaluation, we introduce a novel architectural backdoor attack, termed BadScan, designed to deceive the VMamba model. This attack utilizes bit plane slicing to create visually imperceptible backdoored images. During testing, if a trigger is detected by performing XOR operations between the $k^{th}$ bit planes of the modified triggered patches, the traditional 2D selective scan (SS2D) mechanism in the visual state space (VSS) block of VMamba is replaced with our newly designed BadScan block, which incorporates four newly developed scanning patterns. We demonstrate that the BadScan backdoor attack represents a significant threat to visual state space models and remains effective even after complete retraining from scratch. Experimental results on two widely used image classification datasets, CIFAR-10, and ImageNet-1K, reveal that while visual state space models generally exhibit robustness against current backdoor attacks, the BadScan attack is particularly effective, achieving a higher Triggered Accuracy Ratio (TAR) in misleading the VMamba model and its variants.

Title: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling

Authors: Alexander Capstick, Rahul G. Krishnan, Payam Barnaghi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17284
Pdf URL: https://arxiv.org/pdf/2411.17284
Copy Paste: [[2411.17284]] Using Large Language Models for Expert Prior Elicitation in Predictive Modelling(https://arxiv.org/abs/2411.17284)
Keywords: large language model
Abstract: Large language models (LLMs), trained on diverse data effectively acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency hinder their direct application for specialised tasks. In fields such as clinical research, acquiring expert annotations or prior knowledge about predictive models is often costly and time-consuming. This study proposes using LLMs to elicit expert prior distributions for predictive models. This approach also provides an alternative to in-context learning, where language models are tasked with making predictions directly. We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation. Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings. Applied to clinical problems, this translates to fewer required biological samples, lowering cost and resources. Prior elicitation also consistently outperforms and proves more reliable than in-context learning at a lower cost, making it a preferred alternative in our setting. We demonstrate the utility of this method across various use cases, including clinical applications. For infection prediction, using LLM-elicited priors reduced the number of required labels to achieve the same accuracy as an uninformative prior by 55%, at 200 days earlier in the study.

Title: Privacy Preserving Federated Unsupervised Domain Adaptation with Application to Age Prediction from DNA Methylation Data

Authors: Cem Ata Baykara, Ali Burak Ünal, Nico Pfeifer, Mete Akgün
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17287
Pdf URL: https://arxiv.org/pdf/2411.17287
Copy Paste: [[2411.17287]] Privacy Preserving Federated Unsupervised Domain Adaptation with Application to Age Prediction from DNA Methylation Data(https://arxiv.org/abs/2411.17287)
Keywords: secure, privacy, protect, federate
Abstract: In computational biology, predictive models are widely used to address complex tasks, but their performance can suffer greatly when applied to data from different distributions. The current state-of-the-art domain adaptation method for high-dimensional data aims to mitigate these issues by aligning the input dependencies between training and test data. However, this approach requires centralized access to both source and target domain data, raising concerns about data privacy, especially when the data comes from multiple sources. In this paper, we introduce a privacy-preserving federated framework for unsupervised domain adaptation in high-dimensional settings. Our method employs federated training of Gaussian processes and weighted elastic nets to effectively address the problem of distribution shift between domains, while utilizing secure aggregation and randomized encoding to protect the local data of participating data owners. We evaluate our framework on the task of age prediction using DNA methylation data from multiple tissues, demonstrating that our approach performs comparably to existing centralized methods while maintaining data privacy, even in distributed environments where data is spread across multiple institutions. Our framework is the first privacy-preserving solution for high-dimensional domain adaptation in federated environments, offering a promising tool for fields like computational biology and medicine, where protecting sensitive data is essential.

Title: Task Progressive Curriculum Learning for Robust Visual Question Answering

Authors: Ahmed Akl, Abdelwahed Khamis, Zhe Wang, Ali Cheraghian, Sara Khalifa, Kewen Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17292
Pdf URL: https://arxiv.org/pdf/2411.17292
Copy Paste: [[2411.17292]] Task Progressive Curriculum Learning for Robust Visual Question Answering(https://arxiv.org/abs/2411.17292)
Keywords: robust
Abstract: Visual Question Answering (VQA) systems are known for their poor performance in out-of-distribution datasets. An issue that was addressed in previous works through ensemble learning, answer re-ranking, or artificially growing the training set. In this work, we show for the first time that robust Visual Question Answering is attainable by simply enhancing the training strategy. Our proposed approach, Task Progressive Curriculum Learning (TPCL), breaks the main VQA problem into smaller, easier tasks based on the question type. Then, it progressively trains the model on a (carefully crafted) sequence of tasks. We further support the method by a novel distributional-based difficulty measurer. Our approach is conceptually simple, model-agnostic, and easy to implement. We demonstrate TPCL effectiveness through a comprehensive evaluation on standard datasets. Without either data augmentation or explicit debiasing mechanism, it achieves state-of-the-art on VQA-CP v2, VQA-CP v1 and VQA v2 datasets. Extensive experiments demonstrate that TPCL outperforms the most competitive robust VQA approaches by more than 5% and 7% on VQA-CP v2 and VQA-CP v1; respectively. TPCL also can boost VQA baseline backbone performance by up to 28.5%.

Title: GrokFormer: Graph Fourier Kolmogorov-Arnold Transformers

Authors: Guoguo Ai, Guansong Pang, Hezhe Qiao, Yuan Gao, Hui Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17296
Pdf URL: https://arxiv.org/pdf/2411.17296
Copy Paste: [[2411.17296]] GrokFormer: Graph Fourier Kolmogorov-Arnold Transformers(https://arxiv.org/abs/2411.17296)
Keywords: transformer
Abstract: Graph Transformers (GTs) have demonstrated remarkable performance in incorporating various graph structure information, e.g., long-range structural dependency, into graph representation learning. However, self-attention -- the core module of GTs -- preserves only low-frequency signals on graph features, retaining only homophilic patterns that capture similar features among the connected nodes. Consequently, it has insufficient capacity in modeling complex node label patterns, such as the opposite of homophilic patterns -- heterophilic patterns. Some improved GTs deal with the problem by learning polynomial filters or performing self-attention over the first-order graph spectrum. However, these GTs either ignore rich information contained in the whole spectrum or neglect higher-order spectrum information, resulting in limited flexibility and frequency response in their spectral filters. To tackle these challenges, we propose a novel GT network, namely Graph Fourier Kolmogorov-Arnold Transformers (GrokFormer), to go beyond the self-attention in GTs. GrokFormer leverages learnable activation functions in order-$K$ graph spectrum through Fourier series modeling to i) learn eigenvalue-targeted filter functions producing learnable base that can capture a broad range of frequency signals flexibly, and ii) extract first- and higher-order graph spectral information adaptively. In doing so, GrokFormer can effectively capture intricate patterns hidden across different orders and levels of frequency signals, learning expressive, order-and-frequency-adaptive graph representations. Comprehensive experiments conducted on 10 node classification datasets across various domains, scales, and levels of graph heterophily, as well as 5 graph classification datasets, demonstrate that GrokFormer outperforms state-of-the-art GTs and other advanced graph neural networks.

Title: ER2Score: LLM-based Explainable and Customizable Metric for Assessing Radiology Reports with Reward-Control Loss

Authors: Yunyi Liu, Yingshu Li, Zhanyu Wang, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17301
Pdf URL: https://arxiv.org/pdf/2411.17301
Copy Paste: [[2411.17301]] ER2Score: LLM-based Explainable and Customizable Metric for Assessing Radiology Reports with Reward-Control Loss(https://arxiv.org/abs/2411.17301)
Keywords: interpretability
Abstract: Automated radiology report generation (R2Gen) has advanced significantly, introducing challenges in accurate evaluation due to its complexity. Traditional metrics often fall short by relying on rigid word-matching or focusing only on pathological entities, leading to inconsistencies with human assessments. To bridge this gap, we introduce ER2Score, an automatic evaluation metric designed specifically for R2Gen. Our metric utilizes a reward model, guided by our margin-based reward enforcement loss, along with a tailored training data design that enables customization of evaluation criteria to suit user-defined needs. It not only scores reports according to user-specified criteria but also provides detailed sub-scores, enhancing interpretability and allowing users to adjust the criteria between different aspects of reports. Leveraging GPT-4, we designed an easy-to-use data generation pipeline, enabling us to produce extensive training data based on two distinct scoring systems, each containing reports of varying quality along with corresponding scores. These GPT-generated reports are then paired as accepted and rejected samples through our pairing rule to train an LLM towards our fine-grained reward model, which assigns higher rewards to the report with high quality. Our reward-control loss enables this model to simultaneously output multiple individual rewards corresponding to the number of evaluation criteria, with their summation as our final ER2Score. Our experiments demonstrate ER2Score's heightened correlation with human judgments and superior performance in model selection compared to traditional metrics. Notably, our model provides both an overall score and individual scores for each evaluation item, enhancing interpretability. We also demonstrate its flexible training across various evaluation systems.

Title: Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning

Authors: Milena Chadimová, Eduard Jurášek, Tomáš Kliegr
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17304
Pdf URL: https://arxiv.org/pdf/2411.17304
Copy Paste: [[2411.17304]] Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning(https://arxiv.org/abs/2411.17304)
Keywords: extraction, large language model
Abstract: This paper introduces a novel method, referred to as "hashing", which involves masking potentially bias-inducing words in large language models (LLMs) with hash-like meaningless identifiers to reduce cognitive biases and reliance on external knowledge. The method was tested across three sets of experiments involving a total of 490 prompts. Statistical analysis using chi-square tests showed significant improvements in all tested scenarios, which covered LLama, ChatGPT, Copilot, Gemini and Mixtral models. In the first experiment, hashing decreased the fallacy rate in a modified version of the "Linda" problem aimed at evaluating susceptibility to cognitive biases. In the second experiment, it improved LLM results on the frequent itemset extraction task. In the third experiment, we found hashing is also effective when the Linda problem is presented in a tabular format rather than text, indicating that the technique works across various input representations. Overall, the method was shown to improve bias reduction and incorporation of external knowledge. Despite bias reduction, hallucination rates were inconsistently reduced across types of LLM models. These findings suggest that masking bias-inducing terms can improve LLM performance, although its effectiveness is model- and task-dependent.

Title: in-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice

Authors: Vedrana Krivokuca Hahn, Jeremy Maceiras, Alain Komaty, Philip Abbet, Sebastien Marcel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17305
Pdf URL: https://arxiv.org/pdf/2411.17305
Copy Paste: [[2411.17305]] in-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice(https://arxiv.org/abs/2411.17305)
Keywords: attack, biometric
Abstract: We present three biometric datasets (iCarB-Face, iCarB-Fingerprint, iCarB-Voice) containing face videos, fingerprint images, and voice samples, collected inside a car from 200 consenting volunteers. The data was acquired using a near-infrared camera, two fingerprint scanners, and two microphones, while the volunteers were seated in the driver's seat of the car. The data collection took place while the car was parked both indoors and outdoors, and different "noises" were added to simulate non-ideal biometric data capture that may be encountered in real-life driver recognition. Although the datasets are specifically tailored to in-vehicle biometric recognition, their utility is not limited to the automotive environment. The iCarB datasets, which are available to the research community, can be used to: (i) evaluate and benchmark face, fingerprint, and voice recognition systems (we provide several evaluation protocols); (ii) create multimodal pseudo-identities, to train/test multimodal fusion algorithms; (iii) create Presentation Attacks from the biometric data, to evaluate Presentation Attack Detection algorithms; (iv) investigate demographic and environmental biases in biometric systems, using the provided metadata. To the best of our knowledge, ours are the largest and most diverse publicly available in-vehicle biometric datasets. Most other datasets contain only one biometric modality (usually face), while our datasets consist of three modalities, all acquired in the same automotive environment. Moreover, iCarB-Fingerprint seems to be the first publicly available in-vehicle fingerprint dataset. Finally, the iCarB datasets boast a rare level of demographic diversity among the 200 data subjects, including a 50/50 gender split, skin colours across the whole Fitzpatrick-scale spectrum, and a wide age range (18-60+). So, these datasets will be valuable for advancing biometrics research.

Title: Reward Incremental Learning in Text-to-Image Generation

Authors: Maorong Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17310
Pdf URL: https://arxiv.org/pdf/2411.17310
Copy Paste: [[2411.17310]] Reward Incremental Learning in Text-to-Image Generation(https://arxiv.org/abs/2411.17310)
Keywords: diffusion
Abstract: The recent success of denoising diffusion models has significantly advanced text-to-image generation. While these large-scale pretrained models show excellent performance in general image synthesis, downstream objectives often require fine-tuning to meet specific criteria such as aesthetics or human preference. Reward gradient-based strategies are promising in this context, yet existing methods are limited to single-reward tasks, restricting their applicability in real-world scenarios that demand adapting to multiple objectives introduced incrementally over time. In this paper, we first define this more realistic and unexplored problem, termed Reward Incremental Learning (RIL), where models are desired to adapt to multiple downstream objectives incrementally. Additionally, while the models adapt to the ever-emerging new objectives, we observe a unique form of catastrophic forgetting in diffusion model fine-tuning, affecting both metric-wise and visual structure-wise image quality. To address this catastrophic forgetting challenge, we propose Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead, enabling stable performance across sequential reward tasks. The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios. The source code of our work will be publicly available upon acceptance.

Title: A Framework for the Security and Privacy of Biometric System Constructions under Defined Computational Assumptions

Authors: Sam Grierson, William J Buchanan, Craig Thomson, Baraq Galeb, Chris Eckl
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.17321
Pdf URL: https://arxiv.org/pdf/2411.17321
Copy Paste: [[2411.17321]] A Framework for the Security and Privacy of Biometric System Constructions under Defined Computational Assumptions(https://arxiv.org/abs/2411.17321)
Keywords: secure, security, privacy, biometric
Abstract: Biometric systems, while offering convenient authentication, often fall short in providing rigorous security assurances. A primary reason is the ad-hoc design of protocols and components, which hinders the establishment of comprehensive security proofs. This paper introduces a formal framework for constructing secure and privacy-preserving biometric systems. By leveraging the principles of universal composability, we enable the modular analysis and verification of individual system components. This approach allows us to derive strong security and privacy properties for the entire system, grounded in well-defined computational assumptions.

Title: InsightEdit: Towards Better Instruction Following for Image Editing

Authors: Yingjing Xu, Jie Kong, Jiazhi Wang, Xiao Pan, Bo Lin, Qiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17323
Pdf URL: https://arxiv.org/pdf/2411.17323
Copy Paste: [[2411.17323]] InsightEdit: Towards Better Instruction Following for Image Editing(https://arxiv.org/abs/2411.17323)
Keywords: diffusion, large language model
Abstract: In this paper, we focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations still remain: First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on the text while the rich image information is underexplored, therefore inferior in complex instruction following and maintaining background consistency. Targeting these issues, we first curated the AdvancedEdit dataset using a novel data construction pipeline, formulating a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism utilizing both the textual and visual features reasoned by the powerful Multimodal Large Language Models (MLLM) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.

Title: MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension

Authors: Zeyu Ling, Bo Han, Shiyang Li, Hongdeng Shen, Jikang Cheng, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17335
Pdf URL: https://arxiv.org/pdf/2411.17335
Copy Paste: [[2411.17335]] MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension(https://arxiv.org/abs/2411.17335)
Keywords: large language model
Abstract: This paper introduces MotionLLaMA, a unified framework for motion synthesis and comprehension, along with a novel full-body motion tokenizer called the HoMi Tokenizer. MotionLLaMA is developed based on three core principles. First, it establishes a powerful unified representation space through the HoMi Tokenizer. Using a single codebook, the HoMi Tokenizer in MotionLLaMA achieves reconstruction accuracy comparable to residual vector quantization tokenizers utilizing six codebooks, outperforming all existing single-codebook tokenizers. Second, MotionLLaMA integrates a large language model to tackle various motion-related tasks. This integration bridges various modalities, facilitating both comprehensive and intricate motion synthesis and comprehension. Third, MotionLLaMA introduces the MotionHub dataset, currently the most extensive multimodal, multitask motion dataset, which enables fine-tuning of large language models. Extensive experimental results demonstrate that MotionLLaMA not only covers the widest range of motion-related tasks but also achieves state-of-the-art (SOTA) performance in motion completion, interaction dual-person text-to-motion, and all comprehension tasks while reaching performance comparable to SOTA in the remaining tasks. The code and MotionHub dataset are publicly available.

Title: Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach

Authors: Changgeon Ko, Jisu Shin, Hoyun Song, Jeongyeon Seo, Jong C. Park
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2411.17338
Pdf URL: https://arxiv.org/pdf/2411.17338
Copy Paste: [[2411.17338]] Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach(https://arxiv.org/abs/2411.17338)
Keywords: large language model
Abstract: Large language models (LLMs) often reflect real-world biases, leading to efforts to mitigate these effects and make the models unbiased. Achieving this goal requires defining clear criteria for an unbiased state, with any deviation from these criteria considered biased. Some studies define an unbiased state as equal treatment across diverse demographic groups, aiming for balanced outputs from LLMs. However, differing perspectives on equality and the importance of pluralism make it challenging to establish a universal standard. Alternatively, other approaches propose using fact-based criteria for more consistent and objective evaluations, though these methods have not yet been fully applied to LLM bias assessments. Thus, there is a need for a metric with objective criteria that offers a distinct perspective from equality-based approaches. Motivated by this need, we introduce a novel metric to assess bias using fact-based criteria and real-world statistics. In this paper, we conducted a human survey demonstrating that humans tend to perceive LLM outputs more positively when they align closely with real-world demographic distributions. Evaluating various LLMs with our proposed metric reveals that model bias varies depending on the criteria used, highlighting the need for multi-perspective assessment.

Title: Assessing Vulnerability in Smart Contracts: The Role of Code Complexity Metrics in Security Analysis

Authors: Masoud Jamshidiyan Tehrani, Sattar Hashemi
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2411.17343
Pdf URL: https://arxiv.org/pdf/2411.17343
Copy Paste: [[2411.17343]] Assessing Vulnerability in Smart Contracts: The Role of Code Complexity Metrics in Security Analysis(https://arxiv.org/abs/2411.17343)
Keywords: security
Abstract: Codes with specific characteristics are more exposed to security vulnerabilities. Studies have revealed that codes that do not adhere to best practices are more challenging to verify and maintain, increasing the likelihood of unnoticed or unintentionally introduced vulnerabilities. Given the crucial role of smart contracts in blockchain systems, ensuring their security and conducting thorough vulnerability analysis is critical. This study investigates the use of code complexity metrics as indicators of vulnerable code in Solidity smart contracts. We highlight the significance of complexity metrics as valuable complementary features for vulnerability assessment and provide insights into the individual power of each metric. By analyzing 21 complexity metrics, we explored their interrelation, association with vulnerability, discriminative power, and mean values in vulnerable versus neutral codes. The results revealed some high correlations and potential redundancies among certain metrics, but weak correlations between each independent metric and vulnerability. Nevertheless, we found that all metrics can effectively discriminate between vulnerable and neutral codes, and most complexity metrics, except for three, exhibited higher values in vulnerable codes.

Title: Real-Time Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee

Authors: Filippo Ansalone, Flavio Maiorana, Daniele Affinita, Flavio Volpi, Eugenio Bugli, Francesco Petri, Michele Brienza, Valerio Spagnoli, Vincenzo Suriani, Daniele Nardi, Domenico D. Bloisi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.17347
Pdf URL: https://arxiv.org/pdf/2411.17347
Copy Paste: [[2411.17347]] Real-Time Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee(https://arxiv.org/abs/2411.17347)
Keywords: extraction
Abstract: Advancing human-robot communication is crucial for autonomous systems operating in dynamic environments, where accurate real-time interpretation of human signals is essential. RoboCup provides a compelling scenario for testing these capabilities, requiring robots to understand referee gestures and whistle with minimal network reliance. Using the NAO robot platform, this study implements a two-stage pipeline for gesture recognition through keypoint extraction and classification, alongside continuous convolutional neural networks (CCNNs) for efficient whistle detection. The proposed approach enhances real-time human-robot interaction in a competitive setting like RoboCup, offering some tools to advance the development of autonomous systems capable of cooperating with humans.

Title: Joint Combinatorial Node Selection and Resource Allocations in the Lightning Network using Attention-based Reinforcement Learning

Authors: Mahdi Salahshour, Amirahmad Shafiee, Mojtaba Tefagh
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2411.17353
Pdf URL: https://arxiv.org/pdf/2411.17353
Copy Paste: [[2411.17353]] Joint Combinatorial Node Selection and Resource Allocations in the Lightning Network using Attention-based Reinforcement Learning(https://arxiv.org/abs/2411.17353)
Keywords: transformer
Abstract: The Lightning Network (LN) has emerged as a second-layer solution to Bitcoin's scalability challenges. The rise of Payment Channel Networks (PCNs) and their specific mechanisms incentivize individuals to join the network for profit-making opportunities. According to the latest statistics, the total value locked within the Lightning Network is approximately \$500 million. Meanwhile, joining the LN with the profit-making incentives presents several obstacles, as it involves solving a complex combinatorial problem that encompasses both discrete and continuous control variables related to node selection and resource allocation, respectively. Current research inadequately captures the critical role of resource allocation and lacks realistic simulations of the LN routing mechanism. In this paper, we propose a Deep Reinforcement Learning (DRL) framework, enhanced by the power of transformers, to address the Joint Combinatorial Node Selection and Resource Allocation (JCNSRA) problem. We have improved upon an existing environment by introducing modules that enhance its routing mechanism, thereby narrowing the gap with the actual LN routing system and ensuring compatibility with the JCNSRA problem. We compare our model against several baselines and heuristics, demonstrating its superior performance across various settings. Additionally, we address concerns regarding centralization in the LN by deploying our agent within the network and monitoring the centrality measures of the evolved graph. Our findings suggest not only an absence of conflict between LN's decentralization goals and individuals' revenue-maximization incentives but also a positive association between the two.

Title: DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering

Authors: Zhihui Zhang, Xiaoshuai Hao, Hanning Yuan, Lianhua Chi, Qi Guo, Qi Li, Ziqiang Yuan, Jinhui Pang, Yexin Li, Sijie Ruan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17354
Pdf URL: https://arxiv.org/pdf/2411.17354
Copy Paste: [[2411.17354]] DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering(https://arxiv.org/abs/2411.17354)
Keywords: robust
Abstract: Multi-view contrastive clustering (MVCC) has gained significant attention for generating consistent clustering structures from multiple views through contrastive learning. However, most existing MVCC methods create cross-views by combining any two views, leading to a high volume of unreliable pairs. Furthermore, these approaches often overlook discrepancies in multi-view representations, resulting in representation degeneration. To address these challenges, we introduce a novel model called Dual-Weighted Contrastive Learning (DWCL) for Multi-View Clustering. Specifically, to reduce the impact of unreliable cross-views, we introduce an innovative Best-Other (B-O) contrastive mechanism that enhances the representation of individual views at a low computational cost. Furthermore, we develop a dual weighting strategy that combines a view quality weight, reflecting the quality of each view, with a view discrepancy weight. This approach effectively mitigates representation degeneration by downplaying cross-views that are both low in quality and high in discrepancy. We theoretically validate the efficiency of the B-O contrastive mechanism and the effectiveness of the dual weighting strategy. Extensive experiments demonstrate that DWCL outperforms previous methods across eight multi-view datasets, showcasing superior performance and robustness in MVCC. Specifically, our method achieves absolute accuracy improvements of 5.4\% and 5.6\% compared to state-of-the-art methods on the Caltech6V7 and MSRCv1 datasets, respectively.

Title: SAM-MPA: Applying SAM to Few-shot Medical Image Segmentation using Mask Propagation and Auto-prompting

Authors: Jie Xu, Xiaokang Li, Chengyu Yue, Yuanyuan Wang, Yi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17363
Pdf URL: https://arxiv.org/pdf/2411.17363
Copy Paste: [[2411.17363]] SAM-MPA: Applying SAM to Few-shot Medical Image Segmentation using Mask Propagation and Auto-prompting(https://arxiv.org/abs/2411.17363)
Keywords: segmentation
Abstract: Medical image segmentation often faces the challenge of prohibitively expensive annotation costs. While few-shot learning offers a promising solution to alleviate this burden, conventional approaches still rely heavily on pre-training with large volumes of labeled data from known categories. To address this issue, we propose leveraging the Segment Anything Model (SAM), pre-trained on over 1 billion masks, thus circumventing the need for extensive domain-specific annotated data. In light of this, we developed SAM-MPA, an innovative SAM-based framework for few-shot medical image segmentation using Mask Propagation-based Auto-prompting. Initially, we employ k-centroid clustering to select the most representative examples for labelling to construct the support set. These annotated examples are registered to other images yielding deformation fields that facilitate the propagation of the mask knowledge to obtain coarse masks across the dataset. Subsequently, we automatically generate visual prompts based on the region and boundary expansion of the coarse mask, including points, box and a coarse mask. Finally, we can obtain the segmentation predictions by inputting these prompts into SAM and refine the results by post refinement module. We validate the performance of the proposed framework through extensive experiments conducted on two medical image datasets with different modalities. Our method achieves Dices of 74.53%, 94.36% on Breast US, Chest X-ray, respectively. Experimental results substantiate that SAM-MPA yields high-accuracy segmentations within 10 labeled examples, outperforming other state-of-the-art few-shot auto-segmentation methods. Our method enables the customization of SAM for any medical image dataset with a small number of labeled examples.

Title: Fairness And Performance In Harmony: Data Debiasing Is All You Need

Authors: Junhua Liu, Wendy Wan Yee Hui, Roy Ka-Wei Lee, Kwan Hui Lim
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2411.17374
Pdf URL: https://arxiv.org/pdf/2411.17374
Copy Paste: [[2411.17374]] Fairness And Performance In Harmony: Data Debiasing Is All You Need(https://arxiv.org/abs/2411.17374)
Keywords: fair
Abstract: Fairness in both machine learning (ML) predictions and human decisions is critical, with ML models prone to algorithmic and data bias, and human decisions affected by subjectivity and cognitive bias. This study investigates fairness using a real-world university admission dataset with 870 profiles, leveraging three ML models, namely XGB, Bi-LSTM, and KNN. Textual features are encoded with BERT embeddings. For individual fairness, we assess decision consistency among experts with varied backgrounds and ML models, using a consistency score. Results show ML models outperform humans in fairness by 14.08% to 18.79%. For group fairness, we propose a gender-debiasing pipeline and demonstrate its efficacy in removing gender-specific language without compromising prediction performance. Post-debiasing, all models maintain or improve their classification accuracy, validating the hypothesis that fairness and performance can coexist. Our findings highlight ML's potential to enhance fairness in admissions while maintaining high accuracy, advocating a hybrid approach combining human judgement and ML models.

Title: The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations

Authors: Theodora Worledge, Tatsunori Hashimoto, Carlos Guestrin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17375
Pdf URL: https://arxiv.org/pdf/2411.17375
Copy Paste: [[2411.17375]] The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations(https://arxiv.org/abs/2411.17375)
Keywords: large language model
Abstract: Across all fields of academic study, experts cite their sources when sharing information. While large language models (LLMs) excel at synthesizing information, they do not provide reliable citation to sources, making it difficult to trace and verify the origins of the information they present. In contrast, search engines make sources readily accessible to users and place the burden of synthesizing information on the user. Through a survey, we find that users prefer search engines over LLMs for high-stakes queries, where concerns regarding information provenance outweigh the perceived utility of LLM responses. To examine the interplay between verifiability and utility of information-sharing tools, we introduce the extractive-abstractive spectrum, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points. Search engines are extractive because they respond to queries with snippets of sources with links (citations) to the original webpages. LLMs are abstractive because they address queries with answers that synthesize and logically transform relevant information from training and in-context sources without reliable citation. We define five operating points that span the extractive-abstractive spectrum and conduct human evaluations on seven systems across four diverse query distributions that reflect real-world QA settings: web search, language simplification, multi-step reasoning, and medical advice. As outputs become more abstractive, we find that perceived utility improves by as much as 200%, while the proportion of properly cited sentences decreases by as much as 50% and users take up to 3 times as long to verify cited information. Our findings recommend distinct operating points for domain-specific LLM systems and our failure analysis informs approaches to high-utility LLM systems that empower users to verify information.

Title: RealTraj: Towards Real-World Pedestrian Trajectory Forecasting

Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17376
Pdf URL: https://arxiv.org/pdf/2411.17376
Copy Paste: [[2411.17376]] RealTraj: Towards Real-World Pedestrian Trajectory Forecasting(https://arxiv.org/abs/2411.17376)
Keywords: robust
Abstract: This paper jointly addresses three key limitations in conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach includes two training phases--self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data--to minimize data collection efforts. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that remains invariant in tracking noise by using past detections as inputs. Additionally, we pretrain the model using multiple pretext tasks, which enhance robustness and improve forecasting performance based solely on detection data. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, significantly reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art trajectory forecasting methods on multiple datasets.

Title: MFF-FTNet: Multi-scale Feature Fusion across Frequency and Temporal Domains for Time Series Forecasting

Authors: Yangyang Shi, Qianqian Ren, Yong Liu, Jianguo Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17382
Pdf URL: https://arxiv.org/pdf/2411.17382
Copy Paste: [[2411.17382]] MFF-FTNet: Multi-scale Feature Fusion across Frequency and Temporal Domains for Time Series Forecasting(https://arxiv.org/abs/2411.17382)
Keywords: robust, extraction
Abstract: Time series forecasting is crucial in many fields, yet current deep learning models struggle with noise, data sparsity, and capturing complex multi-scale patterns. This paper presents MFF-FTNet, a novel framework addressing these challenges by combining contrastive learning with multi-scale feature extraction across both frequency and time domains. MFF-FTNet introduces an adaptive noise augmentation strategy that adjusts scaling and shifting factors based on the statistical properties of the original time series data, enhancing model resilience to noise. The architecture is built around two complementary modules: a Frequency-Aware Contrastive Module (FACM) that refines spectral representations through frequency selection and contrastive learning, and a Complementary Time Domain Contrastive Module (CTCM) that captures both short- and long-term dependencies using multi-scale convolutions and feature fusion. A unified feature representation strategy enables robust contrastive learning across domains, creating an enriched framework for accurate forecasting. Extensive experiments on five real-world datasets demonstrate that MFF-FTNet significantly outperforms state-of-the-art models, achieving a 7.7% MSE improvement on multivariate tasks. These findings underscore MFF-FTNet's effectiveness in modeling complex temporal patterns and managing noise and sparsity, providing a comprehensive solution for both long- and short-term forecasting.

Title: AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

Authors: Ziyi Xu, Ziyao Huang, Juan Cao, Yong Zhang, Xiaodong Cun, Qing Shuai, Yuchen Wang, Linchao Bao, Jintao Li, Fan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17383
Pdf URL: https://arxiv.org/pdf/2411.17383
Copy Paste: [[2411.17383]] AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation(https://arxiv.org/abs/2411.17383)
Keywords: diffusion
Abstract: The automatic generation of anchor-style product promotion videos presents promising opportunities in online commerce, advertising, and consumer engagement. However, this remains a challenging task despite significant advancements in pose-guided human video generation. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Additionally, we introduce the HOI-region reweighting loss, a training objective that enhances the learning of object details. Extensive experiments demonstrate that our proposed system outperforms existing methods in preserving object appearance and shape awareness, while simultaneously maintaining consistency in human appearance and motion. Project page: this https URL

Title: Robust Bayesian Optimization via Localized Online Conformal Prediction

Authors: Dongwon Kim, Matteo Zecchin, Sangwoo Park, Joonhyuk Kang, Osvaldo Simeone
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.17387
Pdf URL: https://arxiv.org/pdf/2411.17387
Copy Paste: [[2411.17387]] Robust Bayesian Optimization via Localized Online Conformal Prediction(https://arxiv.org/abs/2411.17387)
Keywords: robust
Abstract: Bayesian optimization (BO) is a sequential approach for optimizing black-box objective functions using zeroth-order noisy observations. In BO, Gaussian processes (GPs) are employed as probabilistic surrogate models to estimate the objective function based on past observations, guiding the selection of future queries to maximize utility. However, the performance of BO heavily relies on the quality of these probabilistic estimates, which can deteriorate significantly under model misspecification. To address this issue, we introduce localized online conformal prediction-based Bayesian optimization (LOCBO), a BO algorithm that calibrates the GP model through localized online conformal prediction (CP). LOCBO corrects the GP likelihood based on predictive sets produced by LOCBO, and the corrected GP likelihood is then denoised to obtain a calibrated posterior distribution on the objective function. The likelihood calibration step leverages an input-dependent calibration threshold to tailor coverage guarantees to different regions of the input space. Under minimal noise assumptions, we provide theoretical performance guarantees for LOCBO's iterates that hold for the unobserved objective function. These theoretical findings are validated through experiments on synthetic and real-world optimization tasks, demonstrating that LOCBO consistently outperforms state-of-the-art BO algorithms in the presence of model misspecification.

Title: Can LLMs be Good Graph Judger for Knowledge Graph Construction?

Authors: Haoyu Huang, Chong Chen, Conghui He, Yang Li, Jiawei Jiang, Wentao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17388
Pdf URL: https://arxiv.org/pdf/2411.17388
Copy Paste: [[2411.17388]] Can LLMs be Good Graph Judger for Knowledge Graph Construction?(https://arxiv.org/abs/2411.17388)
Keywords: large language model
Abstract: In real-world scenarios, most of the data obtained from information retrieval (IR) system is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. The quality of constructed KGs may also impact the performance of some KG-dependent domains like GraphRAG systems and recommendation systems. Recently, Large Language Models (LLMs) have demonstrated impressive capabilities in addressing a wide range of natural language processing tasks. However, there are still challenges when utilizing LLMs to address the task of generating structured KGs. And we have identified three limitations with respect to existing KG construction methods. (1)There is a large amount of information and excessive noise in real-world documents, which could result in extracting messy information. (2)Native LLMs struggle to effectively extract accuracy knowledge from some domain-specific documents. (3)Hallucinations phenomenon cannot be overlooked when utilizing LLMs directly as an unsupervised method for constructing KGs. In this paper, we propose GraphJudger, a knowledge graph construction framework to address the aforementioned challenges. We introduce three innovative modules in our method, which are entity-centric iterative text denoising, knowledge aware instruction tuning and graph judgement, respectively. We seek to utilize the capacity of LLMs to function as a graph judger, a capability superior to their role only as a predictor for KG construction problems. Experiments conducted on two general text-graph pair datasets and one domain-specific text-graph pair dataset show superior performances compared to baseline methods. The code of our proposed method is available at this https URL.

Title: NumGrad-Pull: Numerical Gradient Guided Tri-plane Representation for Surface Reconstruction from Point Clouds

Authors: Ruikai Cui, Shi Qiu, Jiawei Liu, Saeed Anwar, Nick Barnes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17392
Pdf URL: https://arxiv.org/pdf/2411.17392
Copy Paste: [[2411.17392]] NumGrad-Pull: Numerical Gradient Guided Tri-plane Representation for Surface Reconstruction from Point Clouds(https://arxiv.org/abs/2411.17392)
Keywords: robust
Abstract: Reconstructing continuous surfaces from unoriented and unordered 3D points is a fundamental challenge in computer vision and graphics. Recent advancements address this problem by training neural signed distance functions to pull 3D location queries to their closest points on a surface, following the predicted signed distances and the analytical gradients computed by the network. In this paper, we introduce NumGrad-Pull, leveraging the representation capability of tri-plane structures to accelerate the learning of signed distance functions and enhance the fidelity of local details in surface reconstruction. To further improve the training stability of grid-based tri-planes, we propose to exploit numerical gradients, replacing conventional analytical computations. Additionally, we present a progressive plane expansion strategy to facilitate faster signed distance function convergence and design a data sampling strategy to mitigate reconstruction artifacts. Our extensive experiments across a variety of benchmarks demonstrate the effectiveness and robustness of our approach. Code is available at this https URL

Title: One Mind, Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models

Authors: Pengfei Cao, Yuheng Chen, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17401
Pdf URL: https://arxiv.org/pdf/2411.17401
Copy Paste: [[2411.17401]] One Mind, Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models(https://arxiv.org/abs/2411.17401)
Keywords: large language model
Abstract: Large language models (LLMs) have learned vast amounts of factual knowledge through self-supervised pre-training on large-scale corpora. Meanwhile, LLMs have also demonstrated excellent multilingual capabilities, which can express the learned knowledge in multiple languages. However, the knowledge storage mechanism in LLMs still remains mysterious. Some researchers attempt to demystify the factual knowledge in LLMs from the perspective of knowledge neurons, and subsequently discover language-agnostic knowledge neurons that store factual knowledge in a form that transcends language barriers. However, the preliminary finding suffers from two limitations: 1) High Uncertainty in Localization Results. Existing study only uses a prompt-based probe to localize knowledge neurons for each fact, while LLMs cannot provide consistent answers for semantically equivalent queries. Thus, it leads to inaccurate localization results with high uncertainty. 2) Lack of Analysis in More Languages. The study only analyzes language-agnostic knowledge neurons on English and Chinese data, without exploring more language families and languages. Naturally, it limits the generalizability of the findings. To address aforementioned problems, we first construct a new benchmark called Rephrased Multilingual LAMA (RML-LAMA), which contains high-quality cloze-style multilingual parallel queries for each fact. Then, we propose a novel method named Multilingual Integrated Gradients with Uncertainty Estimation (MATRICE), which quantifies the uncertainty across queries and languages during knowledge localization. Extensive experiments show that our method can accurately localize language-agnostic knowledge neurons. We also further investigate the role of language-agnostic knowledge neurons in cross-lingual knowledge editing, knowledge enhancement and new knowledge injection.

Title: CoA: Chain-of-Action for Generative Semantic Labels

Authors: Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17406
Pdf URL: https://arxiv.org/pdf/2411.17406
Copy Paste: [[2411.17406]] CoA: Chain-of-Action for Generative Semantic Labels(https://arxiv.org/abs/2411.17406)
Keywords: generative
Abstract: Recent advances in vision-language models (VLM) have demonstrated remarkable capability in image classification. These VLMs leverage a predefined set of categories to construct text prompts for zero-shot reasoning. However, in more open-ended domains like autonomous driving, using a predefined set of labels becomes impractical, as the semantic label space is unknown and constantly evolving. Additionally, fixed embedding text prompts often tend to predict a single label (while in reality, multiple labels commonly exist per image). In this paper, we introduce CoA, an innovative Chain-of-Action (CoA) method that generates labels aligned with all contextually relevant features of an image. CoA is designed based on the observation that enriched and valuable contextual information improves generative performance during inference. Traditional vision-language models tend to output singular and redundant responses. Therefore, we employ a tailored CoA to alleviate this problem. We first break down the generative labeling task into detailed actions and construct an CoA leading to the final generative objective. Each action extracts and merges key information from the previous action and passes the enriched information as context to the next action, ultimately improving the VLM in generating comprehensive and accurate semantic labels. We assess the effectiveness of CoA through comprehensive evaluations on widely-used benchmark datasets and the results demonstrate significant improvements across key performance metrics.

Title: Multimodal Outer Arithmetic Block Dual Fusion of Whole Slide Images and Omics Data for Precision Oncology

Authors: Omnia Alwazzan, Amaya Gallagher-Syed, Thomas Millner, Ioannis Patras, Silvia Marino, Gregory Slabaugh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17418
Pdf URL: https://arxiv.org/pdf/2411.17418
Copy Paste: [[2411.17418]] Multimodal Outer Arithmetic Block Dual Fusion of Whole Slide Images and Omics Data for Precision Oncology(https://arxiv.org/abs/2411.17418)
Keywords: interpretability
Abstract: Developing a central nervous system (CNS) tumor classifier by integrating DNA methylation data with Whole Slide Images (WSI) offers significant potential for enhancing diagnostic precision in neuropathology. Existing approaches typically integrate encoded omic data with histology only once - either at an early or late fusion stage - while reintroducing encoded omic data to create a dual fusion variant remains unexplored. Nevertheless, reintroduction of omic embeddings during early and late fusion enables the capture of complementary information from localized patch-level and holistic slide-level interactions, allowing boosted performance through advanced multimodal integration. To achieve this, we propose a dual fusion framework that integrates omic data at both early and late stages, fully leveraging its diagnostic strength. In the early fusion stage, omic embeddings are projected into a patch-wise latent space, generating omic-WSI embeddings that encapsulate per-patch molecular and morphological insights, effectively incorporating this information into the spatial representation of histology. These embeddings are refined with a multiple instance learning gated attention mechanism to attend to critical patches. In the late fusion stage, we reintroduce the omic data by fusing it with slide-level omic-WSI embeddings using a Multimodal Outer Arithmetic Block (MOAB), which richly intermingles features from both modalities, capturing their global correlations and complementarity. We demonstrate accurate CNS tumor subtyping across 20 fine-grained subtypes and validate our approach on benchmark datasets, achieving improved survival prediction on TCGA-BLCA and competitive performance on TCGA-BRCA compared to state-of-the-art methods. This dual fusion strategy enhances interpretability and classification performance, highlighting its potential for clinical diagnostics.

Title: DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters

Authors: Mingze Sun, Junhao Chen, Junting Dong, Yurun Chen, Xinyu Jiang, Shiwei Mao, Puhua Jiang, Jingbo Wang, Bo Dai, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17423
Pdf URL: https://arxiv.org/pdf/2411.17423
Copy Paste: [[2411.17423]] DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters(https://arxiv.org/abs/2411.17423)
Keywords: diffusion, generative
Abstract: Recent advances in generative models have enabled high-quality 3D character reconstruction from multi-modal. However, animating these generated characters remains a challenging task, especially for complex elements like garments and hair, due to the lack of large-scale datasets and effective rigging methods. To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. Unlike existing methods, DRiVE utilizes a 3D Gaussian representation, facilitating efficient animation and high-quality rendering. We further introduce GSDiff, a 3D Gaussian-based diffusion module that predicts joint positions as spatial distributions, overcoming the limitations of regression-based approaches. Extensive experiments demonstrate that DRiVE achieves precise rigging results, enabling realistic dynamics for clothing and hair, and surpassing previous methods in both quality and versatility. The code and dataset will be made public for academic use upon acceptance.

Title: Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps

Authors: Xue Xia, Randall Balestriero, Tao Zhang, Lorenz Hurni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17425
Pdf URL: https://arxiv.org/pdf/2411.17425
Copy Paste: [[2411.17425]] Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps(https://arxiv.org/abs/2411.17425)
Keywords: segmentation
Abstract: Tracking geographic entities from historical maps, such as buildings, offers valuable insights into cultural heritage, urbanization patterns, environmental changes, and various historical research endeavors. However, linking these entities across diverse maps remains a persistent challenge for researchers. Traditionally, this has been addressed through a two-step process: detecting entities within individual maps and then associating them via a heuristic-based post-processing step. In this paper, we propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS). This method significantly streamlines geographic entity alignment and enhances automation. However, acquiring high-quality, video-format training data for VIS models is prohibitively expensive, especially for historical maps that often contain hundreds or thousands of geographic entities. To mitigate this challenge, we explore self-supervised learning (SSL) techniques to enhance VIS performance on historical maps. We evaluate the performance of VIS models under different pretraining configurations and introduce a novel method for generating synthetic videos from unlabeled historical map images for pretraining. Our proposed self-supervised VIS method substantially reduces the need for manual annotation. Experimental results demonstrate the superiority of the proposed self-supervised VIS approach, achieving a 24.9\% improvement in AP and a 0.23 increase in F1 score compared to the model trained from scratch.

Title: Rewiring Techniques to Mitigate Oversquashing and Oversmoothing in GNNs: A Survey

Authors: Hugo Attali, Davide Buscaldi, Nathalie Pernelle
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17429
Pdf URL: https://arxiv.org/pdf/2411.17429
Copy Paste: [[2411.17429]] Rewiring Techniques to Mitigate Oversquashing and Oversmoothing in GNNs: A Survey(https://arxiv.org/abs/2411.17429)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their effectiveness is often constrained by two critical challenges: oversquashing, where the excessive compression of information from distant nodes results in significant information loss, and oversmoothing, where repeated message-passing iterations homogenize node representations, obscuring meaningful distinctions. These issues, intrinsically linked to the underlying graph structure, hinder information flow and constrain the expressiveness of GNNs. In this survey, we examine graph rewiring techniques, a class of methods designed to address these structural bottlenecks by modifying graph topology to enhance information diffusion. We provide a comprehensive review of state-of-the-art rewiring approaches, delving into their theoretical underpinnings, practical implementations, and performance trade-offs.

Title: Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2411.17440
Pdf URL: https://arxiv.org/pdf/2411.17440
Copy Paste: [[2411.17440]] Identity-Preserving Text-to-Video Generation by Frequency Decomposition(https://arxiv.org/abs/2411.17440)
Keywords: diffusion, transformer, generative
Abstract: Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.

Title: Maximally Separated Active Learning

Authors: Tejaswi Kasarla, Abhishek Jha, Faye Tervoort, Rita Cucchiara, Pascal Mettes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17444
Pdf URL: https://arxiv.org/pdf/2411.17444
Copy Paste: [[2411.17444]] Maximally Separated Active Learning(https://arxiv.org/abs/2411.17444)
Keywords: robust
Abstract: Active Learning aims to optimize performance while minimizing annotation costs by selecting the most informative samples from an unlabelled pool. Traditional uncertainty sampling often leads to sampling bias by choosing similar uncertain samples. We propose an active learning method that utilizes fixed equiangular hyperspherical points as class prototypes, ensuring consistent inter-class separation and robust feature representations. Our approach introduces Maximally Separated Active Learning (MSAL) for uncertainty sampling and a combined strategy (MSAL-D) for incorporating diversity. This method eliminates the need for costly clustering steps, while maintaining diversity through hyperspherical uniformity. We demonstrate strong performance over existing active learning techniques across five benchmark datasets, highlighting the method's effectiveness and integration ease. The code is available on GitHub.

Title: Support Vector Machine for Person Classification Using the EEG Signals

Authors: Naveenkumar G Venkataswamy, Masudul H Imtiaz
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.17446
Pdf URL: https://arxiv.org/pdf/2411.17446
Copy Paste: [[2411.17446]] Support Vector Machine for Person Classification Using the EEG Signals(https://arxiv.org/abs/2411.17446)
Keywords: security, attack, biometric
Abstract: User authentication is a pivotal element in security systems. Conventional methods including passwords, personal identification numbers, and identification tags are increasingly vulnerable to cyber-attacks. This paper suggests a paradigm shift towards biometric identification technology that leverages unique physiological or behavioral characteristics for user authenticity verification. Nevertheless, biometric solutions like fingerprints, iris patterns, facial and voice recognition are also susceptible to forgery and deception. We propose using Electroencephalogram (EEG) signals for individual identification to address this challenge. Derived from unique brain activities, these signals offer promising authentication potential and provide a novel means for liveness detection, thereby mitigating spoofing attacks. This study employs a public dataset initially compiled for fatigue analysis, featuring EEG data from 12 subjects recorded via an eight-channel OpenBCI helmet. This dataset extracts salient features from the EEG signals and trains a supervised multiclass Support Vector Machine classifier. Upon evaluation, the classifier model achieves a maximum accuracy of 92.9\%, leveraging ten features from each channel. Collectively, these findings highlight the viability of machine learning in implementing real-world, EEG-based biometric identification systems, thereby advancing user authentication technology.

Title: A Graph Neural Network deep-dive into successful counterattacks

Authors: Joris Bekkers, Amod Sahasrabudhe
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2411.17450
Pdf URL: https://arxiv.org/pdf/2411.17450
Copy Paste: [[2411.17450]] A Graph Neural Network deep-dive into successful counterattacks(https://arxiv.org/abs/2411.17450)
Keywords: attack
Abstract: A counterattack in soccer is a high speed, high intensity direct attack that can occur when a team transitions from a defensive state to an attacking state after regaining possession of the ball. The aim is to create a goal-scoring opportunity by convering a lot of ground with minimal passes before the opposing team can recover their defensive shape. The purpose of this research is to build gender-specific Graph Neural Networks to model the likelihood of a counterattack being successful and uncover what factors make them successful in professional soccer. These models are trained on a total of 20863 frames of synchronized on-ball event and spatiotemporal (broadcast) tracking data. This dataset is derived from 632 games of MLS (2022), NWSL (2022) and international soccer (2020-2022). With this data we demonstrate that gender-specific Graph Neural Networks outperform architecturally identical gender-ambiguous models in predicting the successful outcome of counterattacks. We show, using Permutation Feature Importance, that byline to byline speed, angle to the goal, angle to the ball and sideline to sideline speed are the node features with the highest impact on model performance. Additionally, we offer some illustrative examples on how to navigate the infinite solution search space to aid in identifying improvements for player decision making. This research is accompanied by an open-source repository containing all data and code, and it is also accompanied by an open-source Python package which simplifies converting spatiotemporal data into graphs. This package also facilitates testing, validation, training and prediction with this data. This should allow the reader to replicate and improve upon our research more easily.

Title: VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Authors: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17451
Pdf URL: https://arxiv.org/pdf/2411.17451
Copy Paste: [[2411.17451]] VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models(https://arxiv.org/abs/2411.17451)
Keywords: generative
Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models, demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B, struggle to surpass random-guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.

Title: PEFTGuard: Detecting Backdoor Attacks Against Parameter-Efficient Fine-Tuning

Authors: Zhen Sun, Tianshuo Cong, Yule Liu, Chenhao Lin, Xinlei He, Rongmao Chen, Xingshuo Han, Xinyi Huang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.17453
Pdf URL: https://arxiv.org/pdf/2411.17453
Copy Paste: [[2411.17453]] PEFTGuard: Detecting Backdoor Attacks Against Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2411.17453)
Keywords: security, defense, attack, robust, large language model
Abstract: Fine-tuning is an essential process to improve the performance of Large Language Models (LLMs) in specific domains, with Parameter-Efficient Fine-Tuning (PEFT) gaining popularity due to its capacity to reduce computational demands through the integration of low-rank adapters. These lightweight adapters, such as LoRA, can be shared and utilized on open-source platforms. However, adversaries could exploit this mechanism to inject backdoors into these adapters, resulting in malicious behaviors like incorrect or harmful outputs, which pose serious security risks to the community. Unfortunately, few of the current efforts concentrate on analyzing the backdoor patterns or detecting the backdoors in the adapters. To fill this gap, we first construct (and will release) PADBench, a comprehensive benchmark that contains 13,300 benign and backdoored adapters fine-tuned with various datasets, attack strategies, PEFT methods, and LLMs. Moreover, we propose PEFTGuard, the first backdoor detection framework against PEFT-based adapters. Extensive evaluation upon PADBench shows that PEFTGuard outperforms existing detection methods, achieving nearly perfect detection accuracy (100%) in most cases. Notably, PEFTGuard exhibits zero-shot transferability on three aspects, including different attacks, PEFT methods, and adapter ranks. In addition, we consider various adaptive attacks to demonstrate the high robustness of PEFTGuard. We further explore several possible backdoor mitigation defenses, finding fine-mixing to be the most effective method. We envision our benchmark and method can shed light on future LLM backdoor detection research.

Title: Spatially Visual Perception for End-to-End Robotic Learning

Authors: Travis Davies, Jiahuan Yan, Xiang Chen, Yu Tian, Yueting Zhuang, Yiqi Huang, Luhui Hu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2411.17458
Pdf URL: https://arxiv.org/pdf/2411.17458
Copy Paste: [[2411.17458]] Spatially Visual Perception for End-to-End Robotic Learning(https://arxiv.org/abs/2411.17458)
Keywords: robust
Abstract: Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on handling lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models experience performance collapse. Our findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.

Title: WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Authors: Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, Li Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17459
Pdf URL: https://arxiv.org/pdf/2411.17459
Copy Paste: [[2411.17459]] WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model(https://arxiv.org/abs/2411.17459)
Keywords: diffusion
Abstract: Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities of latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Wavelet transform can decompose videos into multiple frequency-domain components and improve the efficiency significantly, we thus propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality. Our code and models are available at this https URL.

Title: SoK: Decentralized AI (DeAI)

Authors: Zhipeng Wang, Rui Sun, Elizabeth Lui, Vatsal Shah, Xihan Xiong, Jiahao Sun, Davide Crapis, William Knottenbelt
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2411.17461
Pdf URL: https://arxiv.org/pdf/2411.17461
Copy Paste: [[2411.17461]] SoK: Decentralized AI (DeAI)(https://arxiv.org/abs/2411.17461)
Keywords: security, privacy, fair, large language model
Abstract: The centralization of Artificial Intelligence (AI) poses significant challenges, including single points of failure, inherent biases, data privacy concerns, and scalability issues. These problems are especially prevalent in closed-source large language models (LLMs), where user data is collected and used without transparency. To mitigate these issues, blockchain-based decentralized AI (DeAI) has emerged as a promising solution. DeAI combines the strengths of both blockchain and AI technologies to enhance the transparency, security, decentralization, and trustworthiness of AI systems. However, a comprehensive understanding of state-of-the-art DeAI development, particularly for active industry solutions, is still lacking. In this work, we present a Systematization of Knowledge (SoK) for blockchain-based DeAI solutions. We propose a taxonomy to classify existing DeAI protocols based on the model lifecycle. Based on this taxonomy, we provide a structured way to clarify the landscape of DeAI protocols and identify their similarities and differences. We analyze the functionalities of blockchain in DeAI, investigating how blockchain features contribute to enhancing the security, transparency, and trustworthiness of AI processes, while also ensuring fair incentives for AI data and model contributors. In addition, we identify key insights and research gaps in developing DeAI protocols, highlighting several critical avenues for future research.

Title: Learning 3D Representations from Procedural 3D Programs

Authors: Xuweiyi Chen, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17467
Pdf URL: https://arxiv.org/pdf/2411.17467
Copy Paste: [[2411.17467]] Learning 3D Representations from Procedural 3D Programs(https://arxiv.org/abs/2411.17467)
Keywords: segmentation
Abstract: Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.

Title: Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers

Authors: Fatemeh Nourilenjan Nokabadi, Jean-Francois Lalonde, Christian Gagné
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17468
Pdf URL: https://arxiv.org/pdf/2411.17468
Copy Paste: [[2411.17468]] Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers(https://arxiv.org/abs/2411.17468)
Keywords: attack, robust, transformer
Abstract: Adversarial perturbations aim to deceive neural networks into predicting inaccurate results. For visual object trackers, adversarial attacks have been developed to generate perturbations by manipulating the outputs. However, transformer trackers predict a specific bounding box instead of an object candidate list, which limits the applicability of many existing attack scenarios. To address this issue, we present a novel white-box approach to attack visual object trackers with transformer backbones using only one bounding box. From the tracker predicted bounding box, we generate a list of adversarial bounding boxes and compute the adversarial loss for those bounding boxes. Experimental results demonstrate that our simple yet effective attack outperforms existing attacks against several robust transformer trackers, including TransT-M, ROMTrack, and MixFormer, on popular benchmark tracking datasets such as GOT-10k, UAV123, and VOT2022STS.

Title: Towards Precise Scaling Laws for Video Diffusion Transformers

Authors: Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17470
Pdf URL: https://arxiv.org/pdf/2411.17470
Copy Paste: [[2411.17470]] Towards Precise Scaling Laws for Video Diffusion Transformers(https://arxiv.org/abs/2411.17470)
Keywords: diffusion, transformer
Abstract: Achieving optimal performance of video diffusion transformers within given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealed under practical inference cost constraints, achieving a better trade-off.

Title: Learning New Concepts, Remembering the Old: A Novel Continual Learning

Authors: Songning Lai, Mingqian Liao, Zhangyi Hu, Jiayu Yang, Wenshuo Chen, Yutao Yue
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17471
Pdf URL: https://arxiv.org/pdf/2411.17471
Copy Paste: [[2411.17471]] Learning New Concepts, Remembering the Old: A Novel Continual Learning(https://arxiv.org/abs/2411.17471)
Keywords: interpretability
Abstract: Concept Bottleneck Models (CBMs) enhance model interpretability by introducing human-understandable concepts within the architecture. However, existing CBMs assume static datasets, limiting their ability to adapt to real-world, continuously evolving data streams. To address this, we define a novel concept-incremental and class-incremental continual learning task for CBMs, enabling models to accumulate new concepts and classes over time while retaining previously learned knowledge. To achieve this, we propose CONceptual Continual Incremental Learning (CONCIL), a framework that prevents catastrophic forgetting by reformulating concept and decision layer updates as linear regression problems, thus eliminating the need for gradient-based updates. CONCIL requires only recursive matrix operations, making it computationally efficient and suitable for real-time and large-scale data applications. Experimental results demonstrate that CONCIL achieves "absolute knowledge memory" and outperforms traditional CBM methods in concept- and class-incremental settings, establishing a new benchmark for continual learning in CBMs.

Title: Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Authors: Eric Hanchen Jiang, Yasi Zhang, Zhi Zhang, Yixin Wan, Andrew Lizarraga, Shufan Li, Ying Nian Wu
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17472
Pdf URL: https://arxiv.org/pdf/2411.17472
Copy Paste: [[2411.17472]] Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory(https://arxiv.org/abs/2411.17472)
Keywords: robust, diffusion, generative
Abstract: Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images from textual prompts. Despite these advances, existing models struggle with complex prompts involving multiple objects and attributes, often misaligning modifiers with their corresponding nouns or neglecting certain elements. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding and a lack of robust generalization guarantees. Leveraging the PAC-Bayes framework, we propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment. We demonstrate the effectiveness of our method on standard benchmarks, achieving state-of-the-art results across multiple metrics. By integrating custom priors into the denoising process, our method enhances image quality and addresses long-standing challenges in T2I diffusion models, paving the way for more reliable and interpretable generative models.

Title: TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Authors: Xiaowen Ma, Zhenliang Ni, Xinghao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17473
Pdf URL: https://arxiv.org/pdf/2411.17473
Copy Paste: [[2411.17473]] TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba(https://arxiv.org/abs/2411.17473)
Keywords: transformer, segmentation
Abstract: Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot demonstrate performance that matches Convolution or Transformer-based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under Convolution-Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade-off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and efficient Laplace mixer, we build a series of tiny hybrid vision Mamba called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic segmentation, object detection and instance segmentation. In particular, TinyViM outperforms Convolution, Transformer and Mamba-based models with similar scales, and the throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at this https URL.

Title: COBRA: A Continual Learning Approach to Vision-Brain Understanding

Authors: Xuan-Bac Nguyen, Arabinda Kumar Choudhary, Pawan Sinha, Xin Li, Khoa Luu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17475
Pdf URL: https://arxiv.org/pdf/2411.17475
Copy Paste: [[2411.17475]] COBRA: A Continual Learning Approach to Vision-Brain Understanding(https://arxiv.org/abs/2411.17475)
Keywords: transformer
Abstract: Vision-Brain Understanding (VBU) aims to extract visual information perceived by humans from brain activity recorded through functional Magnetic Resonance Imaging (fMRI). Despite notable advancements in recent years, existing studies in VBU continue to face the challenge of catastrophic forgetting, where models lose knowledge from prior subjects as they adapt to new ones. Addressing continual learning in this field is, therefore, essential. This paper introduces a novel framework called Continual Learning for Vision-Brain (COBRA) to address continual learning in VBU. Our approach includes three novel modules: a Subject Commonality (SC) module, a Prompt-based Subject Specific (PSS) module, and a transformer-based module for fMRI, denoted as MRIFormer module. The SC module captures shared vision-brain patterns across subjects, preserving this knowledge as the model encounters new subjects, thereby reducing the impact of catastrophic forgetting. On the other hand, the PSS module learns unique vision-brain patterns specific to each subject. Finally, the MRIFormer module contains a transformer encoder and decoder that learns the fMRI features for VBU from common and specific patterns. In a continual learning setup, COBRA is trained in new PSS and MRIFormer modules for new subjects, leaving the modules of previous subjects unaffected. As a result, COBRA effectively addresses catastrophic forgetting and achieves state-of-the-art performance in both continual learning and vision-brain reconstruction tasks, surpassing previous methods.

Title: Time-Series Forecasting in Smart Manufacturing Systems: An Experimental Evaluation of the State-of-the-art Algorithms

Authors: Mojtaba A. Farahani, Fadi El Kalach, Austin Harper, M. R. McCormick, Ramy Harik, Thorsten Wuest
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17499
Pdf URL: https://arxiv.org/pdf/2411.17499
Copy Paste: [[2411.17499]] Time-Series Forecasting in Smart Manufacturing Systems: An Experimental Evaluation of the State-of-the-art Algorithms(https://arxiv.org/abs/2411.17499)
Keywords: robust, transformer
Abstract: TSF is growing in various domains including manufacturing. Although numerous TSF algorithms have been developed recently, the validation and evaluation of algorithms hold substantial value for researchers and practitioners and are missing. This study aims to fill this gap by evaluating the SoTA TSF algorithms on thirteen manufacturing datasets, focusing on their applicability in manufacturing. Each algorithm was selected based on its TSF category to ensure a representative set of algorithms. The evaluation includes different scenarios to evaluate the models using two problem categories and two forecasting horizons. To evaluate the performance, the WAPE was calculated, and additional post hoc analyses were conducted to assess the significance of observed differences. Only algorithms with codes from open-source libraries were utilized, and no hyperparameter tuning was done. This allowed us to evaluate the algorithms as "out-of-the-box" solutions that can be easily implemented, ensuring their usability within the manufacturing by practitioners with limited technical knowledge. This aligns to facilitate the adoption of these techniques in smart manufacturing systems. Based on the results, transformer and MLP-based architectures demonstrated the best performance with MLP-based architecture winning the most scenarios. For univariate TSF, PatchTST emerged as the most robust, particularly for long-term horizons, while for multivariate problems, MLP-based architectures like N-HITS and TiDE showed superior results. The study revealed that simpler algorithms like XGBoost could outperform complex algorithms in certain tasks. These findings challenge the assumption that more sophisticated models produce better results. Additionally, the research highlighted the importance of computational resource considerations, showing variations in runtime and memory usage across different algorithms.

Title: SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

Authors: Yijia Hong, Yuan-Chen Guo, Ran Yi, Yulong Chen, Yan-Pei Cao, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17515
Pdf URL: https://arxiv.org/pdf/2411.17515
Copy Paste: [[2411.17515]] SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates(https://arxiv.org/abs/2411.17515)
Keywords: diffusion
Abstract: Decomposing physically-based materials from images into their constituent properties remains challenging, particularly when maintaining both computational efficiency and physical consistency. While recent diffusion-based approaches have shown promise, they face substantial computational overhead due to multiple denoising steps and separate models for different material properties. We present SuperMat, a single-step framework that achieves high-quality material decomposition with one-step inference. This enables end-to-end training with perceptual and re-render losses while decomposing albedo, metallic, and roughness maps at millisecond-scale speeds. We further extend our framework to 3D objects through a UV refinement network, enabling consistent material estimation across viewpoints while maintaining efficiency. Experiments demonstrate that SuperMat achieves state-of-the-art PBR material decomposition quality while reducing inference time from seconds to milliseconds per image, and completes PBR material estimation for 3D objects in approximately 3 seconds.

Title: Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

Authors: Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17525
Pdf URL: https://arxiv.org/pdf/2411.17525
Copy Paste: [[2411.17525]] Pushing the Limits of Large Language Model Quantization via the Linearity Theorem(https://arxiv.org/abs/2411.17525)
Keywords: data-free, large language model
Abstract: Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels which match a given compression constraint in the medium-bitwidth regime, obtained by reduction to dynamic programming. On the practical side, we demonstrate improved accuracy-compression trade-offs on Llama-3.1 and 3.2-family models, as well as on Qwen-family models. Further, we show that our method can be efficiently supported in terms of GPU kernels at various batch sizes, advancing both data-free and non-uniform quantization for LLMs.

Title: HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving

Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe, M. Victoria Martínez, Unai Martínez-Corral
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.17530
Pdf URL: https://arxiv.org/pdf/2411.17530
Copy Paste: [[2411.17530]] HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving(https://arxiv.org/abs/2411.17530)
Keywords: robust, segmentation
Abstract: We present the updated version of the HSI-Drive dataset aimed at developing automated driving systems (ADS) using hyperspectral imaging (HSI). The v2.0 version includes new annotated images from videos recorded during winter and fall in real driving scenarios. Added to the spring and summer images included in the previous v1.1 version, the new dataset contains 752 images covering the four seasons. In this paper, we show the improvements achieved over previously published results obtained on the v1.1 dataset, showcasing the enhanced performance of models trained on the new v2.0 dataset. We also show the progress made in comprehensive scene understanding by experimenting with more capable image segmentation models. These models include new segmentation categories aimed at the identification of essential road safety objects such as the presence of vehicles and road signs, as well as highly vulnerable groups like pedestrians and cyclists. In addition, we provide evidence of the performance and robustness of the models when applied to segmenting HSI video sequences captured in various environments and conditions. Finally, for a correct assessment of the results described in this work, the constraints imposed by the processing platforms that can sensibly be deployed in vehicles for ADS must be taken into account. Thus, and although implementation details are out of the scope of this paper, we focus our research on the development of computationally efficient, lightweight ML models that can eventually operate at high throughput rates. The dataset and some examples of segmented videos are available in this https URL.

Title: FTMoMamba: Motion Generation with Frequency and Text State Space Models

Authors: Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17532
Pdf URL: https://arxiv.org/pdf/2411.17532
Copy Paste: [[2411.17532]] FTMoMamba: Motion Generation with Frequency and Text State Space Models(https://arxiv.org/abs/2411.17532)
Keywords: diffusion
Abstract: Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static pose (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, especially gaining the lowest FID of 0.181 (rather lower than 0.421 of MLD) on the HumanML3D dataset.

Title: IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation - An Enhanced Prototype-Guided Diffusion Framework

Authors: Anurag Shandilya, Swapnil Bhat, Akshat Gautam, Subhash Yadav, Siddharth Bhatt, Deval Mehta, Kshitij Jadhav
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17535
Pdf URL: https://arxiv.org/pdf/2411.17535
Copy Paste: [[2411.17535]] IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation - An Enhanced Prototype-Guided Diffusion Framework(https://arxiv.org/abs/2411.17535)
Keywords: diffusion, generative
Abstract: Generative models have proven to be very effective in generating synthetic medical images and find applications in downstream tasks such as enhancing rare disease datasets, long-tailed dataset augmentation, and scaling machine learning algorithms. For medical applications, the synthetically generated medical images by such models are still reasonable in quality when evaluated based on traditional metrics such as FID score, precision, and recall. However, these metrics fail to capture the medical/biological plausibility of the generated images. Human expert feedback has been used to get biological plausibility which demonstrates that these generated images have very low plausibility. Recently, the research community has further integrated this human feedback through Reinforcement Learning from Human Feedback(RLHF), which generates more medically plausible images. However, incorporating human feedback is a costly and slow process. In this work, we propose a novel approach to improve the medical plausibility of generated images without the need for human feedback. We introduce IMPROVE:Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework, a prototype-guided diffusion process for medical image generation and show that it substantially enhances the biological plausibility of the generated medical images without the need for any human feedback. We perform experiments on Bone Marrow and HAM10000 datasets and show that medical accuracy can be substantially increased without human feedback.

Title: Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning

Authors: Hoàng-Ân Lê, Paul Berg, Minh-Tan Pham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17536
Pdf URL: https://arxiv.org/pdf/2411.17536
Copy Paste: [[2411.17536]] Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning(https://arxiv.org/abs/2411.17536)
Keywords: segmentation
Abstract: Object detection and semantic segmentation are both scene understanding tasks yet they differ in data structure and information level. Object detection requires box coordinates for object instances while semantic segmentation requires pixel-wise class labels. Making use of one task's information to train the other would be beneficial for multi-task partially supervised learning where each training example is annotated only for a single task, having the potential to expand training sets with different-task datasets. This paper studies various weak losses for partially annotated data in combination with existing supervised losses. We propose Box-for-Mask and Mask-for-Box strategies, and their combination BoMBo, to distil necessary information from one task annotations to train the other. Ablation studies and experimental results on VOC and COCO datasets show favorable results for the proposed idea. Source code and data splits can be found at this https URL.

Title: AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments

Authors: Haitham S. Al-Sinani, Chris J. Mitchell
Subjects: cs.CR, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2411.17539
Pdf URL: https://arxiv.org/pdf/2411.17539
Copy Paste: [[2411.17539]] AI-Augmented Ethical Hacking: A Practical Examination of Manual Exploitation and Privilege Escalation in Linux Environments(https://arxiv.org/abs/2411.17539)
Keywords: security, privacy, attack, generative
Abstract: This study explores the application of generative AI (GenAI) within manual exploitation and privilege escalation tasks in Linux-based penetration testing environments, two areas critical to comprehensive cybersecurity assessments. Building on previous research into the role of GenAI in the ethical hacking lifecycle, this paper presents a hands-on experimental analysis conducted in a controlled virtual setup to evaluate the utility of GenAI in supporting these crucial, often manual, tasks. Our findings demonstrate that GenAI can streamline processes, such as identifying potential attack vectors and parsing complex outputs for sensitive data during privilege escalation. The study also identifies key benefits and challenges associated with GenAI, including enhanced efficiency and scalability, alongside ethical concerns related to data privacy, unintended discovery of vulnerabilities, and potential for misuse. This work contributes to the growing field of AI-assisted cybersecurity by emphasising the importance of human-AI collaboration, especially in contexts requiring careful decision-making, rather than the complete replacement of human input.

Title: Rapid Deployment of Domain-specific Hyperspectral Image Processors with Application to Autonomous Driving

Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe, Óscar Mata-Carballeira, M. Victoria Martínez
Subjects: cs.CV, cs.AI, cs.AR, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.17543
Pdf URL: https://arxiv.org/pdf/2411.17543
Copy Paste: [[2411.17543]] Rapid Deployment of Domain-specific Hyperspectral Image Processors with Application to Autonomous Driving(https://arxiv.org/abs/2411.17543)
Keywords: segmentation
Abstract: The article discusses the use of low cost System-On-Module (SOM) platforms for the implementation of efficient hyperspectral imaging (HSI) processors for application in autonomous driving. The work addresses the challenges of shaping and deploying multiple layer fully convolutional networks (FCN) for low-latency, on-board image semantic segmentation using resource- and power-constrained processing devices. The paper describes in detail the steps followed to redesign and customize a successfully trained HSI segmentation lightweight FCN that was previously tested on a high-end heterogeneous multiprocessing system-on-chip (MPSoC) to accommodate it to the constraints imposed by a low-cost SOM. This SOM features a lower-end but much cheaper MPSoC suitable for the deployment of automatic driving systems (ADS). In particular the article reports the data- and hardware-specific quantization techniques utilized to fit the FCN into a commercial fixed-point programmable AI coprocessor IP, and proposes a full customized post-training quantization scheme to reduce computation and storage costs without compromising segmentation accuracy.

Title: A Bilayer Segmentation-Recombination Network for Accurate Segmentation of Overlapping C. elegans

Authors: Mengqian Dinga, Jun Liua, Yang Luo, Jinshan Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17557
Pdf URL: https://arxiv.org/pdf/2411.17557
Copy Paste: [[2411.17557]] A Bilayer Segmentation-Recombination Network for Accurate Segmentation of Overlapping C. elegans(https://arxiv.org/abs/2411.17557)
Keywords: segmentation
Abstract: Caenorhabditis elegans (C. elegans) is an excellent model organism because of its short lifespan and high degree of homology with human genes, and it has been widely used in a variety of human health and disease models. However, the segmentation of C. elegans remains challenging due to the following reasons: 1) the activity trajectory of C. elegans is uncontrollable, and multiple nematodes often overlap, resulting in blurred boundaries of C. elegans. This makes it impossible to clearly study the life trajectory of a certain nematode; and 2) in the microscope images of overlapping C. elegans, the translucent tissues at the edges obscure each other, leading to inaccurate boundary segmentation. To solve these problems, a Bilayer Segmentation-Recombination Network (BR-Net) for the segmentation of C. elegans instances is proposed. The network consists of three parts: A Coarse Mask Segmentation Module (CMSM), a Bilayer Segmentation Module (BSM), and a Semantic Consistency Recombination Module (SCRM). The CMSM is used to extract the coarse mask, and we introduce a Unified Attention Module (UAM) in CMSM to make CMSM better aware of nematode instances. The Bilayer Segmentation Module (BSM) segments the aggregated C. elegans into overlapping and non-overlapping regions. This is followed by integration by the SCRM, where semantic consistency regularization is introduced to segment nematode instances more accurately. Finally, the effectiveness of the method is verified on the C. elegans dataset. The experimental results show that BR-Net exhibits good competitiveness and outperforms other recently proposed instance segmentation methods in processing C. elegans occlusion images.

Title: Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey

Authors: Jiayi Kuang, Jingyou Xie, Haohao Luo, Ronghao Li, Zhe Xu, Xianfeng Cheng, Yinghui Li, Xika Lin, Ying Shen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17558
Pdf URL: https://arxiv.org/pdf/2411.17558
Copy Paste: [[2411.17558]] Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey(https://arxiv.org/abs/2411.17558)
Keywords: extraction, large language model
Abstract: Visual Question Answering (VQA) is a challenge task that combines natural language processing and computer vision techniques and gradually becomes a benchmark test task in multimodal large language models (MLLMs). The goal of our survey is to provide an overview of the development of VQA and a detailed description of the latest models with high timeliness. This survey gives an up-to-date synthesis of natural language understanding of images and text, as well as the knowledge reasoning module based on image-question information on the core VQA tasks. In addition, we elaborate on recent advances in extracting and fusing modal information with vision-language pretraining models and multimodal large language models in VQA. We also exhaustively review the progress of knowledge reasoning in VQA by detailing the extraction of internal knowledge and the introduction of external knowledge. Finally, we present the datasets of VQA and different evaluation metrics and discuss possible directions for future work.

Title: RTL-Breaker: Assessing the Security of LLMs against Backdoor Attacks on HDL Code Generation

Authors: Lakshmi Likhitha Mankali, Jitendra Bhandari, Manaar Alam, Ramesh Karri, Michail Maniatakos, Ozgur Sinanoglu, Johann Knechtel
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2411.17569
Pdf URL: https://arxiv.org/pdf/2411.17569
Copy Paste: [[2411.17569]] RTL-Breaker: Assessing the Security of LLMs against Backdoor Attacks on HDL Code Generation(https://arxiv.org/abs/2411.17569)
Keywords: security, attack, robust, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable potential with code generation/completion tasks for hardware design. In fact, LLM-based hardware description language (HDL) code generation has enabled the industry to realize complex designs more quickly, reducing the time and effort required in the development cycle. However, the increased reliance on such automation introduces critical security risks. Notably, given that LLMs have to be trained on vast datasets of codes that are typically sourced from publicly available repositories (often without thorough validation), LLMs are susceptible to so-called data poisoning or backdoor attacks. Here, attackers inject malicious code for the training data, which can be carried over into the HDL code generated by LLMs. This threat vector can compromise the security and integrity of entire hardware systems. In this work, we propose RTL-Breaker, a novel backdoor attack framework on LLM-based HDL code generation. RTL-Breaker provides an in-depth analysis for essential aspects of this novel problem: 1) various trigger mechanisms versus their effectiveness for inserting malicious modifications, and 2) side-effects by backdoor attacks on code generation in general, i.e., impact on code quality. RTL-Breaker emphasizes the urgent need for more robust measures to safeguard against such attacks. Toward that end, we open-source our framework and all data.

Title: Learning Explainable Treatment Policies with Clinician-Informed Representations: A Practical Approach

Authors: Johannes O. Ferstad, Emily B. Fox, David Scheinker, Ramesh Johari
Subjects: cs.LG, cs.AI, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17570
Pdf URL: https://arxiv.org/pdf/2411.17570
Copy Paste: [[2411.17570]] Learning Explainable Treatment Policies with Clinician-Informed Representations: A Practical Approach(https://arxiv.org/abs/2411.17570)
Keywords: interpretability
Abstract: Digital health interventions (DHIs) and remote patient monitoring (RPM) have shown great potential in improving chronic disease management through personalized care. However, barriers like limited efficacy and workload concerns hinder adoption of existing DHIs; while limited sample sizes and lack of interpretability limit the effectiveness and adoption of purely black-box algorithmic DHIs. In this paper, we address these challenges by developing a pipeline for learning explainable treatment policies for RPM-enabled DHIs. We apply our approach in the real-world setting of RPM using a DHI to improve glycemic control of youth with type 1 diabetes. Our main contribution is to reveal the importance of clinical domain knowledge in developing state and action representations for effective, efficient, and interpretable targeting policies. We observe that policies learned from clinician-informed representations are significantly more efficacious and efficient than policies learned from black-box representations. This work emphasizes the importance of collaboration between ML researchers and clinicians for developing effective DHIs in the real world.

Title: A Distractor-Aware Memory for Visual Object Tracking with SAM2

Authors: Jovana Videnovic, Alan Lukezic, Matej Kristan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17576
Pdf URL: https://arxiv.org/pdf/2411.17576
Copy Paste: [[2411.17576]] A Distractor-Aware Memory for Visual Object Tracking with SAM2(https://arxiv.org/abs/2411.17576)
Keywords: robust, segmentation
Abstract: Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.

Title: From Fairness to Infinity: Outcome-Indistinguishable (Omni)Prediction in Evolving Graphs

Authors: Cynthia Dwork, Chris Hays, Nicole Immorlica, Juan C. Perdomo, Pranay Tankala
Subjects: cs.LG, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2411.17582
Pdf URL: https://arxiv.org/pdf/2411.17582
Copy Paste: [[2411.17582]] From Fairness to Infinity: Outcome-Indistinguishable (Omni)Prediction in Evolving Graphs(https://arxiv.org/abs/2411.17582)
Keywords: fair
Abstract: Professional networks provide invaluable entree to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously, with appropriate post- processing. We begin by observing that, by combining a slightly modified form of the online K29 star algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest. We apply these techniques to evolving graphs, obtaining online outcome-indistinguishable omnipredictors for rich -- possibly infinite -- sets of distinguishers that capture properties of pairs of nodes, and their neighborhoods. This yields, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.

Title: Pre-training for Action Recognition with Automatically Generated Fractal Datasets

Authors: Davyd Svyezhentsev, George Retsinas, Petros Maragos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17584
Pdf URL: https://arxiv.org/pdf/2411.17584
Copy Paste: [[2411.17584]] Pre-training for Action Recognition with Automatically Generated Fractal Datasets(https://arxiv.org/abs/2411.17584)
Keywords: privacy, generative
Abstract: In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemmed by the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at this https URL .

Title: Multi-Objective Reinforcement Learning for Automated Resilient Cyber Defence

Authors: Ross O'Driscoll, Claudia Hagen, Joe Bater, James M. Adams
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.17585
Pdf URL: https://arxiv.org/pdf/2411.17585
Copy Paste: [[2411.17585]] Multi-Objective Reinforcement Learning for Automated Resilient Cyber Defence(https://arxiv.org/abs/2411.17585)
Keywords: security, protect, attack
Abstract: Cyber-attacks pose a security threat to military command and control networks, Intelligence, Surveillance, and Reconnaissance (ISR) systems, and civilian critical national infrastructure. The use of artificial intelligence and autonomous agents in these attacks increases the scale, range, and complexity of this threat and the subsequent disruption they cause. Autonomous Cyber Defence (ACD) agents aim to mitigate this threat by responding at machine speed and at the scale required to address the problem. Sequential decision-making algorithms such as Deep Reinforcement Learning (RL) provide a promising route to create ACD agents. These algorithms focus on a single objective such as minimizing the intrusion of red agents on the network, by using a handcrafted weighted sum of rewards. This approach removes the ability to adapt the model during inference, and fails to address the many competing objectives present when operating and protecting these networks. Conflicting objectives, such as restoring a machine from a back-up image, must be carefully balanced with the cost of associated down-time, or the disruption to network traffic or services that might result. Instead of pursing a Single-Objective RL (SORL) approach, here we present a simple example of a multi-objective network defence game that requires consideration of both defending the network against red-agents and maintaining critical functionality of green-agents. Two Multi-Objective Reinforcement Learning (MORL) algorithms, namely Multi-Objective Proximal Policy Optimization (MOPPO), and Pareto-Conditioned Networks (PCN), are used to create two trained ACD agents whose performance is compared on our Multi-Objective Cyber Defence game. The benefits and limitations of MORL ACD agents in comparison to SORL ACD agents are discussed based on the investigations of this game.

Title: VideoDirector: Precise Video Editing via Text-to-Video Models

Authors: Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, Yulan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17592
Pdf URL: https://arxiv.org/pdf/2411.17592
Copy Paste: [[2411.17592]] VideoDirector: Precise Video Editing via Text-to-Video Models(https://arxiv.org/abs/2411.17592)
Keywords: diffusion, generative
Abstract: Despite the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tightly Spatial-temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-temporal Layout. The vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.

Title: What Differentiates Educational Literature? A Multimodal Fusion Approach of Transformers and Computational Linguistics

Authors: Jordan J. Bird
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17593
Pdf URL: https://arxiv.org/pdf/2411.17593
Copy Paste: [[2411.17593]] What Differentiates Educational Literature? A Multimodal Fusion Approach of Transformers and Computational Linguistics(https://arxiv.org/abs/2411.17593)
Keywords: transformer
Abstract: The integration of new literature into the English curriculum remains a challenge since educators often lack scalable tools to rapidly evaluate readability and adapt texts for diverse classroom needs. This study proposes to address this gap through a multimodal approach that combines transformer-based text classification with linguistic feature analysis to align texts with UK Key Stages. Eight state-of-the-art Transformers were fine-tuned on segmented text data, with BERT achieving the highest unimodal F1 score of 0.75. In parallel, 500 deep neural network topologies were searched for the classification of linguistic characteristics, achieving an F1 score of 0.392. The fusion of these modalities shows a significant improvement, with every multimodal approach outperforming all unimodal models. In particular, the ELECTRA Transformer fused with the neural network achieved an F1 score of 0.996. The proposed approach is finally encapsulated in a stakeholder-facing web application, providing non-technical stakeholder access to real-time insights on text complexity, reading difficulty, curriculum alignment, and recommendations for learning age range. The application empowers data-driven decision making and reduces manual workload by integrating AI-based recommendations into lesson planning for English literature.

Title: Can artificial intelligence predict clinical trial outcomes?

Authors: Shuyi Jin, Lu Chen, Hongru Ding, Meijie Wang, Lun Yu
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2411.17595
Pdf URL: https://arxiv.org/pdf/2411.17595
Copy Paste: [[2411.17595]] Can artificial intelligence predict clinical trial outcomes?(https://arxiv.org/abs/2411.17595)
Keywords: robust, large language model
Abstract: The increasing complexity and cost of clinical trials, particularly in the context of oncology and advanced therapies, pose significant challenges for drug development. This study evaluates the predictive capabilities of large language models (LLMs) such as GPT-3.5, GPT-4, and HINT in determining clinical trial outcomes. By leveraging a curated dataset of trials from this http URL, we compare the models' performance using metrics including balanced accuracy, specificity, recall, and Matthews Correlation Coefficient (MCC). Results indicate that GPT-4o demonstrates robust performance in early trial phases, achieving high recall but facing limitations in specificity. Conversely, the HINT model excels in recognizing negative outcomes, particularly in later trial phases, offering a balanced approach across diverse endpoints. Oncology trials, characterized by high complexity, remain challenging for all models. Additionally, trial duration and disease categories influence predictive performance, with longer durations and complex diseases such as neoplasms reducing accuracy. This study highlights the complementary strengths of LLMs and HINT, providing insights into optimizing predictive tools for clinical trial design and risk management. Future advancements in LLMs are essential to address current gaps in handling negative outcomes and complex domains.

Title: HyperSeg: Towards Universal Visual Segmentation with Large Language Model

Authors: Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17606
Pdf URL: https://arxiv.org/pdf/2411.17606
Copy Paste: [[2411.17606]] HyperSeg: Towards Universal Visual Segmentation with Large Language Model(https://arxiv.org/abs/2411.17606)
Keywords: large language model, segmentation
Abstract: This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

Title: Scaling Speech-Text Pre-training with Synthetic Interleaved Data

Authors: Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.17607
Pdf URL: https://arxiv.org/pdf/2411.17607
Copy Paste: [[2411.17607]] Scaling Speech-Text Pre-training with Synthetic Interleaved Data(https://arxiv.org/abs/2411.17607)
Keywords: large language model
Abstract: Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower sampling rates (e.g. 12.5Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken questions tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves competitive performance comparable to existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain.

Title: Modality-Incremental Learning with Disjoint Relevance Mapping Networks for Image-based Semantic Segmentation

Authors: Niharika Hegde, Shishir Muralidhara, René Schuster, Didier Stricker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17610
Pdf URL: https://arxiv.org/pdf/2411.17610
Copy Paste: [[2411.17610]] Modality-Incremental Learning with Disjoint Relevance Mapping Networks for Image-based Semantic Segmentation(https://arxiv.org/abs/2411.17610)
Keywords: robust, segmentation
Abstract: In autonomous driving, environment perception has significantly advanced with the utilization of deep learning techniques for diverse sensors such as cameras, depth sensors, or infrared sensors. The diversity in the sensor stack increases the safety and contributes to robustness against adverse weather and lighting conditions. However, the variance in data acquired from different sensors poses challenges. In the context of continual learning (CL), incremental learning is especially challenging for considerably large domain shifts, e.g. different sensor modalities. This amplifies the problem of catastrophic forgetting. To address this issue, we formulate the concept of modality-incremental learning and examine its necessity, by contrasting it with existing incremental learning paradigms. We propose the use of a modified Relevance Mapping Network (RMN) to incrementally learn new modalities while preserving performance on previously learned modalities, in which relevance maps are disjoint. Experimental results demonstrate that the prevention of shared connections in this approach helps alleviate the problem of forgetting within the constraints of a strict continual learning framework.

Title: Accelerating Vision Diffusion Transformers with Skip Branches

Authors: Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Cheng Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17616
Pdf URL: https://arxiv.org/pdf/2411.17616
Copy Paste: [[2411.17616]] Accelerating Vision Diffusion Transformers with Skip Branches(https://arxiv.org/abs/2411.17616)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiT), an emerging image and video generation model architecture, has demonstrated great potential because of its high generation quality and scalability properties. Despite the impressive performance, its practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. While feature caching across timesteps has proven effective in accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through empirical analysis of DiT feature dynamics, we identify that significant feature variation between DiT blocks presents a key challenge for feature reusability. To address this, we convert standard DiT into Skip-DiT with skip branches to enhance feature smoothness. Further, we introduce Skip-Cache which utilizes the skip branches to cache DiT features across timesteps at the inference time. We validated effectiveness of our proposal on different DiT backbones for video and image generation, showcasing skip branches to help preserve generation quality and achieve higher speedup. Experimental results indicate that Skip-DiT achieves a 1.5x speedup almost for free and a 2.2x speedup with only a minor reduction in quantitative metrics. Code is available at this https URL.

Title: Machine Learning and Multi-source Remote Sensing in Forest Carbon Stock Estimation: A Review

Authors: Autumn Nguyen, Sulagna Saha
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17624
Pdf URL: https://arxiv.org/pdf/2411.17624
Copy Paste: [[2411.17624]] Machine Learning and Multi-source Remote Sensing in Forest Carbon Stock Estimation: A Review(https://arxiv.org/abs/2411.17624)
Keywords: protect
Abstract: Quantifying forest carbon is crucial for informing decisions and policies that will protect the planet. Machine learning (ML) and remote sensing (RS) techniques have been used to do this task more effectively, yet there lacks a systematic review on the most recent ML methods and RS combinations, especially with the consideration of forest characteristics. This study systematically analyzed 25 papers meeting strict inclusion criteria from over 80 related studies, identifying 28 ML methods and key combinations of RS data. Random Forest had the most frequent appearance (88\% of studies), while Extreme Gradient Boosting showed superior performance in 75\% of the studies in which it was compared with other methods. Sentinel-1 emerged as the most utilized remote sensing source, with multi-sensor approaches (e.g., Sentinel-1, Sentinel-2, and LiDAR) proving especially effective. Our findings provide grounds for recommending best practices in integrating machine learning and remote sensing for accurate and scalable forest carbon stock estimation.

Title: Data-driven development of cycle prediction models for lithium metal batteries using multi modal mining

Authors: Jaewoong Lee, Junhee Woo, Sejin Kim, Cinthya Paulina, Hyunmin Park, Hee-Tak Kim, Steve Park, Jihan Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17625
Pdf URL: https://arxiv.org/pdf/2411.17625
Copy Paste: [[2411.17625]] Data-driven development of cycle prediction models for lithium metal batteries using multi modal mining(https://arxiv.org/abs/2411.17625)
Keywords: extraction, large language model
Abstract: Recent advances in data-driven research have shown great potential in understanding the intricate relationships between materials and their performances. Herein, we introduce a novel multi modal data-driven approach employing an Automatic Battery data Collector (ABC) that integrates a large language model (LLM) with an automatic graph mining tool, Material Graph Digitizer (MatGD). This platform enables state-of-the-art accurate extraction of battery material data and cyclability performance metrics from diverse textual and graphical data sources. From the database derived through the ABC platform, we developed machine learning models that can accurately predict the capacity and stability of lithium metal batteries, which is the first-ever model developed to achieve such predictions. Our models were also experimentally validated, confirming practical applicability and reliability of our data-driven approach.

Title: Learning Chemical Reaction Representation with Reactant-Product Alignment

Authors: Kaipeng Zeng, Xianbin Liu, Yu Zhang, Xiaokang Yang, Yaohui Jin, Yanyan Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17629
Pdf URL: https://arxiv.org/pdf/2411.17629
Copy Paste: [[2411.17629]] Learning Chemical Reaction Representation with Reactant-Product Alignment(https://arxiv.org/abs/2411.17629)
Keywords: robust
Abstract: Organic synthesis stands as a cornerstone of chemical industry. The development of robust machine learning models to support tasks associated with organic reactions is of significant interest. However, current methods rely on hand-crafted features or direct adaptations of model architectures from other domains, which lacks feasibility as data scales increase or overlook the rich chemical information inherent in reactions. To address these issues, this paper introduces {\modelname}, a novel chemical reaction representation learning model tailored for a variety of organic-reaction-related tasks. By integrating atomic correspondence between reactants and products, our model discerns the molecular transformations that occur during the reaction, thereby enhancing the comprehension of the reaction mechanism. We have designed an adapter structure to incorporate reaction conditions into the chemical reaction representation, allowing the model to handle diverse reaction conditions and adapt to various datasets and downstream tasks, e.g., reaction performance prediction. Additionally, we introduce a reaction-center aware attention mechanism that enables the model to concentrate on key functional groups, thereby generating potent representations for chemical reactions. Our model has been evaluated on a range of downstream tasks, including reaction condition prediction, reaction yield prediction, and reaction selectivity prediction. Experimental results indicate that our model markedly outperforms existing chemical reaction representation learning architectures across all tasks. Notably, our model significantly outperforms all the baselines with up to 25\% (top-1) and 16\% (top-10) increased accuracy over the strongest baseline on USPTO\_CONDITION dataset for reaction condition prediction. We plan to open-source the code contingent upon the acceptance of the paper.

Title: On Limitations of LLM as Annotator for Low Resource Languages

Authors: Suramya Jadhav, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Raviraj Joshi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17637
Pdf URL: https://arxiv.org/pdf/2411.17637
Copy Paste: [[2411.17637]] On Limitations of LLM as Annotator for Low Resource Languages(https://arxiv.org/abs/2411.17637)
Keywords: large language model
Abstract: Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate models and datasets, making it difficult to perform critical NLP tasks like sentiment analysis or hate speech detection. To bridge this gap, Large Language Models (LLMs) present an opportunity for potential annotators, capable of generating datasets and resources for these underrepresented languages. In this paper, we focus on Marathi, a low-resource language, and evaluate the performance of both closed-source and open-source LLMs as annotators. We assess models such as GPT-4o and Gemini 1.0 Pro, Gemma 2 (2B and 9B), and Llama 3.1 (8B) on classification tasks including sentiment analysis, news classification, and hate speech detection. Our findings reveal that while LLMs excel in annotation tasks for high-resource languages like English, they still fall short when applied to Marathi. Even advanced closed models like Gemini and GPT underperform in comparison to BERT-based baselines, highlighting the limitations of LLMs as annotators for low-resource languages.

Title: A robust image encryption scheme based on new 4-D hyperchaotic system and elliptic curve

Authors: Yehia Lalili, Toufik Bouden, Morad Grimes, Abderrazek Lachouri
Subjects: cs.CR, eess.IV
Abstract URL: https://arxiv.org/abs/2411.17643
Pdf URL: https://arxiv.org/pdf/2411.17643
Copy Paste: [[2411.17643]] A robust image encryption scheme based on new 4-D hyperchaotic system and elliptic curve(https://arxiv.org/abs/2411.17643)
Keywords: security, protect, robust, diffusion
Abstract: In this work, a new 4-D hyperchaotic system for image encryption is proposed and its effectiveness is demonstrated by incorporating it into an existing Elliptic Curve Cryptography (ECC) mapping scheme. The proposed system is considered simple because it consists of eight terms with two nonlinearities. The system exhibits high sensitivity to initial conditions, which makes it suitable for encryption purposes. The two-stage encryption process, involving confusion and diffusion, is employed to protect the confidentiality of digital images. The simulation results demonstrate the effectiveness of the hyperchaotic system in terms of security and performance when combined with the ECC mapping scheme. This approach can be applied in various domains including healthcare, military, and entertainment to ensure the robust encryption of digital images.

Title: Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset

Authors: Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17645
Pdf URL: https://arxiv.org/pdf/2411.17645
Copy Paste: [[2411.17645]] Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset(https://arxiv.org/abs/2411.17645)
Keywords: fair, interpretability
Abstract: The use of machine learning and AI on electronic health records (EHRs) holds substantial potential for clinical insight. However, this approach faces significant challenges due to data heterogeneity, sparsity, temporal misalignment, and limited labeled outcomes. In this context, we leverage a linked EHR dataset of approximately one million de-identified individuals from Bristol, North Somerset, and South Gloucestershire, UK, to characterize urinary tract infections (UTIs) and develop predictive models focused on data quality, fairness and transparency. A comprehensive data pre-processing and curation pipeline transforms the raw EHR data into a structured format suitable for AI modeling. Given the limited availability and biases of ground truth UTI outcomes, we introduce a UTI risk estimation framework informed by clinical expertise to estimate UTI risk across individual patient timelines. Using this framework, we built pairwise XGBoost models to differentiate UTI risk categories with explainable AI techniques to identify key predictors while ensuring interpretability. Our findings reveal differences in clinical and demographic factors across risk groups, offering insights into UTI risk stratification and progression. This study demonstrates the added value of AI-driven insights into UTI clinical decision-making while prioritizing interpretability, transparency, and fairness, underscoring the importance of sound data practices in advancing health outcomes.

Title: SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation

Authors: Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17646
Pdf URL: https://arxiv.org/pdf/2411.17646
Copy Paste: [[2411.17646]] SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation(https://arxiv.org/abs/2411.17646)
Keywords: robust, extraction, segmentation
Abstract: Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, that provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser, by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights, and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues in the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, SAMWISE, achieves state-of-the-art across various benchmarks, by adding a negligible overhead of just 4.2 M parameters. The code is available at this https URL

Title: DROID-Splat: Combining end-to-end SLAM with 3D Gaussian Splatting

Authors: Christian Homeyer, Leon Begiristain, Christoph Schnörr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17660
Pdf URL: https://arxiv.org/pdf/2411.17660
Copy Paste: [[2411.17660]] DROID-Splat: Combining end-to-end SLAM with 3D Gaussian Splatting(https://arxiv.org/abs/2411.17660)
Keywords: robust
Abstract: Recent progress in scene synthesis makes standalone SLAM systems purely based on optimizing hyperprimitives with a Rendering objective possible \cite{monogs}. However, the tracking performance still lacks behind traditional \cite{orbslam} and end-to-end SLAM systems \cite{droid}. An optimal trade-off between robustness, speed and accuracy has not yet been reached, especially for monocular video. In this paper, we introduce a SLAM system based on an end-to-end Tracker and extend it with a Renderer based on recent 3D Gaussian Splatting techniques. Our framework \textbf{DroidSplat} achieves both SotA tracking and rendering results on common SLAM benchmarks. We implemented multiple building blocks of modern SLAM systems to run in parallel, allowing for fast inference on common consumer GPU's. Recent progress in monocular depth prediction and camera calibration allows our system to achieve strong results even on in-the-wild data without known camera intrinsics. Code will be available at \url{this https URL}.

Title: Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Authors: Burak Suyunu, Enes Taylan, Arzucan Özgür
Subjects: cs.CL, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.17669
Pdf URL: https://arxiv.org/pdf/2411.17669
Copy Paste: [[2411.17669]] Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods(https://arxiv.org/abs/2411.17669)
Keywords: segmentation
Abstract: Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.

Title: Synthetic Data Generation with LLM for Improved Depression Prediction

Authors: Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, Shuhao Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17672
Pdf URL: https://arxiv.org/pdf/2411.17672
Copy Paste: [[2411.17672]] Synthetic Data Generation with LLM for Improved Depression Prediction(https://arxiv.org/abs/2411.17672)
Keywords: privacy, robust, large language model
Abstract: Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, with its exponential interest comes a growing concern for data privacy and scarcity due to the sensitivity of such a topic. In this paper, we propose a pipeline for Large Language Models (LLMs) to generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data from recorded transcripts of clinical interviews, we utilize an open-source LLM to generate synthetic data through chain-of-thought prompting. This pipeline involves two key steps: the first step is the generation of the synopsis and sentiment analysis based on the original transcript and depression score, while the second is the generation of the synthetic synopsis/sentiment analysis based on the summaries generated in the first step and a new depression score. Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset, thereby significantly enhancing the model's capability in predicting the intensity of the patient's depression. By leveraging LLMs to generate synthetic data that can be augmented to limited and imbalanced real-world datasets, we demonstrate a novel approach to addressing data scarcity and privacy concerns commonly faced in automatic depression detection, all while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.

Title: SketchAgent: Language-Driven Sequential Sketch Generation

Authors: Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17673
Pdf URL: https://arxiv.org/pdf/2411.17673
Copy Paste: [[2411.17673]] SketchAgent: Language-Driven Sequential Sketch Generation(https://arxiv.org/abs/2411.17673)
Keywords: large language model
Abstract: Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.

Title: Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting

Authors: Liyun Zhang, Dian Ding, Yu Lu, Yi-Chao Chen, Guangtao Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17674
Pdf URL: https://arxiv.org/pdf/2411.17674
Copy Paste: [[2411.17674]] Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting(https://arxiv.org/abs/2411.17674)
Keywords: large language model
Abstract: Understanding the emotions in a dialogue usually requires external knowledge to accurately understand the contents. As the LLMs become more and more powerful, we do not want to settle on the limited ability of the pre-trained language model. However, the LLMs either can only process text modality or are too expensive to process the multimedia in- formation. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a certain vanilla model by prompting large language models with receptive-field-aware attention weighting. This framework trained a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with its external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. The experiments in IEMO- CAP with 4-way and 6-way settings demonstrated that the Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.

Title: Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

Authors: Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17679
Pdf URL: https://arxiv.org/pdf/2411.17679
Copy Paste: [[2411.17679]] Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning(https://arxiv.org/abs/2411.17679)
Keywords: large language model, segmentation
Abstract: Tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) have significantly improved the computational efficiency and vocabulary representation stability of large language models (LLMs) by segmenting text into tokens. However, this segmentation often obscures the internal character structures and sequences within tokens, preventing models from fully learning these intricate details during training. Consequently, LLMs struggle to comprehend the character compositions and positional relationships within tokens, especially when fine-tuned on downstream tasks with limited data. In this paper, we introduce Token Internal Position Awareness (TIPA), a novel approach that enhances LLMs' understanding of internal token structures by training them on reverse character prediction tasks using the tokenizer's own vocabulary. This method enables models to effectively learn and generalize character positions and internal structures. Experimental results demonstrate that LLMs trained with TIPA outperform baseline models in predicting character positions at the token level. Furthermore, when applied to the downstream task of Chinese Spelling Correction (CSC), TIPA not only accelerates model convergence but also significantly improves task performance.

Title: RealSeal: Revolutionizing Media Authentication with Real-Time Realism Scoring

Authors: Bhaktipriya Radharapu, Harish Krishna
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17684
Pdf URL: https://arxiv.org/pdf/2411.17684
Copy Paste: [[2411.17684]] RealSeal: Revolutionizing Media Authentication with Real-Time Realism Scoring(https://arxiv.org/abs/2411.17684)
Keywords: security, robust, watermark
Abstract: The growing threat of deepfakes and manipulated media necessitates a radical rethinking of media authentication. Existing methods for watermarking synthetic data fall short, as they can be easily removed or altered, and current deepfake detection algorithms do not achieve perfect accuracy. Provenance techniques, which rely on metadata to verify content origin, fail to address the fundamental problem of staged or fake media. This paper introduces a groundbreaking paradigm shift in media authentication by advocating for the watermarking of real content at its source, as opposed to watermarking synthetic data. Our innovative approach employs multisensory inputs and machine learning to assess the realism of content in real-time and across different contexts. We propose embedding a robust realism score within the image metadata, fundamentally transforming how images are trusted and circulated. By combining established principles of human reasoning about reality, rooted in firmware and hardware security, with the sophisticated reasoning capabilities of contemporary machine learning systems, we develop a holistic approach that analyzes information from multiple perspectives. This ambitious, blue sky approach represents a significant leap forward in the field, pushing the boundaries of media authenticity and trust. By embracing cutting-edge advancements in technology and interdisciplinary research, we aim to establish a new standard for verifying the authenticity of digital media.

Title: Attamba: Attending To Multi-Token States

Authors: Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17685
Pdf URL: https://arxiv.org/pdf/2411.17685
Copy Paste: [[2411.17685]] Attamba: Attending To Multi-Token States(https://arxiv.org/abs/2411.17685)
Keywords: transformer
Abstract: When predicting the next token in a sequence, vanilla transformers compute attention over all previous tokens, resulting in quadratic scaling of compute with sequence length. State-space models compress the entire sequence of tokens into a fixed-dimensional representation to improve efficiency, while other architectures achieve sub-quadratic complexity via low-rank projections or sparse attention patterns over the sequence. In this paper, we introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens and applies attention on these compressed key-value representations. We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking, resulting in 24% improved perplexity with transformer of similar KV-Cache and attention footprint, and ~4 times smaller KV-Cache and Attention FLOPs for 5% perplexity trade-off. Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling, offering adaptable efficiency gains.

Title: Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Authors: Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17686
Pdf URL: https://arxiv.org/pdf/2411.17686
Copy Paste: [[2411.17686]] Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration(https://arxiv.org/abs/2411.17686)
Keywords: large language model
Abstract: To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified ''filter-correlate-compress'' paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods. Our project page is at this https URL.

Title: GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration

Authors: Sudarshan Rajagopalan, Nithin Gopalakrishnan Nair, Jay N. Paranjape, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17687
Pdf URL: https://arxiv.org/pdf/2411.17687
Copy Paste: [[2411.17687]] GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration(https://arxiv.org/abs/2411.17687)
Keywords: diffusion, generative
Abstract: Deep learning-based models for All-In-One Image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to those trained solely on existing datasets. Furthermore, we provide comprehensive analyses on the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.

Title: Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

Authors: Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17691
Pdf URL: https://arxiv.org/pdf/2411.17691
Copy Paste: [[2411.17691]] Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens(https://arxiv.org/abs/2411.17691)
Keywords: large language model
Abstract: We reveal that low-bit quantization favors undertrained large language models (LLMs) by observing that models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when applying low-bit quantization, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors such as the number of training tokens, model size and bit width. With the derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM's training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at this https URL.

Title: Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Authors: Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17693
Pdf URL: https://arxiv.org/pdf/2411.17693
Copy Paste: [[2411.17693]] Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats(https://arxiv.org/abs/2411.17693)
Keywords: large language model
Abstract: As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a "distributed threat setting" -- a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model's alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.

Title: ScribbleLight: Single Image Indoor Relighting with Scribbles

Authors: Jun Myeong Choi, Annie Wang, Pieter Peers, Anand Bhattad, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17696
Pdf URL: https://arxiv.org/pdf/2411.17696
Copy Paste: [[2411.17696]] ScribbleLight: Single Image Indoor Relighting with Scribbles(https://arxiv.org/abs/2411.17696)
Keywords: diffusion, generative
Abstract: Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (e.g., turning lights on/off, adding highlights, cast shadows, or indirect lighting from unseen lights) from sparse scribble annotations.

Title: StableAnimator: High-Quality Identity-Preserving Human Image Animation

Authors: Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17697
Pdf URL: https://arxiv.org/pdf/2411.17697
Copy Paste: [[2411.17697]] StableAnimator: High-Quality Identity-Preserving Human Image Animation(https://arxiv.org/abs/2411.17697)
Keywords: diffusion
Abstract: Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively and face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.