2024-12-02

Title: Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop

Authors: Zhaofang Qian, Abolfazl Sharifi, Tucker Carroll, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18644
Pdf URL: https://arxiv.org/pdf/2411.18644
Copy Paste: [[2411.18644]] Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop(https://arxiv.org/abs/2411.18644)
Keywords: large language model
Abstract: Video generation has achieved impressive quality, but it still suffers from artifacts such as temporal inconsistency and violation of physical laws. Leveraging 3D scenes can fundamentally resolve these issues by providing precise control over scene entities. To facilitate the easy generation of diverse photorealistic scenes, we propose Scene Copilot, a framework combining large language models (LLMs) with a procedural 3D scene generator. Specifically, Scene Copilot consists of Scene Codex, BlenderGPT, and Human in the loop. Scene Codex is designed to translate textual user input into commands understandable by the 3D scene generator. BlenderGPT provides users with an intuitive and direct way to precisely control the generated 3D scene and the final output video. Furthermore, users can utilize Blender UI to receive instant visual feedback. Additionally, we have curated a procedural dataset of objects in code format to further enhance our system's capabilities. Each component works seamlessly together to support users in generating desired 3D scenes. Extensive experiments demonstrate the capability of our framework in customizing 3D scenes and video generation.

Title: Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings

Authors: Jinyung Hong, Yearim Kim, Keun Hee Park, Sangyu Han, Nojun Kwak, Theodore P. Pavlic
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18645
Pdf URL: https://arxiv.org/pdf/2411.18645
Copy Paste: [[2411.18645]] Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings(https://arxiv.org/abs/2411.18645)
Keywords: interpretability, large language model
Abstract: Inner interpretability is a promising field focused on uncovering the internal mechanisms of AI systems and developing scalable, automated methods to understand these systems at a mechanistic level. While significant research has explored top-down approaches starting from high-level problems or algorithmic hypotheses and bottom-up approaches building higher-level abstractions from low-level or circuit-level descriptions, most efforts have concentrated on analyzing large language models. Moreover, limited attention has been given to applying inner interpretability to large-scale image tasks, primarily focusing on architectural and functional levels to visualize learned concepts. In this paper, we first present a conceptual framework that supports inner interpretability and multilevel analysis for large-scale image classification tasks. We introduce the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module, which facilitates interpretability across the computational, algorithmic, and implementation levels. This module enhances transparency by generating predictions based on human-understandable concepts, quantifying their contributions, and localizing them within the inputs. Finally, we showcase enhanced transparency in image classification, measuring concept contributions and pinpointing their locations within the inputs. Our approach highlights algorithmic interpretability by demonstrating the process of concept learning and its convergence.

Title: MADE: Graph Backdoor Defense with Masked Unlearning

Authors: Xiao Lin amd Mingjie Li, Yisen Wang
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18648
Pdf URL: https://arxiv.org/pdf/2411.18648
Copy Paste: [[2411.18648]] MADE: Graph Backdoor Defense with Masked Unlearning(https://arxiv.org/abs/2411.18648)
Keywords: security, defense, attack
Abstract: Graph Neural Networks (GNNs) have garnered significant attention from researchers due to their outstanding performance in handling graph-related tasks, such as social network analysis, protein design, and so on. Despite their widespread application, recent research has demonstrated that GNNs are vulnerable to backdoor attacks, implemented by injecting triggers into the training datasets. Trained on the poisoned data, GNNs will predict target labels when attaching trigger patterns to inputs. This vulnerability poses significant security risks for applications of GNNs in sensitive domains, such as drug discovery. While there has been extensive research into backdoor defenses for images, strategies to safeguard GNNs against such attacks remain underdeveloped. Furthermore, we point out that conventional backdoor defense methods designed for images cannot work well when directly implemented on graph data. In this paper, we first analyze the key difference between image backdoor and graph backdoor attacks. Then we tackle the graph defense problem by presenting a novel approach called MADE, which devises an adversarial mask generation mechanism that selectively preserves clean sub-graphs and further leverages masks on edge weights to eliminate the influence of triggers effectively. Extensive experiments across various graph classification tasks demonstrate the effectiveness of MADE in significantly reducing the attack success rate (ASR) while maintaining a high classification accuracy.

Title: Dynamic Logistic Ensembles with Recursive Probability and Automatic Subset Splitting for Enhanced Binary Classification

Authors: Mohammad Zubair Khan, David Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18649
Pdf URL: https://arxiv.org/pdf/2411.18649
Copy Paste: [[2411.18649]] Dynamic Logistic Ensembles with Recursive Probability and Automatic Subset Splitting for Enhanced Binary Classification(https://arxiv.org/abs/2411.18649)
Keywords: robust, interpretability
Abstract: This paper presents a novel approach to binary classification using dynamic logistic ensemble models. The proposed method addresses the challenges posed by datasets containing inherent internal clusters that lack explicit feature-based separations. By extending traditional logistic regression, we develop an algorithm that automatically partitions the dataset into multiple subsets, constructing an ensemble of logistic models to enhance classification accuracy. A key innovation in this work is the recursive probability calculation, derived through algebraic manipulation and mathematical induction, which enables scalable and efficient model construction. Compared to traditional ensemble methods such as Bagging and Boosting, our approach maintains interpretability while offering competitive performance. Furthermore, we systematically employ maximum likelihood and cost functions to facilitate the analytical derivation of recursive gradients as functions of ensemble depth. The effectiveness of the proposed approach is validated on a custom dataset created by introducing noise and shifting data to simulate group structures, resulting in significant performance improvements with layers. Implemented in Python, this work balances computational efficiency with theoretical rigor, providing a robust and interpretable solution for complex classification tasks with broad implications for machine learning applications. Code at this https URL

Title: RoMo: Robust Motion Segmentation Improves Structure from Motion

Authors: Lily Goli, Sara Sabour, Mark Matthews, Marcus Brubaker, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, Andrea Tagliasacchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18650
Pdf URL: https://arxiv.org/pdf/2411.18650
Copy Paste: [[2411.18650]] RoMo: Robust Motion Segmentation Improves Structure from Motion(https://arxiv.org/abs/2411.18650)
Keywords: robust, segmentation
Abstract: There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.

Title: PRSI: Privacy-Preserving Recommendation Model Based on Vector Splitting and Interactive Protocols

Authors: Xiaokai Cao, Wenjin Mo, Zhenyu He, Changdong Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18653
Pdf URL: https://arxiv.org/pdf/2411.18653
Copy Paste: [[2411.18653]] PRSI: Privacy-Preserving Recommendation Model Based on Vector Splitting and Interactive Protocols(https://arxiv.org/abs/2411.18653)
Keywords: secure, security, privacy, protect, federate
Abstract: With the development of the internet, recommending interesting products to users has become a highly valuable research topic for businesses. Recommendation systems play a crucial role in addressing this issue. To prevent the leakage of each user's (client's) private data, Federated Recommendation Systems (FedRec) have been proposed and widely used. However, extensive research has shown that FedRec suffers from security issues such as data privacy leakage, and it is challenging to train effective models with FedRec when each client only holds interaction information for a single user. To address these two problems, this paper proposes a new privacy-preserving recommendation system (PRSI), which includes a preprocessing module and two main phases. The preprocessing module employs split vectors and fake interaction items to protect clients' interaction information and recommendation results. The two main phases are: (1) the collection of interaction information and (2) the sending of recommendation results. In the interaction information collection phase, each client uses the preprocessing module and random communication methods (according to the designed interactive protocol) to protect their ID information and IP addresses. In the recommendation results sending phase, the central server uses the preprocessing module and triplets to distribute recommendation results to each client under secure conditions, following the designed interactive protocol. Finally, we conducted multiple sets of experiments to verify the security, accuracy, and communication cost of the proposed method.

Title: HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events

Authors: Dianze Li, Jianing Li, Xu Liu, Zhaokun Zhou, Xiaopeng Fan, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18658
Pdf URL: https://arxiv.org/pdf/2411.18658
Copy Paste: [[2411.18658]] HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events(https://arxiv.org/abs/2411.18658)
Keywords: transformer
Abstract: Combining the complementary benefits of frames and events has been widely used for object detection in challenging scenarios. However, most object detection methods use two independent Artificial Neural Network (ANN) branches, limiting cross-modality information interaction across the two visual streams and encountering challenges in extracting temporal cues from event streams with low power consumption. To address these challenges, we propose HDI-Former, a Hybrid Dynamic Interaction ANN-SNN Transformer, marking the first trial to design a directly trained hybrid ANN-SNN architecture for high-accuracy and energy-efficient object detection using frames and events. Technically, we first present a novel semantic-enhanced self-attention mechanism that strengthens the correlation between image encoding tokens within the ANN Transformer branch for better performance. Then, we design a Spiking Swin Transformer branch to model temporal cues from event streams with low power consumption. Finally, we propose a bio-inspired dynamic interaction mechanism between ANN and SNN sub-networks for cross-modality information interaction. The results demonstrate that our HDI-Former outperforms eleven state-of-the-art methods and our four baselines by a large margin. Our SNN branch also shows comparable performance to the ANN with the same architecture while consuming 10.57$\times$ less energy on the DSEC-Detection dataset. Our open-source code is available in the supplementary material.

Title: OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains

Authors: Yixuan Zhang, Hui Yang, Chuanchen Luo, Junran Peng, Yuxi Wang, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18660
Pdf URL: https://arxiv.org/pdf/2411.18660
Copy Paste: [[2411.18660]] OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains(https://arxiv.org/abs/2411.18660)
Keywords: robust, diffusion
Abstract: Generating realistic 3D human-object interactions (HOIs) from text descriptions is a active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism which includes semantic adjustment and geometry deformation to improve robustness. Experimental results demonstrate that our OOD-HOI could generate more realistic and physically plausible 3D interaction pose in OOD scenarios compared to existing methods.

Title: HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior

Authors: Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin C.K. Chan, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18662
Pdf URL: https://arxiv.org/pdf/2411.18662
Copy Paste: [[2411.18662]] HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior(https://arxiv.org/abs/2411.18662)
Keywords: diffusion, segmentation
Abstract: Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion-based Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation-CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real-ISR scenarios through reduced prompt noise and enhanced spatial control.

Title: Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

Authors: Junha Hyung, Kinam Kim, Susung Hong, Min-Jung Kim, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18664
Pdf URL: https://arxiv.org/pdf/2411.18664
Copy Paste: [[2411.18664]] Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling(https://arxiv.org/abs/2411.18664)
Keywords: diffusion, transformer
Abstract: Diffusion models have emerged as a powerful tool for generating high-quality images, videos, and 3D content. While sampling guidance techniques like CFG improve quality, they reduce diversity and motion. Autoguidance mitigates these issues but demands extra weak model training, limiting its practicality for large-scale models. In this work, we introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include: (1) introducing STG as an efficient, high-performing guidance technique for video diffusion models, (2) eliminating the need for auxiliary models by simulating a weak model through layer skipping, and (3) ensuring quality-enhanced guidance without compromising sample diversity or dynamics unlike CFG. For additional results, visit this https URL.

Title: SpotLight: Shadow-Guided Object Relighting via Diffusion

Authors: Frédéric Fortier-Chouinard, Zitian Zhang, Louis-Etienne Messier, Mathieu Garon, Anand Bhattad, Jean-François Lalonde
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.18665
Pdf URL: https://arxiv.org/pdf/2411.18665
Copy Paste: [[2411.18665]] SpotLight: Shadow-Guided Object Relighting via Diffusion(https://arxiv.org/abs/2411.18665)
Keywords: diffusion
Abstract: Recent work has shown that diffusion models can be used as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. Unlike typical physics-based renderers, however, neural rendering engines are limited by the lack of manual control over the lighting setup, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise lighting control can be achieved for object relighting simply by specifying the desired shadows of the object. Rather surprisingly, we show that injecting only the shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, leverages existing neural rendering approaches and achieves controllable relighting results with no additional training. Specifically, we demonstrate its use with two neural renderers from the recent literature. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting.

Title: Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting

Authors: Hao Liu, Minglin Chen, Yanni Ma, Haihong Xiao, Ying He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18667
Pdf URL: https://arxiv.org/pdf/2411.18667
Copy Paste: [[2411.18667]] Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting(https://arxiv.org/abs/2411.18667)
Keywords: segmentation
Abstract: Pre-training on large-scale unlabeled datasets contribute to the model achieving powerful performance on 3D vision tasks, especially when annotations are limited. However, existing rendering-based self-supervised frameworks are computationally demanding and memory-intensive during pre-training due to the inherent nature of volume rendering. In this paper, we propose an efficient framework named GS$^3$ to learn point cloud representation, which seamlessly integrates fast 3D Gaussian Splatting into the rendering-based framework. The core idea behind our framework is to pre-train the point cloud encoder by comparing rendered RGB images with real RGB images, as only Gaussian points enriched with learned rich geometric and appearance information can produce high-quality renderings. Specifically, we back-project the input RGB-D images into 3D space and use a point cloud encoder to extract point-wise features. Then, we predict 3D Gaussian points of the scene from the learned point cloud features and uses a tile-based rasterizer for image rendering. Finally, the pre-trained point cloud encoder can be fine-tuned to adapt to various downstream 3D tasks, including high-level perception tasks such as 3D segmentation and detection, as well as low-level tasks such as 3D scene reconstruction. Extensive experiments on downstream tasks demonstrate the strong transferability of the pre-trained point cloud encoder and the effectiveness of our self-supervised learning framework. In addition, our GS$^3$ framework is highly efficient, achieving approximately 9$\times$ pre-training speedup and less than 0.25$\times$ memory cost compared to the previous rendering-based framework Ponder.

Title: Towards Chunk-Wise Generation for Long Videos

Authors: Siyang Zhang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18668
Pdf URL: https://arxiv.org/pdf/2411.18668
Copy Paste: [[2411.18668]] Towards Chunk-Wise Generation for Long Videos(https://arxiv.org/abs/2411.18668)
Keywords: diffusion, generative
Abstract: Generating long-duration videos has always been a significant challenge due to the inherent complexity of spatio-temporal domain and the substantial GPU memory demands required to calculate huge size tensors. While diffusion based generative models achieve state-of-the-art performance in video generation task, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with specific resolution and length should be specified at first, and the model will perform denoising on the entire video tensor simultaneously, all the frames together. Such approach will easily raise an out-of-memory (OOM) problem when the specified resolution and/or length exceed a certain limit. One of the solutions to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relation and then concatenate them together to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient $k$-step search solution to mitigate these problems.

Title: SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Authors: Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Qifeng Chen, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18669
Pdf URL: https://arxiv.org/pdf/2411.18669
Copy Paste: [[2411.18669]] SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality(https://arxiv.org/abs/2411.18669)
Keywords: segmentation
Abstract: Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework, SimCMF, to study an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities of different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of different basic components from the most naive design and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new imaging modality. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. SimCMF can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. The code is available at this https URL

Title: TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

Authors: Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18671
Pdf URL: https://arxiv.org/pdf/2411.18671
Copy Paste: [[2411.18671]] TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video(https://arxiv.org/abs/2411.18671)
Keywords: robust
Abstract: In this paper, we present TAPTRv3, which is built upon TAPTRv2 to improve its point tracking robustness in long videos. TAPTRv2 is a simple DETR-like framework that can accurately track any point in real-world videos without requiring cost-volume. TAPTRv3 improves TAPTRv2 by addressing its shortage in querying high quality features from long videos, where the target tracking points normally undergo increasing variation over time. In TAPTRv3, we propose to utilize both spatial and temporal context to bring better feature querying along the spatial and temporal dimensions for more robust tracking in long videos. For better spatial feature querying, we present Context-aware Cross-Attention (CCA), which leverages surrounding spatial context to enhance the quality of attention scores when querying image features. For better temporal feature querying, we introduce Visibility-aware Long-Temporal Attention (VLTA) to conduct temporal attention to all past frames while considering their corresponding visibilities, which effectively addresses the feature drifting problem in TAPTRv2 brought by its RNN-like long-temporal modeling. TAPTRv3 surpasses TAPTRv2 by a large margin on most of the challenging datasets and obtains state-of-the-art performance. Even when compared with methods trained with large-scale extra internal data, TAPTRv3 is still competitive.

Title: FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models

Authors: Alice Heiman, Xiaoman Zhang, Emma Chen, Sung Eun Kim, Pranav Rajpurkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18672
Pdf URL: https://arxiv.org/pdf/2411.18672
Copy Paste: [[2411.18672]] FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models(https://arxiv.org/abs/2411.18672)
Keywords: large language model
Abstract: Medical vision-language model models often struggle with generating accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code generation capabilities of large language models to solve measurement queries generated based on the original report. After extracting measurable findings, the results are incorporated into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models and achieves an average improvement of 94.0% in reducing measurement hallucinations measured by mean absolute error.

Title: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18673
Pdf URL: https://arxiv.org/pdf/2411.18673
Copy Paste: [[2411.18673]] AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers(https://arxiv.org/abs/2411.18673)
Keywords: diffusion, transformer, generative
Abstract: Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This suggested us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to 4x reduction of training parameters, improved training speed and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.

Title: Active Data Curation Effectively Distills Large-Scale Multimodal Models

Authors: Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, Talfan Evans, Samuel Albanie, Federico Tombari, Yongqin Xian, Alessio Tonioni, Olivier J. Hénaff
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18674
Pdf URL: https://arxiv.org/pdf/2411.18674
Copy Paste: [[2411.18674]] Active Data Curation Effectively Distills Large-Scale Multimodal Models(https://arxiv.org/abs/2411.18674)
Keywords: generative
Abstract: Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy to in fact be complementary to standard KD, and can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with upto 11% less inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.

Title: GaussianSpeech: Audio-Driven Gaussian Avatars

Authors: Shivangi Aneja, Artem Sevastopolsky, Tobias Kirschstein, Justus Thies, Angela Dai, Matthias Nießner
Subjects: cs.CV, cs.AI, cs.GR, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.18675
Pdf URL: https://arxiv.org/pdf/2411.18675
Copy Paste: [[2411.18675]] GaussianSpeech: Audio-Driven Gaussian Avatars(https://arxiv.org/abs/2411.18675)
Keywords: transformer
Abstract: We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from audio input. Due to the absence of high-quality datasets of talking humans in correspondence with audio, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real time rendering rates, while encompassing diverse facial expressions and styles.

Title: MatchDiffusion: Training-free Generation of Match-cuts

Authors: Alejandro Pardo, Fabio Pizzati, Tong Zhang, Alexander Pondaven, Philip Torr, Juan Camilo Perez, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18677
Pdf URL: https://arxiv.org/pdf/2411.18677
Copy Paste: [[2411.18677]] MatchDiffusion: Training-free Generation of Match-cuts(https://arxiv.org/abs/2411.18677)
Keywords: diffusion
Abstract: Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting match-cuts is a challenging, resource-intensive process requiring deliberate artistic planning. In MatchDiffusion, we present the first training-free method for match-cut generation using text-to-video diffusion models. MatchDiffusion leverages a key property of diffusion models: early denoising steps define the scene's broad structure, while later steps add details. Guided by this insight, MatchDiffusion employs "Joint Diffusion" to initialize generation for two prompts from shared noise, aligning structure and motion. It then applies "Disjoint Diffusion", allowing the videos to diverge and introduce unique details. This approach produces visually coherent videos suited for match-cuts. User studies and metrics demonstrate MatchDiffusion's effectiveness and potential to democratize match-cut creation.

Title: Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, Amrit Singh Bedi
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18688
Pdf URL: https://arxiv.org/pdf/2411.18688
Copy Paste: [[2411.18688]] Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment(https://arxiv.org/abs/2411.18688)
Keywords: defense, attack, large language model
Abstract: With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks: carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.

Title: An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)

Authors: Ted Kwartler, Nataliia Bagan, Ivan Banny, Alan Aqrawi, Arian Abbasi
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2411.18699
Pdf URL: https://arxiv.org/pdf/2411.18699
Copy Paste: [[2411.18699]] An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)(https://arxiv.org/abs/2411.18699)
Keywords: attack, robust
Abstract: The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and Abbasi [2024], is an innovative method designed to bypass the ethical safeguards of text-to-text AI models, compelling them to generate harmful content. This technique leverages a strategic escalation of context within a single prompt, combined with trust-building mechanisms, to subtly deceive the model into producing unintended outputs. Extending the application of STCA to text-to-image models, we demonstrate its efficacy by compromising the guardrails of a widely-used model, DALL-E 3, achieving outputs comparable to outputs from the uncensored model Flux Schnell, which served as a baseline control. This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models and benchmark their resilience against adversarial attacks.

Title: On the Effectiveness of Incremental Training of Large Language Models

Authors: Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18700
Pdf URL: https://arxiv.org/pdf/2411.18700
Copy Paste: [[2411.18700]] On the Effectiveness of Incremental Training of Large Language Models(https://arxiv.org/abs/2411.18700)
Keywords: large language model
Abstract: Training large language models is a computationally intensive process that often requires substantial resources to achieve state-of-the-art results. Incremental layer-wise training has been proposed as a potential strategy to optimize the training process by progressively introducing layers, with the expectation that this approach would lead to faster convergence and more efficient use of computational resources. In this paper, we investigate the effectiveness of incremental training for LLMs, dividing the training process into multiple stages where layers are added progressively. Our experimental results indicate that while the incremental approach initially demonstrates some computational efficiency, it ultimately requires greater overall computational costs to reach comparable performance to traditional full-scale training. Although the incremental training process can eventually close the performance gap with the baseline, it does so only after significantly extended continual training. These findings suggest that incremental layer-wise training may not be a viable alternative for training large language models, highlighting its limitations and providing valuable insights into the inefficiencies of this approach.

Title: Random Walks with Tweedie: A Unified Framework for Diffusion Models

Authors: Chicago Y. Park, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.18702
Pdf URL: https://arxiv.org/pdf/2411.18702
Copy Paste: [[2411.18702]] Random Walks with Tweedie: A Unified Framework for Diffusion Models(https://arxiv.org/abs/2411.18702)
Keywords: diffusion, generative
Abstract: We present a simple template for designing generative diffusion model algorithms based on an interpretation of diffusion sampling as a sequence of random walks. Score-based diffusion models are widely used to generate high-quality images. Diffusion models have also been shown to yield state-of-the-art performance in many inverse problems. While these algorithms are often surprisingly simple, the theory behind them is not, and multiple complex theoretical justifications exist in the literature. Here, we provide a simple and largely self-contained theoretical justification for score-based-diffusion models that avoids using the theory of Markov chains or reverse diffusion, instead centering the theory of random walks and Tweedie's formula. This approach leads to unified algorithmic templates for network training and sampling. In particular, these templates cleanly separate training from sampling, e.g., the noise schedule used during training need not match the one used during sampling. We show that several existing diffusion models correspond to particular choices within this template and demonstrate that other, more straightforward algorithmic choices lead to effective diffusion models. The proposed framework has the added benefit of enabling conditional sampling without any likelihood approximation.

Title: Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits

Authors: Daniel Morales-Brotons, Thijs Vogels, Hadrien Hendrikx
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18704
Pdf URL: https://arxiv.org/pdf/2411.18704
Copy Paste: [[2411.18704]] Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits(https://arxiv.org/abs/2411.18704)
Keywords: robust
Abstract: Weight averaging of Stochastic Gradient Descent (SGD) iterates is a popular method for training deep learning models. While it is often used as part of complex training pipelines to improve generalization or serve as a `teacher' model, weight averaging lacks proper evaluation on its own. In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning. Therefore, we suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.

Title: Evaluating Vision-Language Models as Evaluators in Path Planning

Authors: Mohamed Aghzal, Xiang Yue, Erion Plaku, Ziyu Yao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.18711
Pdf URL: https://arxiv.org/pdf/2411.18711
Copy Paste: [[2411.18711]] Evaluating Vision-Language Models as Evaluators in Path Planning(https://arxiv.org/abs/2411.18711)
Keywords: large language model
Abstract: Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.

Title: Generative Visual Communication in the Era of Vision-Language Models

Authors: Yael Vinker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18727
Pdf URL: https://arxiv.org/pdf/2411.18727
Copy Paste: [[2411.18727]] Generative Visual Communication in the Era of Vision-Language Models(https://arxiv.org/abs/2411.18727)
Keywords: generative
Abstract: Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today's visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models' operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.

Title: The Last Mile to Supervised Performance: Semi-Supervised Domain Adaptation for Semantic Segmentation

Authors: Daniel Morales-Brotons, Grigorios Chrysos, Stratis Tzoumas, Volkan Cevher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18728
Pdf URL: https://arxiv.org/pdf/2411.18728
Copy Paste: [[2411.18728]] The Last Mile to Supervised Performance: Semi-Supervised Domain Adaptation for Semantic Segmentation(https://arxiv.org/abs/2411.18728)
Keywords: segmentation
Abstract: Supervised deep learning requires massive labeled datasets, but obtaining annotations is not always easy or possible, especially for dense tasks like semantic segmentation. To overcome this issue, numerous works explore Unsupervised Domain Adaptation (UDA), which uses a labeled dataset from another domain (source), or Semi-Supervised Learning (SSL), which trains on a partially labeled set. Despite the success of UDA and SSL, reaching supervised performance at a low annotation cost remains a notoriously elusive goal. To address this, we study the promising setting of Semi-Supervised Domain Adaptation (SSDA). We propose a simple SSDA framework that combines consistency regularization, pixel contrastive learning, and self-training to effectively utilize a few target-domain labels. Our method outperforms prior art in the popular GTA-to-Cityscapes benchmark and shows that as little as 50 target labels can suffice to achieve near-supervised performance. Additional results on Synthia-to-Cityscapes, GTA-to-BDD and Synthia-to-BDD further demonstrate the effectiveness and practical utility of the method. Lastly, we find that existing UDA and SSL methods are not well-suited for the SSDA setting and discuss design patterns to adapt them.

Title: DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration

Authors: Zheyan Zhang, Diego Klabjan, Renee CB Manworren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18745
Pdf URL: https://arxiv.org/pdf/2411.18745
Copy Paste: [[2411.18745]] DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration(https://arxiv.org/abs/2411.18745)
Keywords: diffusion
Abstract: In this work, we address a challenge in video inpainting: reconstructing occluded regions in dynamic, real-world scenarios. Motivated by the need for continuous human motion monitoring in healthcare settings, where facial features are frequently obscured, we propose a diffusion-based video-level inpainting model, DiffMVR. Our approach introduces a dynamic dual-guided image prompting system, leveraging adaptive reference frames to guide the inpainting process. This enables the model to capture both fine-grained details and smooth transitions between video frames, offering precise control over inpainting direction and significantly improving restoration accuracy in challenging, dynamic environments. DiffMVR represents a significant advancement in the field of diffusion-based inpainting, with practical implications for real-time applications in various dynamic settings.

Title: Inference Privacy: Properties and Mechanisms

Authors: Fengwei Tian, Ravi Tandon
Subjects: cs.CR, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18746
Pdf URL: https://arxiv.org/pdf/2411.18746
Copy Paste: [[2411.18746]] Inference Privacy: Properties and Mechanisms(https://arxiv.org/abs/2411.18746)
Keywords: privacy
Abstract: Ensuring privacy during inference stage is crucial to prevent malicious third parties from reconstructing users' private inputs from outputs of public models. Despite a large body of literature on privacy preserving learning (which ensures privacy of training data), there is no existing systematic framework to ensure the privacy of users' data during inference. Motivated by this problem, we introduce the notion of Inference Privacy (IP), which can allow a user to interact with a model (for instance, a classifier, or an AI-assisted chat-bot) while providing a rigorous privacy guarantee for the users' data at inference. We establish fundamental properties of the IP privacy notion and also contrast it with the notion of Local Differential Privacy (LDP). We then present two types of mechanisms for achieving IP: namely, input perturbations and output perturbations which are customizable by the users and can allow them to navigate the trade-off between utility and privacy. We also demonstrate the usefulness of our framework via experiments and highlight the resulting trade-offs between utility and privacy during inference.

Title: Locally Differentially Private Online Federated Learning With Correlated Noise

Authors: Jiaojiao Zhang, Linglingzhi Zhu, Dominik Fay, Mikael Johansson
Subjects: cs.LG, cs.DC, stat.ML
Abstract URL: https://arxiv.org/abs/2411.18752
Pdf URL: https://arxiv.org/pdf/2411.18752
Copy Paste: [[2411.18752]] Locally Differentially Private Online Federated Learning With Correlated Noise(https://arxiv.org/abs/2411.18752)
Keywords: privacy, federate
Abstract: We introduce a locally differentially private (LDP) algorithm for online federated learning that employs temporally correlated noise to improve utility while preserving privacy. To address challenges posed by the correlated noise and local updates with streaming non-IID data, we develop a perturbed iterate analysis that controls the impact of the noise on the utility. Moreover, we demonstrate how the drift errors from local updates can be effectively managed for several classes of nonconvex loss functions. Subject to an $(\epsilon,\delta)$-LDP budget, we establish a dynamic regret bound that quantifies the impact of key parameters and the intensity of changes in the dynamic environment on the learning performance. Numerical experiments confirm the efficacy of the proposed algorithm.

Title: Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models

Authors: Weiqiu You, Youngja Park
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2411.18755
Pdf URL: https://arxiv.org/pdf/2411.18755
Copy Paste: [[2411.18755]] Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models(https://arxiv.org/abs/2411.18755)
Keywords: security, attack, large language model
Abstract: Understanding the attack patterns associated with a cyberattack is crucial for comprehending the attacker's behaviors and implementing the right mitigation measures. However, majority of the information regarding new attacks is typically presented in unstructured text, posing significant challenges for security analysts in collecting necessary information. In this paper, we present a sentence classification system that can identify the attack techniques described in natural language sentences from cyber threat intelligence (CTI) reports. We propose a new method for utilizing auxiliary data with the same labels to improve classification for the low-resource cyberattack classification task. The system first trains the model using the augmented training data and then trains more using only the primary data. We validate our model using the TRAM data1 and the MITRE ATT&CK framework. Experiments show that our method enhances Macro-F1 by 5 to 9 percentage points and keeps Micro-F1 scores competitive when compared to the baseline performance on the TRAM dataset.

Title: CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding

Authors: Xiaoyu Deng, Zhengjian Kang, Xintao Li, Yongzhe Zhang, Tianmin Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18764
Pdf URL: https://arxiv.org/pdf/2411.18764
Copy Paste: [[2411.18764]] CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding(https://arxiv.org/abs/2411.18764)
Keywords: extraction, segmentation
Abstract: Graphic visual content helps in promoting information communication and inspiration divergence. However, the interpretation of visual content currently relies mainly on humans' personal knowledge background, thereby affecting the quality and efficiency of information acquisition and understanding. To improve the quality and efficiency of visual information transmission and avoid the limitation of the observer due to the information cocoon, we propose CoVis, a collaborative framework for fine-grained visual understanding. By designing and implementing a cascaded dual-layer segmentation network coupled with a large-language-model (LLM) based content generator, the framework extracts as much knowledge as possible from an image. Then, it generates visual analytics for images, assisting observers in comprehending imagery from a more holistic perspective. Quantitative experiments and qualitative experiments based on 32 human participants indicate that the CoVis has better performance than current methods in feature extraction and can generate more comprehensive and detailed visual descriptions than current general-purpose large models.

Title: Fall Leaf Adversarial Attack on Traffic Sign Classification

Authors: Anthony Etim, Jakub Szefer
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2411.18776
Pdf URL: https://arxiv.org/pdf/2411.18776
Copy Paste: [[2411.18776]] Fall Leaf Adversarial Attack on Traffic Sign Classification(https://arxiv.org/abs/2411.18776)
Keywords: attack
Abstract: Adversarial input image perturbation attacks have emerged as a significant threat to machine learning algorithms, particularly in image classification setting. These attacks involve subtle perturbations to input images that cause neural networks to misclassify the input images, even though the images remain easily recognizable to humans. One critical area where adversarial attacks have been demonstrated is in automotive systems where traffic sign classification and recognition is critical, and where misclassified images can cause autonomous systems to take wrong actions. This work presents a new class of adversarial attacks. Unlike existing work that has focused on adversarial perturbations that leverage human-made artifacts to cause the perturbations, such as adding stickers, paint, or shining flashlights at traffic signs, this work leverages nature-made artifacts: tree leaves. By leveraging nature-made artifacts, the new class of attacks has plausible deniability: a fall leaf stuck to a street sign could come from a near-by tree, rather than be placed there by an malicious human attacker. To evaluate the new class of the adversarial input image perturbation attacks, this work analyses how fall leaves can cause misclassification in street signs. The work evaluates various leaves from different species of trees, and considers various parameters such as size, color due to tree leaf type, and rotation. The work demonstrates high success rate for misclassification. The work also explores the correlation between successful attacks and how they affect the edge detection, which is critical in many image classification algorithms.

Title: MRI Breast tissue segmentation using nnU-Net for biomechanical modeling

Authors: Melika Pooyan, Hadeel Awwad, Eloy García, Robert Martí
Subjects: cs.CV, eess.IV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2411.18784
Pdf URL: https://arxiv.org/pdf/2411.18784
Copy Paste: [[2411.18784]] MRI Breast tissue segmentation using nnU-Net for biomechanical modeling(https://arxiv.org/abs/2411.18784)
Keywords: segmentation
Abstract: Integrating 2D mammography with 3D magnetic resonance imaging (MRI) is crucial for improving breast cancer diagnosis and treatment planning. However, this integration is challenging due to differences in imaging modalities and the need for precise tissue segmentation and alignment. This paper addresses these challenges by enhancing biomechanical breast models in two main aspects: improving tissue identification using nnU-Net segmentation models and evaluating finite element (FE) biomechanical solvers, specifically comparing NiftySim and FEBio. We performed a detailed six-class segmentation of breast MRI data using the nnU-Net architecture, achieving Dice Coefficients of 0.94 for fat, 0.88 for glandular tissue, and 0.87 for pectoral muscle. The overall foreground segmentation reached a mean Dice Coefficient of 0.83 through an ensemble of 2D and 3D U-Net configurations, providing a solid foundation for 3D reconstruction and biomechanical modeling. The segmented data was then used to generate detailed 3D meshes and develop biomechanical models using NiftySim and FEBio, which simulate breast tissue's physical behaviors under compression. Our results include a comparison between NiftySim and FEBio, providing insights into the accuracy and reliability of these simulations in studying breast tissue responses under compression. The findings of this study have the potential to improve the integration of 2D and 3D imaging modalities, thereby enhancing diagnostic accuracy and treatment planning for breast cancer.

Title: UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMS

Authors: Haomin Zhuang, Yihua Zhang, Kehan Guo, Jinghan Jia, Gaowen Liu, Sijia Liu, Xiangliang Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.18797
Pdf URL: https://arxiv.org/pdf/2411.18797
Copy Paste: [[2411.18797]] UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMS(https://arxiv.org/abs/2411.18797)
Keywords: large language model
Abstract: Recent advancements in large language model (LLM) unlearning have shown remarkable success in removing unwanted data-model influences while preserving the model's utility for legitimate knowledge. However, despite these strides, sparse Mixture-of-Experts (MoE) LLMs--a key subset of the LLM family--have received little attention and remain largely unexplored in the context of unlearning. As MoE LLMs are celebrated for their exceptional performance and highly efficient inference processes, we ask: How can unlearning be performed effectively and efficiently on MoE LLMs? And will traditional unlearning methods be applicable to MoE architectures? Our pilot study shows that the dynamic routing nature of MoE LLMs introduces unique challenges, leading to substantial utility drops when existing unlearning methods are applied. Specifically, unlearning disrupts the router's expert selection, causing significant selection shift from the most unlearning target-related experts to irrelevant ones. As a result, more experts than necessary are affected, leading to excessive forgetting and loss of control over which knowledge is erased. To address this, we propose a novel single-expert unlearning framework, referred to as UOE, for MoE LLMs. Through expert attribution, unlearning is concentrated on the most actively engaged expert for the specified knowledge. Concurrently, an anchor loss is applied to the router to stabilize the active state of this targeted expert, ensuring focused and controlled unlearning that preserves model utility. The proposed UOE framework is also compatible with various unlearning algorithms. Extensive experiments demonstrate that UOE enhances both forget quality up to 5% and model utility by 35% on MoE LLMs across various benchmarks, LLM architectures, while only unlearning 0.06% of the model parameters.

Title: Formal Verification of Digital Twins with TLA and Information Leakage Control

Authors: Luwen Huang, Lav R. Varshney, Karen E. Willcox
Subjects: cs.CR, cs.DC, cs.IT, eess.SY
Abstract URL: https://arxiv.org/abs/2411.18798
Pdf URL: https://arxiv.org/pdf/2411.18798
Copy Paste: [[2411.18798]] Formal Verification of Digital Twins with TLA and Information Leakage Control(https://arxiv.org/abs/2411.18798)
Keywords: security
Abstract: Verifying the correctness of a digital twin provides a formal guarantee that the digital twin operates as intended. Digital twin verification is challenging due to the presence of uncertainties in the virtual representation, the physical environment, and the bidirectional flow of information between physical and virtual. A further challenge is that a digital twin of a complex system is composed of distributed components. This paper presents a methodology to specify and verify digital twin behavior, translating uncertain processes into a formally verifiable finite state machine. We use the Temporal Logic of Actions (TLA) to create a specification, an implementation abstraction that defines the properties required for correct system behavior. Our approach includes a novel weakening of formal security properties, allowing controlled information leakage while preserving theoretical guarantees. We demonstrate this approach on a digital twin of an unmanned aerial vehicle, verifying synchronization of physical-to-virtual and virtual-to-digital data flows to detect unintended misalignments.

Title: Stratified Non-Negative Tensor Factorization

Authors: Alexander Sietsema, Zerrin Vural, James Chapman, Yotam Yaniv, Deanna Needell
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2411.18805
Pdf URL: https://arxiv.org/pdf/2411.18805
Copy Paste: [[2411.18805]] Stratified Non-Negative Tensor Factorization(https://arxiv.org/abs/2411.18805)
Keywords: interpretability
Abstract: Non-negative matrix factorization (NMF) and non-negative tensor factorization (NTF) decompose non-negative high-dimensional data into non-negative low-rank components. NMF and NTF methods are popular for their intrinsic interpretability and effectiveness on large-scale data. Recent work developed Stratified-NMF, which applies NMF to regimes where data may come from different sources (strata) with different underlying distributions, and seeks to recover both strata-dependent information and global topics shared across strata. Applying Stratified-NMF to multi-modal data requires flattening across modes, and therefore loses geometric structure contained implicitly within the tensor. To address this problem, we extend Stratified-NMF to the tensor setting by developing a multiplicative update rule and demonstrating the method on text and image data. We find that Stratified-NTF can identify interpretable topics with lower memory requirements than Stratified-NMF. We also introduce a regularized version of the method and demonstrate its effects on image data.

Title: Reconstructing Animals and the Wild

Authors: Peter Kulits, Michael J. Black, Silvia Zuffi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.18807
Pdf URL: https://arxiv.org/pdf/2411.18807
Copy Paste: [[2411.18807]] Reconstructing Animals and the Wild(https://arxiv.org/abs/2411.18807)
Keywords: large language model
Abstract: The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here, we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes containing trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting environmental context. This limits their usefulness for analysis tasks, as animals exist inherently within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct natural scenes from single images. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one million images and thousands of assets. Our approach, having been trained solely on synthetic data, generalizes to the task of reconstructing animals and their environments in real-world images. We will release our dataset and code to encourage future research at this https URL

Title: Lifting Motion to the 3D World via 2D Diffusion

Authors: Jiaman Li, C. Karen Liu, Jiajun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18808
Pdf URL: https://arxiv.org/pdf/2411.18808
Copy Paste: [[2411.18808]] Lifting Motion to the 3D World via 2D Diffusion(https://arxiv.org/abs/2411.18808)
Keywords: diffusion
Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.

Title: Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18810
Pdf URL: https://arxiv.org/pdf/2411.18810
Copy Paste: [[2411.18810]] Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds(https://arxiv.org/abs/2411.18810)
Keywords: diffusion
Abstract: Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-{\alpha}, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-{\alpha}.

Title: FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

Authors: Junyang Chen, Jinshan Pan, Jiangxin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18824
Pdf URL: https://arxiv.org/pdf/2411.18824
Copy Paste: [[2411.18824]] FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution(https://arxiv.org/abs/2411.18824)
Keywords: diffusion
Abstract: Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs to align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, facilitating the encoder to extract useful features that coincide with diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.

Title: Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark

Authors: Jianyou Wang, Weili Cao, Longtian Bao, Youze Zheng, Gil Pasternak, Kaicheng Wang, Xiaoyue Wang, Ramamohan Paturi, Leon Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18831
Pdf URL: https://arxiv.org/pdf/2411.18831
Copy Paste: [[2411.18831]] Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark(https://arxiv.org/abs/2411.18831)
Keywords: large language model
Abstract: Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of available evidence, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. The four benchmark tasks, drawn from more than 500 papers, cover the analysis of research study methodology, followed by evaluation of risk of bias in these studies. The benchmark contains 2000 expert-generated bias annotations, and a human-validated pipeline for fine-grained alignment with research paper content. We evaluate a range of large language models on the benchmark, and find that these models fall significantly short of expert-level performance. By providing a standardized tool for measuring judgments of study quality, the benchmark can help to guide systems that perform large-scale aggregation of scientific data. The dataset is available at this https URL.

Title: Sharing the Path: A Threshold Scheme from Isogenies and Error Correcting Codes

Authors: Mohamadou Sall, M. Anwar Hasan
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2411.18844
Pdf URL: https://arxiv.org/pdf/2411.18844
Copy Paste: [[2411.18844]] Sharing the Path: A Threshold Scheme from Isogenies and Error Correcting Codes(https://arxiv.org/abs/2411.18844)
Keywords: security, attack
Abstract: In 2022, a prominent supersingular isogeny-based cryptographic scheme, namely SIDH, was compromised by a key recovery attack. However, this attack does not undermine the isogeny path problem, which remains central to the security of isogeny-based cryptography. Following the attacks by Castryck and Decru, as well as Maino and Martindale, Robert gave a mature and polynomial-time algorithm that transforms the SIDH key recovery attack into a valuable cryptographic tool. In this paper, we combine this tool with advanced encoding techniques to construct a novel threshold scheme.

Title: An Integrated Artificial Intelligence Operating System for Advanced Low-Altitude Aviation Applications

Authors: Minzhe Tan, Xinlin Fan, Jian He, Yi Hou, Zhan Liu, Yaopeng Jiang, YM Jiang
Subjects: cs.LG, cs.AI, cs.OS
Abstract URL: https://arxiv.org/abs/2411.18845
Pdf URL: https://arxiv.org/pdf/2411.18845
Copy Paste: [[2411.18845]] An Integrated Artificial Intelligence Operating System for Advanced Low-Altitude Aviation Applications(https://arxiv.org/abs/2411.18845)
Keywords: robust
Abstract: This paper introduces a comprehensive artificial intelligence operating system tailored for low-altitude aviation applications, integrating cutting-edge technologies for enhanced performance, safety, and efficiency. The system comprises six core components: OrinFlight OS, a high-performance operating system optimized for real-time task execution; UnitedVision, a versatile visual processing module supporting advanced image analysis; UnitedSense, a multi-sensor fusion module providing precise environmental modeling; UnitedNavigator, a dynamic path-planning and navigation system; UnitedMatrix, enabling multi-drone coordination and task execution; and UnitedInSight, a ground station for monitoring and management. Complemented by the UA DevKit low-code platform, the system facilitates user-friendly customization and application development. Leveraging NVIDIA Orin's computational power and advanced AI algorithms, this system addresses complex challenges in modern aviation, offering robust solutions for navigation, perception, and collaborative operations. This work highlights the system's architecture, features, and potential applications, demonstrating its ability to meet the demands of intelligent aviation environments.

Title: CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction

Authors: Lipeng Gu, Xuefeng Yan, Weiming Wang, Honghua Chen, Dingkun Zhu, Liangliang Nan, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18850
Pdf URL: https://arxiv.org/pdf/2411.18850
Copy Paste: [[2411.18850]] CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction(https://arxiv.org/abs/2411.18850)
Keywords: robust
Abstract: The fusion of camera- and LiDAR-based detections offers a promising solution to mitigate tracking failures in 3D multi-object tracking (MOT). However, existing methods predominantly exploit camera detections to correct tracking failures caused by potential LiDAR detection problems, neglecting the reciprocal benefit of refining camera detections using LiDAR data. This limitation is rooted in their single-stage architecture, akin to single-stage object detectors, lacking a dedicated trajectory refinement module to fully exploit the complementary multi-modal information. To this end, we introduce CrossTracker, a novel two-stage paradigm for online multi-modal 3D MOT. CrossTracker operates in a coarse-to-fine manner, initially generating coarse trajectories and subsequently refining them through an independent refinement process. Specifically, CrossTracker incorporates three essential modules: i) a multi-modal modeling (M^3) module that, by fusing multi-modal information (images, point clouds, and even plane geometry extracted from images), provides a robust metric for subsequent trajectory generation. ii) a coarse trajectory generation (C-TG) module that generates initial coarse dual-stream trajectories, and iii) a trajectory refinement (TR) module that refines coarse trajectories through cross correction between camera and LiDAR streams. Comprehensive experiments demonstrate the superior performance of our CrossTracker over its eighteen competitors, underscoring its effectiveness in harnessing the synergistic benefits of camera and LiDAR sensors for robust multi-modal 3D MOT.

Title: COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection

Authors: Xiaoqin Zhang, Zhenni Yu, Li Zhao, Deng-Ping Fan, Guobao Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18858
Pdf URL: https://arxiv.org/pdf/2411.18858
Copy Paste: [[2411.18858]] COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection(https://arxiv.org/abs/2411.18858)
Keywords: extraction, segmentation
Abstract: We rethink the segment anything model (SAM) and propose a novel multiprompt network called COMPrompter for camouflaged object detection (COD). SAM has zero-shot generalization ability beyond other models and can provide an ideal framework for COD. Our network aims to enhance the single prompt strategy in SAM to a multiprompt strategy. To achieve this, we propose an edge gradient extraction module, which generates a mask containing gradient information regarding the boundaries of camouflaged objects. This gradient mask is then used as a novel boundary prompt, enhancing the segmentation process. Thereafter, we design a box-boundary mutual guidance module, which fosters more precise and comprehensive feature extraction via mutual guidance between a boundary prompt and a box prompt. This collaboration enhances the model's ability to accurately detect camouflaged objects. Moreover, we employ the discrete wavelet transform to extract high-frequency features from image embeddings. The high-frequency features serve as a supplementary component to the multiprompt system. Finally, our COMPrompter guides the network to achieve enhanced segmentation results, thereby advancing the development of SAM in terms of COD. Experimental results across COD benchmarks demonstrate that COMPrompter achieves a cutting-edge performance, surpassing the current leading model by an average positive metric of 2.2% in COD10K. In the specific application of COD, the experimental results in polyp segmentation show that our model is superior to top-tier methods as well. The code will be made available at this https URL.

Title: Improving Batch Normalization with TTA for Robust Object Detection in Self-Driving

Authors: Dacheng Liao, Mengshi Qi, Liang Liu, Huadong Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18860
Pdf URL: https://arxiv.org/pdf/2411.18860
Copy Paste: [[2411.18860]] Improving Batch Normalization with TTA for Robust Object Detection in Self-Driving(https://arxiv.org/abs/2411.18860)
Keywords: robust
Abstract: In current open real-world autonomous driving scenarios, challenges such as sensor failure and extreme weather conditions hinder the generalization of most autonomous driving perception models to these unseen domain due to the domain shifts between the test and training data. As the parameter scale of autonomous driving perception models grows, traditional test-time adaptation (TTA) methods become unstable and often degrade model performance in most scenarios. To address these challenges, this paper proposes two new robust methods to improve the Batch Normalization with TTA for object detection in autonomous driving: (1) We introduce a LearnableBN layer based on Generalized-search Entropy Minimization (GSEM) method. Specifically, we modify the traditional BN layer by incorporating auxiliary learnable parameters, which enables the BN layer to dynamically update the statistics according to the different input data. (2) We propose a new semantic-consistency based dual-stage-adaptation strategy, which encourages the model to iteratively search for the optimal solution and eliminates unstable samples during the adaptation process. Extensive experiments on the NuScenes-C dataset shows that our method achieves a maximum improvement of about 8% using BEVFormer as the baseline model across six corruption types and three levels of severity. We will make our source code available soon.

Title: Swarm Intelligence-Driven Client Selection for Federated Learning in Cybersecurity applications

Authors: Koffka Khan, Wayne Goodridge
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18877
Pdf URL: https://arxiv.org/pdf/2411.18877
Copy Paste: [[2411.18877]] Swarm Intelligence-Driven Client Selection for Federated Learning in Cybersecurity applications(https://arxiv.org/abs/2411.18877)
Keywords: security, robust, federate
Abstract: This study addresses a critical gap in the literature regarding the use of Swarm Intelligence Optimization (SI) algorithms for client selection in Federated Learning (FL), with a focus on cybersecurity applications. Existing research primarily explores optimization techniques for centralized machine learning, leaving the unique challenges of client diveristy, non-IID data distributions, and adversarial noise in decentralized FL largely unexamined. To bridge this gap, we evaluate nine SI algorithms-Grey Wolf Optimization (GWO), Particle Swarm Optimization (PSO), Cuckoo Search, Bat Algorithm, Bee Colony, Ant Colony Optimization, Fish Swarm, Glow Worm, and Intelligent Water Droplet-across four experimental scenarios: fixed client participation, dynamic participation patterns, hetergeneous non-IID data distributions, and adversarial noise conditions. Results indicate that GWO exhibits superior adaptability and robustness, achieving the highest accuracy, recall and F1-scoress across all configurations, while PSO and Cuckoo Search also demonstrate strong performance. These findings underscore the potential of SI algorithms to address decentralized and adversarial FL challenges, offereing scalable and resilient solutions for cybersecurity applications, including intrusion detection in IoT and large-scale networks.

Title: Sneaking Syntax into Transformer Language Models with Tree Regularization

Authors: Ananjan Nandi, Christopher D. Manning, Shikhar Murty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18885
Pdf URL: https://arxiv.org/pdf/2411.18885
Copy Paste: [[2411.18885]] Sneaking Syntax into Transformer Language Models with Tree Regularization(https://arxiv.org/abs/2411.18885)
Keywords: robust, transformer
Abstract: While compositional accounts of human language understanding are based on a hierarchical tree-like process, neural models like transformers lack a direct inductive bias for such tree structures. Introducing syntactic inductive biases could unlock more robust and data-efficient learning in transformer language models (LMs), but existing methods for incorporating such structure greatly restrict models, either limiting their expressivity or increasing inference complexity. This work instead aims to softly inject syntactic inductive biases into given transformer circuits, through a structured regularizer. We introduce TREEREG, an auxiliary loss function that converts bracketing decisions from silver parses into a set of differentiable orthogonality constraints on vector hidden states. TREEREG integrates seamlessly with the standard LM objective, requiring no architectural changes. LMs pre-trained with TreeReg on natural language corpora such as WikiText-103 achieve up to 10% lower perplexities on out-of-distribution data and up to 9.5 point improvements in syntactic generalization, requiring less than half the training data to outperform standard LMs. TreeReg still provides gains for pre-trained LLMs: Continued pre-training of Sheared Llama with TreeReg results in improved syntactic generalization, and fine-tuning on MultiNLI with TreeReg mitigates degradation of performance on adversarial NLI benchmarks by 41.2 points.

Title: T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving

Authors: Changsheng Lv, Mengshi Qi, Liang Liu, Huadong Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18894
Pdf URL: https://arxiv.org/pdf/2411.18894
Copy Paste: [[2411.18894]] T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving(https://arxiv.org/abs/2411.18894)
Keywords: transformer
Abstract: Understanding the traffic scenes and then generating high-definition (HD) maps present significant challenges in autonomous driving. In this paper, we defined a novel Traffic Topology Scene Graph, a unified scene graph explicitly modeling the lane, controlled and guided by different road signals (e.g., right turn), and topology relationships among them, which is always ignored by previous high-definition (HD) mapping methods. For the generation of T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among the centerline of lanes to guide the aggregation of global information. Furthermore, we proposed a Counterfactual Intervention Layer (CIL) to model the reasonable road structure ( e.g., intersection, straight) among lanes under counterfactual intervention. Then the generated T2SG can provide a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.

Title: Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Authors: Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.18895
Pdf URL: https://arxiv.org/pdf/2411.18895
Copy Paste: [[2411.18895]] Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks(https://arxiv.org/abs/2411.18895)
Keywords: interpretability
Abstract: Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics, with prior work largely relying on unsupervised proxies. In this work, we introduce a family of evaluations based on SHIFT, a downstream task from Marks et al. (Sparse Feature Circuits, 2024) in which spurious cues are removed from a classifier by ablating SAE features judged to be task-irrelevant by a human annotator. We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM. Additionally, we introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE's ability to disentangle similar concepts, effectively scaling SHIFT to a wider range of datasets. We apply both SHIFT and TPP to multiple open-source models, demonstrating that these metrics effectively differentiate between various SAE training hyperparameters and architectures.

Title: Textured As-Is BIM via GIS-informed Point Cloud Segmentation

Authors: Mohamed S. H. Alabassy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18898
Pdf URL: https://arxiv.org/pdf/2411.18898
Copy Paste: [[2411.18898]] Textured As-Is BIM via GIS-informed Point Cloud Segmentation(https://arxiv.org/abs/2411.18898)
Keywords: segmentation
Abstract: Creating as-is models from scratch is to this day still a time- and money-consuming task due to its high manual effort. Therefore, projects, especially those with a big spatial extent, could profit from automating the process of creating semantically rich 3D geometries from surveying data such as Point Cloud Data (PCD). An automation can be achieved by using Machine and Deep Learning Models for object recognition and semantic segmentation of PCD. As PCDs do not usually include more than the mere position and RGB colour values of points, tapping into semantically enriched Geoinformation System (GIS) data can be used to enhance the process of creating meaningful as-is models. This paper presents a methodology, an implementation framework and a proof of concept for the automated generation of GIS-informed and BIM-ready as-is Building Information Models (BIM) for railway projects. The results show a high potential for cost savings and reveal the unemployed resources of freely accessible GIS data within.

Title: FedRGL: Robust Federated Graph Learning for Label Noise

Authors: De Li, Haodong Qian, Qiyu Li, Zhou Tan, Zemin Gan, Jinyan Wang, Xianxian Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18905
Pdf URL: https://arxiv.org/pdf/2411.18905
Copy Paste: [[2411.18905]] FedRGL: Robust Federated Graph Learning for Label Noise(https://arxiv.org/abs/2411.18905)
Keywords: secure, robust, federate, noise learning
Abstract: Federated Graph Learning (FGL) is a distributed machine learning paradigm based on graph neural networks, enabling secure and collaborative modeling of local graph data among clients. However, label noise can degrade the global model's generalization performance. Existing federated label noise learning methods, primarily focused on computer vision, often yield suboptimal results when applied to FGL. To address this, we propose a robust federated graph learning method with label noise, termed FedRGL. FedRGL introduces dual-perspective consistency noise node filtering, leveraging both the global model and subgraph structure under class-aware dynamic thresholds. To enhance client-side training, we incorporate graph contrastive learning, which improves encoder robustness and assigns high-confidence pseudo-labels to noisy nodes. Additionally, we measure model quality via predictive entropy of unlabeled nodes, enabling adaptive robust aggregation of the global model. Comparative experiments on multiple real-world graph datasets show that FedRGL outperforms 12 baseline methods across various noise rates, types, and numbers of clients.

Title: MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications

Authors: Vishnou Vinayagame, Gregory Senay, Luis Martí
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.18915
Pdf URL: https://arxiv.org/pdf/2411.18915
Copy Paste: [[2411.18915]] MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications(https://arxiv.org/abs/2411.18915)
Keywords: privacy, robust
Abstract: Mathematical reasoning capabilities are increasing with tool-augmented language agents, but methods often rely either on closed-source or large models, external data, or extensive prompt engineering. This work introduces MATATA, a novel cost-effective method to train LLM agents for tabular data problems through reasoning, planning, and tool use. With a progressive self-improvement paradigm and an iterative weak supervision, it empowers 3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and sensitive business contexts where data privacy is crucial. By employing a flexible and reusable tools across different datasets, it achieves robust performance with effective scalability across shared tasks. Experiments show that MATATA reaches state-of-the-art performances on FinQA and TAT-QA among reasoning frameworks based on open-source models. Moreover, MATATA models compete with GPT-4 based frameworks on TabMWP, while being SLMs.

Title: Federated Continual Graph Learning

Authors: Yinlin Zhu, Xunkai Li, Miao Hu, Di Wu
Subjects: cs.LG, cs.AI, cs.DB, cs.SI
Abstract URL: https://arxiv.org/abs/2411.18919
Pdf URL: https://arxiv.org/pdf/2411.18919
Copy Paste: [[2411.18919]] Federated Continual Graph Learning(https://arxiv.org/abs/2411.18919)
Keywords: privacy, federate
Abstract: In the era of big data, managing evolving graph data poses substantial challenges due to storage costs and privacy issues. Training graph neural networks (GNNs) on such evolving data usually causes catastrophic forgetting, impairing performance on earlier tasks. Despite existing continual graph learning (CGL) methods mitigating this to some extent, they predominantly operate in centralized architectures and overlook the potential of distributed graph databases to harness collective intelligence for enhanced performance optimization. To address these challenges, we present a pioneering study on Federated Continual Graph Learning (FCGL), which adapts GNNs to multiple evolving graphs within decentralized settings while adhering to storage and privacy constraints. Our work begins with a comprehensive empirical analysis of FCGL, assessing its data characteristics, feasibility, and effectiveness, and reveals two principal challenges: local graph forgetting (LGF), where local GNNs forget prior knowledge when adapting to new tasks, and global expertise conflict (GEC), where the global GNN exhibits sub-optimal performance in both adapting to new tasks and retaining old ones, arising from inconsistent client expertise during server-side parameter aggregation. To tackle these, we propose the POWER framework, which mitigates LGF by preserving and replaying experience nodes with maximum local-global coverage at each client and addresses GEC by using a pseudo prototype reconstruction strategy and trajectory-aware knowledge transfer at the central server. Extensive evaluations across multiple graph datasets demonstrate POWER's superior performance over straightforward federated extensions of the centralized CGL algorithms and vision-focused federated continual learning algorithms. Our code is available at this https URL.

Title: Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease

Authors: Junan Li, Yunxiang Li, Yuren Wang, Xixin Wu, Helen Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18922
Pdf URL: https://arxiv.org/pdf/2411.18922
Copy Paste: [[2411.18922]] Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease(https://arxiv.org/abs/2411.18922)
Keywords: interpretability, large language model
Abstract: Alzheimer's disease (AD) has become one of the most significant health challenges in an aging society. The use of spoken language-based AD detection methods has gained prevalence due to their scalability due to their scalability. Based on the Cookie Theft picture description task, we devised an explainable and effective feature set that leverages the visual capabilities of a large language model (LLM) and the Term Frequency-Inverse Document Frequency (TF-IDF) model. Our experimental results show that the newly proposed features consistently outperform traditional linguistic features across two different classifiers with high dimension efficiency. Our new features can be well explained and interpreted step by step which enhance the interpretability of automatic AD screening.

Title: EzSQL: An SQL intermediate representation for improving SQL-to-text Generation

Authors: Meher Bhardwaj, Hrishikesh Ethari, Dennis Singh Moirangthem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18923
Pdf URL: https://arxiv.org/pdf/2411.18923
Copy Paste: [[2411.18923]] EzSQL: An SQL intermediate representation for improving SQL-to-text Generation(https://arxiv.org/abs/2411.18923)
Keywords: generative
Abstract: The SQL-to-text generation task traditionally uses template base, Seq2Seq, tree-to-sequence, and graph-to-sequence models. Recent models take advantage of pre-trained generative language models for this task in the Seq2Seq framework. However, treating SQL as a sequence of inputs to the pre-trained models is not optimal. In this work, we put forward a new SQL intermediate representation called EzSQL to align SQL with the natural language text sequence. EzSQL simplifies the SQL queries and brings them closer to natural language text by modifying operators and keywords, which can usually be described in natural language. EzSQL also removes the need for set operators. Our proposed SQL-to-text generation model uses EzSQL as the input to a pre-trained generative language model for generating the text descriptions. We demonstrate that our model is an effective state-of-the-art method to generate text narrations from SQL queries on the WikiSQL and Spider datasets. We also show that by generating pretraining data using our SQL-to-text generation model, we can enhance the performance of Text-to-SQL parsers.

Title: Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?

Authors: Adrian Tormos, Blanca Llauradó, Fernando Núñez, Axel Romero, Dario Garcia-Gasulla, Javier Béjar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18926
Pdf URL: https://arxiv.org/pdf/2411.18926
Copy Paste: [[2411.18926]] Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?(https://arxiv.org/abs/2411.18926)
Keywords: diffusion, generative
Abstract: The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that do not guarantee to preserve the original tasks. To approximate the distribution of the data using generative models is a way of reducing that problem and also to obtain new samples that resemble the original data. Denoising Diffusion models is a promising Deep Learning technique that can learn good approximations of different kinds of data like images, time series or tabular data. Automatic colonoscopy analysis and specifically Polyp localization in colonoscopy videos is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is a time consuming task and usually only small datasets can be obtained. The fine tuning of application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can generate jointly colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used on various transfer learning experiments in the task of polyp localization with a model based on YOLO v9 on the low data regime.

Title: VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference

Authors: Sakshi Agarwal, Gabe Hoope, Erik B. Sudderth
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18929
Pdf URL: https://arxiv.org/pdf/2411.18929
Copy Paste: [[2411.18929]] VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference(https://arxiv.org/abs/2411.18929)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models learn to remove noise that is artificially added to the data during training. Novel data, like images, may then be generated from Gaussian noise through a sequence of denoising operations. While this Markov process implicitly defines a joint distribution over noise-free data, it is not simple to condition the generative process on masked or partial images. A number of heuristic sampling procedures have been proposed for solving inverse problems with diffusion priors, but these approaches do not directly approximate the true conditional distribution imposed by inference queries, and are often ineffective for large masked regions. Moreover, many of these baselines cannot be applied to latent diffusion models which use image encodings for efficiency. We instead develop a hierarchical variational inference algorithm that analytically marginalizes missing features, and uses a rigorous variational bound to optimize a non-Gaussian Markov approximation of the true diffusion posterior. Through extensive experiments with both pixel-based and latent diffusion models of images, we show that our VIPaint method significantly outperforms previous approaches in both the plausibility and diversity of imputations, and is easily generalized to other inverse problems like deblurring and superresolution.

Title: Efficient Track Anything

Authors: Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18933
Pdf URL: https://arxiv.org/pdf/2411.18933
Copy Paste: [[2411.18933]] Efficient Track Anything(https://arxiv.org/abs/2411.18933)
Keywords: extraction, transformer, segmentation
Abstract: Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computation complexity of multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with vanilla ViT perform comparably to SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.

Title: Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

Authors: Weimin Qiu, Jieke Wang, Meng Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18936
Pdf URL: https://arxiv.org/pdf/2411.18936
Copy Paste: [[2411.18936]] Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects(https://arxiv.org/abs/2411.18936)
Keywords: diffusion, transformer
Abstract: Diffusion models have achieved unprecedented fidelity and diversity for synthesizing image, video, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our self-cross guidance is more effective in eliminating subject mixing. What's more, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminant one, e.g., beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion.% for synthesizing similar subjects. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.

Title: Rephrasing Electronic Health Records for Pretraining Clinical Language Models

Authors: Jinghui Liu, Anthony Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18940
Pdf URL: https://arxiv.org/pdf/2411.18940
Copy Paste: [[2411.18940]] Rephrasing Electronic Health Records for Pretraining Clinical Language Models(https://arxiv.org/abs/2411.18940)
Keywords: privacy
Abstract: Clinical language models are important for many applications in healthcare, but their development depends on access to extensive clinical text for pretraining. However, obtaining clinical notes from electronic health records (EHRs) at scale is challenging due to patient privacy concerns. In this study, we rephrase existing clinical notes using LLMs to generate synthetic pretraining corpora, drawing inspiration from previous work on rephrasing web data. We examine four popular small-sized LLMs (<10B) to create synthetic clinical text to pretrain both decoder-based and encoder-based language models. The method yields better results in language modeling and downstream tasks than previous synthesis approaches without referencing real clinical text. We find that augmenting original clinical notes with synthetic corpora from different LLMs improves performances even at a small token budget, showing the potential of this method to support pretraining at the institutional level or be scaled to synthesize large-scale clinical corpora.

Title: Waterfall Transformer for Multi-person Pose Estimation

Authors: Navin Ranjan, Bruno Artacho, Andreas Savakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18944
Pdf URL: https://arxiv.org/pdf/2411.18944
Copy Paste: [[2411.18944]] Waterfall Transformer for Multi-person Pose Estimation(https://arxiv.org/abs/2411.18944)
Keywords: transformer
Abstract: We propose the Waterfall Transformer architecture for Pose estimation (WTPose), a single-pass, end-to-end trainable framework designed for multi-person pose estimation. Our framework leverages a transformer-based waterfall module that generates multi-scale feature maps from various backbone stages. The module performs filtering in the cascade architecture to expand the receptive fields and to capture local and global context, therefore increasing the overall feature representation capability of the network. Our experiments on the COCO dataset demonstrate that the proposed WTPose architecture, with a modified Swin backbone and transformer-based waterfall module, outperforms other transformer architectures for multi-person pose estimation

Title: ICLERB: In-Context Learning Embedding and Reranker Benchmark

Authors: Marie Al Ghossein, Emile Contal, Alexandre Robicquet
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2411.18947
Pdf URL: https://arxiv.org/pdf/2411.18947
Copy Paste: [[2411.18947]] ICLERB: In-Context Learning Embedding and Reranker Benchmark(https://arxiv.org/abs/2411.18947)
Keywords: large language model
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new tasks by conditioning on prompts with relevant information. Retrieval-Augmented Generation (RAG) enhances ICL by incorporating retrieved documents into the LLM's context at query time. However, traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. In this paper, we propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks. We introduce the In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel evaluation framework that compares retrievers based on their ability to enhance LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune retrieval models using minimal feedback from the LLM. Our experimental results reveal notable differences between ICLERB and existing benchmarks, and demonstrate that small models fine-tuned with our RLRAIF algorithm outperform large state-of-the-art retrieval models. These findings highlight the limitations of existing evaluation methods and the need for specialized benchmarks and training strategies adapted to ICL.

Title: Knowledge Database or Poison Base? Detecting RAG Poisoning Attack through LLM Activations

Authors: Xue Tan, Hao Luan, Mingyu Luo, Xiaoyan Sun, Ping Chen, Jun Dai
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18948
Pdf URL: https://arxiv.org/pdf/2411.18948
Copy Paste: [[2411.18948]] Knowledge Database or Poison Base? Detecting RAG Poisoning Attack through LLM Activations(https://arxiv.org/abs/2411.18948)
Keywords: security, attack, robust, large language model
Abstract: As Large Language Models (LLMs) are progressively deployed across diverse fields and real-world applications, ensuring the security and robustness of LLMs has become ever more critical. Retrieval-Augmented Generation (RAG) is a cutting-edge approach designed to address the limitations of large language models (LLMs). By retrieving information from the relevant knowledge database, RAG enriches the input to LLMs, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%. We also evaluate recent backdoor detection methods specifically designed for LLMs and applicable for identifying poisoned responses in RAG. The results demonstrate that our approach significantly surpasses them.

Title: Random Sampling for Diffusion-based Adversarial Purification

Authors: Jiancheng Zhang, Peiran Dong, Yongyong Chen, Yin-Ping Zhao, Song Guo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18956
Pdf URL: https://arxiv.org/pdf/2411.18956
Copy Paste: [[2411.18956]] Random Sampling for Diffusion-based Adversarial Purification(https://arxiv.org/abs/2411.18956)
Keywords: attack, robust, diffusion
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have gained great attention in adversarial purification. Current diffusion-based works focus on designing effective condition-guided mechanisms while ignoring a fundamental problem, i.e., the original DDPM sampling is intended for stable generation, which may not be the optimal solution for adversarial purification. Inspired by the stability of the Denoising Diffusion Implicit Model (DDIM), we propose an opposite sampling scheme called random sampling. In brief, random sampling will sample from a random noisy space during each diffusion process, while DDPM and DDIM sampling will continuously sample from the adjacent or original noisy space. Thus, random sampling obtains more randomness and achieves stronger robustness against adversarial attacks. Correspondingly, we also introduce a novel mediator conditional guidance to guarantee the consistency of the prediction under the purified image and clean image input. To expand awareness of guided diffusion purification, we conduct a detailed evaluation with different sampling methods and our random sampling achieves an impressive improvement in multiple settings. Leveraging mediator-guided random sampling, we also establish a baseline method named DiffAP, which significantly outperforms state-of-the-art (SOTA) approaches in performance and defensive stability. Remarkably, under strong attack, our DiffAP even achieves a more than 20% robustness advantage with 10$\times$ sampling acceleration.

Title: Perception of Visual Content: Differences Between Humans and Foundation Models

Authors: Nardiena A. Pratama, Shaoyang Fan, Gianluca Demartini
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18968
Pdf URL: https://arxiv.org/pdf/2411.18968
Copy Paste: [[2411.18968]] Perception of Visual Content: Differences Between Humans and Foundation Models(https://arxiv.org/abs/2411.18968)
Keywords: fair
Abstract: Human-annotated content is often used to train machine learning (ML) models. However, recently, language and multi-modal foundational models have been used to replace and scale-up human annotator's efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels washing their hands. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show low similarity between human and machine annotations from a low-level perspective, i.e., types of words that appear and sentence structures, but are alike in how similar or dissimilar they perceive images across different regions. Additionally, human annotations resulted in best overall and most balanced region classification performance on the class level, while ML Objects and ML Captions performed best for income regression. Humans and machines' similarity in their lack of bias when perceiving images highlights how they are more alike than what was initially perceived. The superior and fairer performance of using human annotations for region classification and machine annotations for income regression show how important the quality of the images and the discriminative features in the annotations are.

Title: Det-SAM2:Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2

Authors: Zhiting Wang, Qiangong Zhou, Zongyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18977
Pdf URL: https://arxiv.org/pdf/2411.18977
Copy Paste: [[2411.18977]] Det-SAM2:Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2(https://arxiv.org/abs/2411.18977)
Keywords: segmentation
Abstract: Segment Anything Model 2 (SAM2) demonstrates exceptional performance in video segmentation and refinement of segmentation results. We anticipate that it can further evolve to achieve higher levels of automation for practical applications. Building upon SAM2, we conducted a series of practices that ultimately led to the development of a fully automated pipeline, termed Det-SAM2, in which object prompts are automatically generated by a detection model to facilitate inference and refinement by SAM2. This pipeline enables inference on infinitely long video streams with constant VRAM and RAM usage, all while preserving the same efficiency and accuracy as the original SAM2. This technical report focuses on the construction of the overall Det-SAM2 framework and the subsequent engineering optimization applied to SAM2. We present a case demonstrating an application built on the Det-SAM2 framework: AI refereeing in a billiards scenario, derived from our business context. The project at \url{this https URL}.

Title: Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems

Authors: Mansi Rana, Kadri Hacioglu, Sindhuja Gopalan, Maragathamani Boothalingam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18980
Pdf URL: https://arxiv.org/pdf/2411.18980
Copy Paste: [[2411.18980]] Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems(https://arxiv.org/abs/2411.18980)
Keywords: large language model
Abstract: Zero-shot slot filling is a well-established subtask of Natural Language Understanding (NLU). However, most existing methods primarily focus on single-turn text data, overlooking the unique complexities of conversational dialogue. Conversational data is highly dynamic, often involving abrupt topic shifts, interruptions, and implicit references that make it difficult to directly apply zero-shot slot filling techniques, even with the remarkable capabilities of large language models (LLMs). This paper addresses these challenges by proposing strategies for automatic data annotation with slot induction and black-box knowledge distillation (KD) from a teacher LLM to a smaller model, outperforming vanilla LLMs on internal datasets by 26% absolute increase in F1 score. Additionally, we introduce an efficient system architecture for call center product settings that surpasses off-the-shelf extractive models by 34% relative F1 score, enabling near real-time inference on dialogue streams with higher accuracy, while preserving low latency.

Title: SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing

Authors: Rong-Cheng Tu, Wenhao Sun, Zhao Jin, Jingyi Liao, Jiaxing Huang, Dacheng Tao
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2411.18983
Pdf URL: https://arxiv.org/pdf/2411.18983
Copy Paste: [[2411.18983]] SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing(https://arxiv.org/abs/2411.18983)
Keywords: generative
Abstract: While open-source video generation and editing models have made significant progress, individual models are typically limited to specific tasks, failing to meet the diverse needs of users. Effectively coordinating these models can unlock a wide range of video generation and editing capabilities. However, manual coordination is complex and time-consuming, requiring users to deeply understand task requirements and possess comprehensive knowledge of each model's performance, applicability, and limitations, thereby increasing the barrier to entry. To address these challenges, we propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models, enhancing the adaptability, efficiency, and overall quality of video generation and editing. Specifically, the SPAgent assembles a tool library integrating state-of-the-art open-source image and video generation and editing models as tools. After fine-tuning on our manually annotated dataset, SPAgent can automatically coordinate the tools for video generation and editing, through our novelly designed three-step framework: (1) decoupled intent recognition, (2) principle-guided route planning, and (3) capability-based execution model selection. Additionally, we enhance the SPAgent's video quality evaluation capability, enabling it to autonomously assess and incorporate new video generation and editing models into its tool library without human intervention. Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos, highlighting its versatility and adaptability across various video tasks.

Title: Harden Deep Neural Networks Against Fault Injections Through Weight Scaling

Authors: Ninnart Fuengfusin, Hakaru Tamukoh
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.18993
Pdf URL: https://arxiv.org/pdf/2411.18993
Copy Paste: [[2411.18993]] Harden Deep Neural Networks Against Fault Injections Through Weight Scaling(https://arxiv.org/abs/2411.18993)
Keywords: protect
Abstract: Deep neural networks (DNNs) have enabled smart applications on hardware devices. However, these hardware devices are vulnerable to unintended faults caused by aging, temperature variance, and write errors. These faults can cause bit-flips in DNN weights and significantly degrade the performance of DNNs. Thus, protection against these faults is crucial for the deployment of DNNs in critical applications. Previous works have proposed error correction codes based methods, however these methods often require high overheads in both memory and computation. In this paper, we propose a simple yet effective method to harden DNN weights by multiplying weights by constants before storing them to fault-prone medium. When used, these weights are divided back by the same constants to restore the original scale. Our method is based on the observation that errors from bit-flips have properties similar to additive noise, therefore by dividing by constants can reduce the absolute error from bit-flips. To demonstrate our method, we conduct experiments across four ImageNet 2012 pre-trained models along with three different data types: 32-bit floating point, 16-bit floating point, and 8-bit fixed point. This method demonstrates that by only multiplying weights with constants, Top-1 Accuracy of 8-bit fixed point ResNet50 is improved by 54.418 at bit-error rate of 0.0001.

Title: MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Authors: Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18995
Pdf URL: https://arxiv.org/pdf/2411.18995
Copy Paste: [[2411.18995]] MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers(https://arxiv.org/abs/2411.18995)
Keywords: transformer, segmentation
Abstract: Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features via batch, layer, and instance normalization using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring various receptive fields for the token mixer at each stage, efficiently capturing ranges of visual patterns. We propose a novel ViT model, multi-vision transformer (MVFormer), adopting the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state-of-the-art convolution-based ViTs on image classification, object detection, and instance and semantic segmentation with the same or lower parameters and MACs. Particularly, MVFormer variants, MVFormer-T, S, and B achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on ImageNet-1K benchmark.

Title: Presenting a new approach in security in inter-vehicle networks (VANET)

Authors: Davoud Yousefi, Farhang Farhad, Mehran Abed, Soheil Gavidel
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.19002
Pdf URL: https://arxiv.org/pdf/2411.19002
Copy Paste: [[2411.19002]] Presenting a new approach in security in inter-vehicle networks (VANET)(https://arxiv.org/abs/2411.19002)
Keywords: secure, security
Abstract: Nowadays, inter-vehicle networks are a viable communication scenario that greatly contributes to daily work, and its issues are gaining more and more attention every day. These days, space networks are growing and developing. There are numerous new uses for this new kind of network communication. One of the most significant daily programs in the world today is road traffic. For human growth, passenger and freight transportation is essential. Thus, fresh advancements in the areas of improved safety features, environmentally friendly fuel, etc., are developed daily. In order to improve safety and regulate traffic, a new application program is used. However, because of their stringent security standards, these initiatives have an impact on traffic safety. Since driving is one of the things that necessitates traffic safety, this area needs to be made more secure. Providing trustworthy driving data is crucial to achieving this goal, aside from the automated portion of the operation. Drivers would greatly benefit from accurate weather descriptions or early warnings of potential dangers (such as traffic bottlenecks or accidents). Inter-vehicle networks, a novel form of information technology, are being developed for this reason. Keywords: inter-vehicle network, transportation and security

Title: Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement

Authors: Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis, yman Elgharabawy, Ahmad Liaqat, Usman Ali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19005
Pdf URL: https://arxiv.org/pdf/2411.19005
Copy Paste: [[2411.19005]] Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement(https://arxiv.org/abs/2411.19005)
Keywords: robust, generative
Abstract: This paper presents a novel deep-learning framework that significantly enhances the transformation of rudimentary face sketches into high-fidelity colour images. Employing a Convolutional Block Attention-based Auto-encoder Network (CA2N), our approach effectively captures and enhances critical facial features through a block attention mechanism within an encoder-decoder architecture. Subsequently, the framework utilises a noise-induced conditional Generative Adversarial Network (cGAN) process that allows the system to maintain high performance even on domains unseen during the training. These enhancements lead to considerable improvements in image realism and fidelity, with our model achieving superior performance metrics that outperform the best method by FID margin of 17, 23, and 38 on CelebAMask-HQ, CUHK, and CUFSF datasets; respectively. The model sets a new state-of-the-art in sketch-to-image generation, can generalize across sketch types, and offers a robust solution for applications such as criminal identification in law enforcement.

Title: Pilot Contamination Aware Transformer for Downlink Power Control in Cell-Free Massive MIMO Networks

Authors: Atchutaram K. Kocharlakota, Sergiy A. Vorobyov, Robert W. Heath Jr
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2411.19020
Pdf URL: https://arxiv.org/pdf/2411.19020
Copy Paste: [[2411.19020]] Pilot Contamination Aware Transformer for Downlink Power Control in Cell-Free Massive MIMO Networks(https://arxiv.org/abs/2411.19020)
Keywords: extraction, fair, transformer
Abstract: Learning-based downlink power control in cell-free massive multiple-input multiple-output (CFmMIMO) systems offers a promising alternative to conventional iterative optimization algorithms, which are computationally intensive due to online iterative steps. Existing learning-based methods, however, often fail to exploit the intrinsic structure of channel data and neglect pilot allocation information, leading to suboptimal performance, especially in large-scale networks with many users. This paper introduces the pilot contamination-aware power control (PAPC) transformer neural network, a novel approach that integrates pilot allocation data into the network, effectively handling pilot contamination scenarios. PAPC employs the attention mechanism with a custom masking technique to utilize structural information and pilot data. The architecture includes tailored preprocessing and post-processing stages for efficient feature extraction and adherence to power constraints. Trained in an unsupervised learning framework, PAPC is evaluated against the accelerated proximal gradient (APG) algorithm, showing comparable spectral efficiency fairness performance while significantly improving computational efficiency. Simulations demonstrate PAPC's superior performance over fully connected networks (FCNs) that lack pilot information, its scalability to large-scale CFmMIMO networks, and its computational efficiency improvement over APG. Additionally, by employing padding techniques, PAPC adapts to the dynamically varying number of users without retraining.

Title: Enhancing Neural Network Robustness Against Fault Injection Through Non-linear Weight Transformations

Authors: Ninnart Fuengfusin, Hakaru Tamukoh
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19027
Pdf URL: https://arxiv.org/pdf/2411.19027
Copy Paste: [[2411.19027]] Enhancing Neural Network Robustness Against Fault Injection Through Non-linear Weight Transformations(https://arxiv.org/abs/2411.19027)
Keywords: protect, robust
Abstract: Deploying deep neural networks (DNNs) in real-world environments poses challenges due to faults that can manifest in physical hardware from radiation, aging, and temperature fluctuations. To address this, previous works have focused on protecting DNNs via activation range restriction using clipped ReLU and finding the optimal clipping threshold. However, this work instead focuses on constraining DNN weights by applying saturated activation functions (SAFs): Tanh, Arctan, and others. SAFs prevent faults from causing DNN weights to become excessively large, which can lead to model failure. These methods not only enhance the robustness of DNNs against fault injections but also improve DNN performance by a small margin. Before deployment, DNNs are trained with weights constrained by SAFs. During deployment, the weights without applied SAF are written to mediums with faults. When read, weights with faults are applied with SAFs and are used for inference. We demonstrate our proposed method across three datasets (CIFAR10, CIFAR100, ImageNet 2012) and across three datatypes (32-bit floating point (FP32), 16-bit floating point, and 8-bit fixed point). We show that our method enables FP32 ResNet18 with ImageNet 2012 to operate at a bit-error rate of 0.00001 with minor accuracy loss, while without the proposed method, the FP32 DNN only produces random guesses. Furthermore, to accelerate the training process, we demonstrate that an ImageNet 2012 pre-trained ResNet18 can be adapted to SAF by training for a few epochs with a slight improvement in Top-1 accuracy while still ensuring robustness against fault injection.

Title: PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors

Authors: Guangshun Wei, Yuan Feng, Long Ma, Chen Wang, Yuanfeng Zhou, Changjian Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19036
Pdf URL: https://arxiv.org/pdf/2411.19036
Copy Paste: [[2411.19036]] PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors(https://arxiv.org/abs/2411.19036)
Keywords: diffusion
Abstract: This paper presents PCDreamer, a novel method for point cloud completion. Traditional methods typically extract features from partial point clouds to predict missing regions, but the large solution space often leads to unsatisfactory results. More recent approaches have started to use images as extra guidance, effectively improving performance, but obtaining paired data of images and partial point clouds is challenging in practice. To overcome these limitations, we harness the relatively view-consistent multi-view diffusion priors within large models, to generate novel views of the desired shape. The resulting image set encodes both global and local shape cues, which is especially beneficial for shape completion. To fully exploit the priors, we have designed a shape fusion module for producing an initial complete shape from multi-modality input (\ie, images and point clouds), and a follow-up shape consolidation module to obtain the final complete shape by discarding unreliable points introduced by the inconsistency from diffusion priors. Extensive experimental results demonstrate our superior performance, especially in recovering fine details.

Title: 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes

Authors: Tejaswini Medi, Arianna Rampini, Pradyumna Reddy, Pradeep Kumar Jayaraman, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19037
Pdf URL: https://arxiv.org/pdf/2411.19037
Copy Paste: [[2411.19037]] 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes(https://arxiv.org/abs/2411.19037)
Keywords: diffusion, transformer, generative
Abstract: Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on ``next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional shape generation, class-conditioned and also text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the ``next higher-resolution token map" in an autoregressive manner. By redefining 3D AR generation task as ``next-scale" prediction, we reduce the computational cost of generation compared to traditional ``next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its benefit by quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.

Title: DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

Authors: Ben Ganon, Alon Zolfi, Omer Hofman, Inderjeet Singh, Hisashi Kojima, Yuval Elovici, Asaf Shabtai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19038
Pdf URL: https://arxiv.org/pdf/2411.19038
Copy Paste: [[2411.19038]] DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs(https://arxiv.org/abs/2411.19038)
Keywords: defense, large language model
Abstract: In recent years, conversational large language models (LLMs) have shown tremendous success in tasks such as casual conversation, question answering, and personalized dialogue, making significant advancements in domains like virtual assistance, social interaction, and online customer engagement. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety, or social norms), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. This approach provides an efficient and effective solution for maintaining alignment with human values. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models (e.g., Llama 3), even in challenging jailbreaking scenarios that test the limits of response safety. We further show that DIESEL can be generalized to use cases other than safety, providing a versatile solution for general-purpose response filtering with minimal computational overhead.

Title: I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

Authors: Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19050
Pdf URL: https://arxiv.org/pdf/2411.19050
Copy Paste: [[2411.19050]] I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting(https://arxiv.org/abs/2411.19050)
Keywords: diffusion
Abstract: Inpainting focuses on filling missing or corrupted regions of an image to blend seamlessly with its surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at this https URL.

Title: Way to Specialist: Closing Loop Between Specialized LLM and Evolving Domain Knowledge Graph

Authors: Yutong Zhang, Lixing Chen, Shenghong Li, Nan Cao, Yang Shi, Jiaxin Ding, Zhe Qu, Pan Zhou, Yang Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19064
Pdf URL: https://arxiv.org/pdf/2411.19064
Copy Paste: [[2411.19064]] Way to Specialist: Closing Loop Between Specialized LLM and Evolving Domain Knowledge Graph(https://arxiv.org/abs/2411.19064)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated exceptional performance across a wide variety of domains. Nonetheless, generalist LLMs continue to fall short in reasoning tasks necessitating specialized knowledge. Prior investigations into specialized LLMs focused on domain-specific training, which entails substantial efforts in domain data acquisition and model parameter fine-tuning. To address these challenges, this paper proposes the Way-to-Specialist (WTS) framework, which synergizes retrieval-augmented generation with knowledge graphs (KGs) to enhance the specialized capability of LLMs in the absence of specialized training. In distinction to existing paradigms that merely utilize external knowledge from general KGs or static domain KGs to prompt LLM for enhanced domain-specific reasoning, WTS proposes an innovative "LLM$\circlearrowright$KG" paradigm, which achieves bidirectional enhancement between specialized LLM and domain knowledge graph (DKG). The proposed paradigm encompasses two closely coupled components: the DKG-Augmented LLM and the LLM-Assisted DKG Evolution. The former retrieves question-relevant domain knowledge from DKG and uses it to prompt LLM to enhance the reasoning capability for domain-specific tasks; the latter leverages LLM to generate new domain knowledge from processed tasks and use it to evolve DKG. WTS closes the loop between DKG-Augmented LLM and LLM-Assisted DKG Evolution, enabling continuous improvement in the domain specialization as it progressively answers and learns from domain-specific questions. We validate the performance of WTS on 6 datasets spanning 5 domains. The experimental results show that WTS surpasses the previous SOTA in 4 specialized domains and achieves a maximum performance improvement of 11.3%.

Title: MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Authors: Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19067
Pdf URL: https://arxiv.org/pdf/2411.19067
Copy Paste: [[2411.19067]] MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation(https://arxiv.org/abs/2411.19067)
Keywords: robust, segmentation
Abstract: Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, exploring training techniques, such as data augmentation, remains underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that the conventional image augmentations fall short of RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at this https URL.

Title: LADDER: Multi-objective Backdoor Attack via Evolutionary Algorithm

Authors: Dazhuang Liu, Yanqi Qiao, Rui Wang, Kaitai Liang, Georgios Smaragdakis
Subjects: cs.CR, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2411.19075
Pdf URL: https://arxiv.org/pdf/2411.19075
Copy Paste: [[2411.19075]] LADDER: Multi-objective Backdoor Attack via Evolutionary Algorithm(https://arxiv.org/abs/2411.19075)
Keywords: attack, robust, steal
Abstract: Current black-box backdoor attacks in convolutional neural networks formulate attack objective(s) as single-objective optimization problems in single domain. Designing triggers in single domain harms semantics and trigger robustness as well as introduces visual and spectral anomaly. This work proposes a multi-objective black-box backdoor attack in dual domains via evolutionary algorithm (LADDER), the first instance of achieving multiple attack objectives simultaneously by optimizing triggers without requiring prior knowledge about victim model. In particular, we formulate LADDER as a multi-objective optimization problem (MOP) and solve it via multi-objective evolutionary algorithm (MOEA). MOEA maintains a population of triggers with trade-offs among attack objectives and uses non-dominated sort to drive triggers toward optimal solutions. We further apply preference-based selection to MOEA to exclude impractical triggers. We state that LADDER investigates a new dual-domain perspective for trigger stealthiness by minimizing the anomaly between clean and poisoned samples in the spectral domain. Lastly, the robustness against preprocessing operations is achieved by pushing triggers to low-frequency regions. Extensive experiments comprehensively showcase that LADDER achieves attack effectiveness of at least 99%, attack robustness with 90.23% (50.09% higher than state-of-the-art attacks on average), superior natural stealthiness (1.12x to 196.74x improvement) and excellent spectral stealthiness (8.45x enhancement) as compared to current stealthy attacks by the average $l_2$-norm across 5 public datasets.

Title: 360Recon: An Accurate Reconstruction Method Based on Depth Fusion from 360 Images

Authors: Zhongmiao Yan, Qi Wu, Songpengcheng Xia, Junyuan Deng, Xiang Mu, Renbiao Jin, Ling Pei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19102
Pdf URL: https://arxiv.org/pdf/2411.19102
Copy Paste: [[2411.19102]] 360Recon: An Accurate Reconstruction Method Based on Depth Fusion from 360 Images(https://arxiv.org/abs/2411.19102)
Keywords: extraction
Abstract: 360-degree images offer a significantly wider field of view compared to traditional pinhole cameras, enabling sparse sampling and dense 3D reconstruction in low-texture environments. This makes them crucial for applications in VR, AR, and related fields. However, the inherent distortion caused by the wide field of view affects feature extraction and matching, leading to geometric consistency issues in subsequent multi-view reconstruction. In this work, we propose 360Recon, an innovative MVS algorithm for ERP images. The proposed spherical feature extraction module effectively mitigates distortion effects, and by combining the constructed 3D cost volume with multi-scale enhanced features from ERP images, our approach achieves high-precision scene reconstruction while preserving local geometric consistency. Experimental results demonstrate that 360Recon achieves state-of-the-art performance and high efficiency in depth estimation and 3D reconstruction on existing public panoramic reconstruction datasets.

Title: Detailed Object Description with Controllable Dimensions

Authors: Xinran Wang, Haiwen Zhang, Baoteng Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19106
Pdf URL: https://arxiv.org/pdf/2411.19106
Copy Paste: [[2411.19106]] Detailed Object Description with Controllable Dimensions(https://arxiv.org/abs/2411.19106)
Keywords: large language model
Abstract: Object description plays an important role for visually impaired individuals to understand and compare the differences between objects. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and demonstrate impressive potential for generating object-centric captions. However, the descriptions generated by such models may still usually contain a lot of content that is not relevant to the user intent. Under special scenarios, users may only need the details of certain dimensions of an object. In this paper, we propose a training-free captioning refinement pipeline, \textbf{Dimension Tailor}, designed to enhance user-specified details in object descriptions. This pipeline includes three steps: dimension extracting, erasing, and supplementing, which decompose the description into pre-defined dimensions and correspond to user intent. Therefore, it can not only improve the quality of object details but also offer flexibility in including or excluding specific dimensions based on user preferences. We conducted extensive experiments to demonstrate the effectiveness of Dimension Tailor on controllable object descriptions. Notably, the proposed pipeline can consistently improve the performance of the recent MLLMs. The code is currently accessible at the following anonymous link: \url{this https URL}.

Title: Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Authors: Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, Fang Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19108
Pdf URL: https://arxiv.org/pdf/2411.19108
Copy Paste: [[2411.19108]] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model(https://arxiv.org/abs/2411.19108)
Keywords: diffusion
Abstract: As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the modeloutputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure their differences better approximating those of model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.

Title: Integration of Contextual Descriptors in Ontology Alignment for Enrichment of Semantic Correspondence

Authors: Eduard Manziuk, Oleksander Barmak, Pavlo Radiuk, Vladislav Kuznetsov, Iurii Krak, Sergiy Yakovlev
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2411.19113
Pdf URL: https://arxiv.org/pdf/2411.19113
Copy Paste: [[2411.19113]] Integration of Contextual Descriptors in Ontology Alignment for Enrichment of Semantic Correspondence(https://arxiv.org/abs/2411.19113)
Keywords: privacy
Abstract: This paper proposes a novel approach to semantic ontology alignment using contextual descriptors. A formalization was developed that enables the integration of essential and contextual descriptors to create a comprehensive knowledge model. The hierarchical structure of the semantic approach and the mathematical apparatus for analyzing potential conflicts between concepts, particularly in the example of "Transparency" and "Privacy" in the context of artificial intelligence, are demonstrated. Experimental studies showed a significant improvement in ontology alignment metrics after the implementation of contextual descriptors, especially in the areas of privacy, responsibility, and freedom & autonomy. The application of contextual descriptors achieved an average overall improvement of approximately 4.36%. The results indicate the effectiveness of the proposed approach for more accurately reflecting the complexity of knowledge and its contextual dependence.

Title: Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models

Authors: Chung-Ting Tsai, Ching-Yun Ko, I-Hsin Chung, Yu-Chiang Frank Wang, Pin-Yu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19117
Pdf URL: https://arxiv.org/pdf/2411.19117
Copy Paste: [[2411.19117]] Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models(https://arxiv.org/abs/2411.19117)
Keywords: robust, extraction, generative
Abstract: The rapid advancement of generative models has introduced serious risks, including deepfake techniques for facial synthesis and editing. Traditional approaches rely on training classifiers and enhancing generalizability through various feature extraction techniques. Meanwhile, training-free detection methods address issues like limited data and overfitting by directly leveraging statistical properties from vision foundation models to distinguish between real and fake images. The current leading training-free approach, RIGID, utilizes DINOv2 sensitivity to perturbations in image space for detecting fake images, with fake image embeddings exhibiting greater sensitivity than those of real images. This observation prompts us to investigate how detection performance varies across model backbones, perturbation types, and datasets. Our experiments reveal that detection performance is closely linked to model robustness, with self-supervised (SSL) models providing more reliable representations. While Gaussian noise effectively detects general objects, it performs worse on facial images, whereas Gaussian blur is more effective due to potential frequency artifacts. To further improve detection, we introduce Contrastive Blur, which enhances performance on facial images, and MINDER (MINimum distance DetEctoR), which addresses noise type bias, balancing performance across domains. Beyond performance gains, our work offers valuable insights for both the generative and detection communities, contributing to a deeper understanding of model robustness property utilized for deepfake detection.

Title: MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation

Authors: Daewon Yoon, Hyungsuk Lee, Wonsik Shin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19121
Pdf URL: https://arxiv.org/pdf/2411.19121
Copy Paste: [[2411.19121]] MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation(https://arxiv.org/abs/2411.19121)
Keywords: diffusion
Abstract: This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Additionally, in video generation, unlike single images, the movement of characters across frames introduces potential issues like distortion or unintended changes, which must be effectively evaluated and corrected. In the context of probabilistic models like diffusion, generating the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities. This approach allows for the generation of high-quality multi-scene videos by selecting the best outcomes based on automated scoring rather than manual inspection.

Title: A Comparative Analysis of Vulnerability Management Tools: Evaluating Nessus, Acunetix, and Nikto for Risk Based Security Solutions

Authors: Swetha B, Susmitha NRK, Thirulogaveni J, Sruthi S
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.19123
Pdf URL: https://arxiv.org/pdf/2411.19123
Copy Paste: [[2411.19123]] A Comparative Analysis of Vulnerability Management Tools: Evaluating Nessus, Acunetix, and Nikto for Risk Based Security Solutions(https://arxiv.org/abs/2411.19123)
Keywords: security
Abstract: The evolving threat landscape in cybersecurity necessitates the adoption of advanced tools for effective vulnerability management. This paper presents a comprehensive comparative analysis of three widely used tools: Nessus, Acunetix, and Nikto. Each tool is assessed based on its detection accuracy, risk scoring using the Common Vulnerability Scoring System (CVSS), ease of use, automation and reporting capabilities, performance metrics, and cost effectiveness. The research addresses the challenges faced by organizations in selecting the most suitable tool for their unique security requirements.

Title: Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures

Authors: Yicheng Zhang, Zhen Qin, Zhaomin Wu, Shuiguang Deng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19128
Pdf URL: https://arxiv.org/pdf/2411.19128
Copy Paste: [[2411.19128]] Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures(https://arxiv.org/abs/2411.19128)
Keywords: federate, large language model
Abstract: A large amount of instructional text data is essential to enhance the performance of pre-trained large language models (LLMs) for downstream tasks. This data can contain sensitive information and therefore cannot be shared in practice, resulting in data silos that limit the effectiveness of LLMs on various tasks. Federated learning (FL) enables collaborative fine-tuning across different clients without sharing their data. Nonetheless, in practice, this instructional text data is highly heterogeneous in both quantity and distribution across clients, necessitating distinct model structures to best accommodate the variations. However, existing federated fine-tuning approaches either enforce the same model structure or rely on predefined ad-hoc architectures unaware of data distribution, resulting in suboptimal performance. To address this challenge, we propose FedAMoLE, a lightweight personalized federated fine-tuning framework that leverages data-driven heterogeneous model architectures. FedAMoLE introduces the Adaptive Mixture of LoRA Experts (AMoLE) module, which facilitates model heterogeneity with minimal communication overhead by allocating varying numbers of LoRA-based domain experts to each client. Furthermore, we develop a reverse selection-based expert assignment (RSEA) strategy, which enables data-driven model architecture adjustment during fine-tuning by allowing domain experts to select clients that best align with their knowledge domains. Extensive experiments across six different scenarios of data heterogeneity demonstrate that FedAMoLE significantly outperforms existing methods for federated LLM fine-tuning, achieving superior accuracy while maintaining good scalability.

Title: TEA: Trajectory Encoding Augmentation for Robust and Transferable Policies in Offline Reinforcement Learning

Authors: Batıkan Bora Ormancı, Phillip Swazinna, Steffen Udluft, Thomas A. Runkler
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19133
Pdf URL: https://arxiv.org/pdf/2411.19133
Copy Paste: [[2411.19133]] TEA: Trajectory Encoding Augmentation for Robust and Transferable Policies in Offline Reinforcement Learning(https://arxiv.org/abs/2411.19133)
Keywords: robust
Abstract: In this paper, we investigate offline reinforcement learning (RL) with the goal of training a single robust policy that generalizes effectively across environments with unseen dynamics. We propose a novel approach, Trajectory Encoding Augmentation (TEA), which extends the state space by integrating latent representations of environmental dynamics obtained from sequence encoders, such as AutoEncoders. Our findings show that incorporating these encodings with TEA improves the transferability of a single policy to novel environments with new dynamics, surpassing methods that rely solely on unmodified states. These results indicate that TEA captures critical, environment-specific characteristics, enabling RL agents to generalize effectively across dynamic conditions.

Title: On Moving Object Segmentation from Monocular Video with Transformers

Authors: Christian Homeyer, Christoph Schnörr
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19141
Pdf URL: https://arxiv.org/pdf/2411.19141
Copy Paste: [[2411.19141]] On Moving Object Segmentation from Monocular Video with Transformers(https://arxiv.org/abs/2411.19141)
Keywords: transformer, segmentation
Abstract: Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion and 3D geometry. Combining both recognition and reconstruction boils down to a fusion problem, where appearance and motion features need to be combined for classification and segmentation. In this paper, we present a novel fusion architecture for monocular motion segmentation - M3Former, which leverages the strong performance of transformers for segmentation and multi-modal fusion. As reconstructing motion from monocular video is ill-posed, we systematically analyze different 2D and 3D motion representations for this problem and their importance for segmentation performance. Finally, we analyze the effect of training data and show that diverse datasets are required to achieve SotA performance on Kitti and Davis.

Title: Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Authors: Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19146
Pdf URL: https://arxiv.org/pdf/2411.19146
Copy Paste: [[2411.19146]] Puzzle: Distillation-Based NAS for Inference-Optimized LLMs(https://arxiv.org/abs/2411.19146)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their adoption is limited by high computational costs during inference. While increasing parameter counts enhances accuracy, it also widens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a framework to accelerate LLM inference on specific hardware while preserving their capabilities. Through an innovative application of neural architecture search (NAS) at an unprecedented scale, Puzzle systematically optimizes models with tens of billions of parameters under hardware constraints. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We demonstrate the real-world impact of our framework through Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model derived from Llama-3.1-70B-Instruct. Nemotron-51B achieves a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while preserving 98.4% of the original model's capabilities. Nemotron-51B currently stands as the most accurate language model capable of inference on a single GPU with large batch sizes. Remarkably, this transformation required just 45B training tokens, compared to over 15T tokens used for the 70B model it was derived from. This establishes a new paradigm where powerful models can be optimized for efficient deployment with only negligible compromise of their capabilities, demonstrating that inference performance, not parameter count alone, should guide model selection. With the release of Nemotron-51B and the presentation of the Puzzle framework, we provide practitioners immediate access to state-of-the-art language modeling capabilities at significantly reduced computational costs.

Title: LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair

Authors: Xue Song, Jiequan Cui, Hanwang Zhang, Jiaxin Shi, Jingjing Chen, Chi Zhang, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19156
Pdf URL: https://arxiv.org/pdf/2411.19156
Copy Paste: [[2411.19156]] LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair(https://arxiv.org/abs/2411.19156)
Keywords: interpretability
Abstract: In this paper, we propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs. Compared to the ambiguities, insufficient specificity, and diverse interpretations of natural language, visual instructions can accurately reflect users' intent. Building on the success of LoRA in text-based image editing and generation, we dynamically learn an instruction-specific LoRA to encode the "change" in a before-after image pair, enhancing the interpretability and reusability of our model. Furthermore, generalizable models for image editing with visual instructions typically require quad data, i.e., a before-after image pair, along with query and target images. Due to the scarcity of such quad data, existing models are limited to a narrow range of visual instructions. To overcome this limitation, we introduce the LoRA Reverse optimization technique, enabling large-scale training with paired data alone. Extensive qualitative and quantitative experiments demonstrate that our model produces high-quality images that align with user intent and support a broad spectrum of real-world visual instructions.

Title: A Game-Theoretic Approach to the Study of Blockchain's Robustness

Authors: Ulysse Pavloff
Subjects: cs.CR, cs.GT
Abstract URL: https://arxiv.org/abs/2411.19175
Pdf URL: https://arxiv.org/pdf/2411.19175
Copy Paste: [[2411.19175]] A Game-Theoretic Approach to the Study of Blockchain's Robustness(https://arxiv.org/abs/2411.19175)
Keywords: attack, robust
Abstract: Blockchains have sparked global interest in recent years, gaining importance as they increasingly influence technology and finance. This thesis investigates the robustness of blockchain protocols, specifically focusing on Ethereum Proof-of-Stake. We define robustness in terms of two critical properties: Safety, which ensures that the blockchain will not have permanent conflicting blocks, and Liveness, which guarantees the continuous addition of new reliable blocks. Our research addresses the gap between traditional distributed systems approaches, which classify agents as either honest or Byzantine (i.e., malicious or faulty), and game-theoretic models that consider rational agents driven by incentives. We explore how incentives impact the robustness with both approaches. The thesis comprises three distinct analyses. First, we formalize the Ethereum PoS protocol, defining its properties and examining potential vulnerabilities through a distributed systems perspective. We identify that certain attacks can undermine the system's robustness. Second, we analyze the inactivity leak mechanism, a critical feature of Ethereum PoS, highlighting its role in maintaining system liveness during network disruptions but at the cost of safety. Finally, we employ game-theoretic models to study the strategies of rational validators within Ethereum PoS, identifying conditions under which these agents might deviate from the prescribed protocol to maximize their rewards. Our findings contribute to a deeper understanding of the importance of incentive mechanisms for blockchain robustness and provide insights into designing more resilient blockchain protocols.

Title: SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Authors: Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19182
Pdf URL: https://arxiv.org/pdf/2411.19182
Copy Paste: [[2411.19182]] SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation(https://arxiv.org/abs/2411.19182)
Keywords: diffusion, generative, large language model
Abstract: Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.

Title: Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs

Authors: Anirudh Phukan, Divyansh, Harshit Kumar Morj, Vaishnavi, Apoorv Saxena, Koustava Goswami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19187
Pdf URL: https://arxiv.org/pdf/2411.19187
Copy Paste: [[2411.19187]] Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs(https://arxiv.org/abs/2411.19187)
Keywords: robust, large language model, segmentation
Abstract: The rapid development of Large Multimodal Models (LMMs) has significantly advanced multimodal understanding by harnessing the language abilities of Large Language Models (LLMs) and integrating modality-specific encoders. However, LMMs are plagued by hallucinations that limit their reliability and adoption. While traditional methods to detect and mitigate these hallucinations often involve costly training or rely heavily on external models, recent approaches utilizing internal model features present a promising alternative. In this paper, we critically assess the limitations of the state-of-the-art training-free technique, the logit lens, in handling generalized visual hallucinations. We introduce a refined method that leverages contextual token embeddings from middle layers of LMMs. This approach significantly improves hallucination detection and grounding across diverse categories, including actions and OCR, while also excelling in tasks requiring contextual understanding, such as spatial relations and attribute comparison. Our novel grounding technique yields highly precise bounding boxes, facilitating a transition from Zero-Shot Object Segmentation to Grounded Visual Question Answering. Our contributions pave the way for more reliable and interpretable multimodal models.

Title: Video Depth without Video Models

Authors: Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, Konrad Schindler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19189
Pdf URL: https://arxiv.org/pdf/2411.19189
Copy Paste: [[2411.19189]] Video Depth without Video Models(https://arxiv.org/abs/2411.19189)
Keywords: robust, diffusion
Abstract: Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations; including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets. (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: this http URL.

Title: Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints

Authors: Pekka Malo, Lauri Viitasaari, Antti Suominen, Eeva Vilkkumaa, Olli Tahvonen
Subjects: cs.LG, cs.AI, math.OC, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2411.19193
Pdf URL: https://arxiv.org/pdf/2411.19193
Copy Paste: [[2411.19193]] Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints(https://arxiv.org/abs/2411.19193)
Keywords: robust
Abstract: This paper studies reinforcement learning (RL) in infinite-horizon dynamic decision processes with almost-sure safety constraints. Such safety-constrained decision processes are central to applications in autonomous systems, finance, and resource management, where policies must satisfy strict, state-dependent constraints. We consider a doubly-regularized RL framework that combines reward and parameter regularization to address these constraints within continuous state-action spaces. Specifically, we formulate the problem as a convex regularized objective with parametrized policies in the mean-field regime. Our approach leverages recent developments in mean-field theory and Wasserstein gradient flows to model policies as elements of an infinite-dimensional statistical manifold, with policy updates evolving via gradient flows on the space of parameter distributions. Our main contributions include establishing solvability conditions for safety-constrained problems, defining smooth and bounded approximations that facilitate gradient flows, and demonstrating exponential convergence towards global solutions under sufficient regularization. We provide general conditions on regularization functions, encompassing standard entropy regularization as a special case. The results also enable a particle method implementation for practical RL applications. The theoretical insights and convergence guarantees presented here offer a robust framework for safe RL in complex, high-dimensional decision-making problems.

Title: An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation

Authors: Joy Mahapatra, Utpal Garain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19203
Pdf URL: https://arxiv.org/pdf/2411.19203
Copy Paste: [[2411.19203]] An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation(https://arxiv.org/abs/2411.19203)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown exceptional performance across various Data-to-Text Generation (DTG) tasks. However, generating factually consistent text in DTG remains challenging for LLMs. Despite this, in-depth evaluations of LLM factual consistency for DTG remain missing in the current literature. This paper addresses this gap by providing an extensive evaluation of factual consistency in LLMs for DTG. Our evaluation covers five widely used DTG datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and five prominent LLM families (T5, BART, OPT, BLOOM, and Llama 2). To ensure a thorough evaluation of factual consistency, we use four state-of-the-art automatic metrics and include essential human assessments. Our extensive evaluations reveals three key findings regarding factual consistency in LLMs for DTG. First, Llama 2 often excels in generating factually consistent text, although smaller models like T5 and BART can achieve strong factual consistency on larger, lexically less-diverse datasets. Second, the average rate of change (AROC) indicates that increasing model size (number of model trainable parameters) generally enhances factual consistency of LLMs in DTG. Third, we observe that source-reference divergence (i.e., when the reference text diverges semantically from the source) typically reduces the factual consistency of LLMs in DTG.

Title: Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

Authors: Finlay G. C. Hudson, William A. P. Smith
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19210
Pdf URL: https://arxiv.org/pdf/2411.19210
Copy Paste: [[2411.19210]] Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation(https://arxiv.org/abs/2411.19210)
Keywords: segmentation
Abstract: We present Track Anything Behind Everything (TABE), a novel dataset, pipeline, and evaluation framework for zero-shot amodal completion from visible masks. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. Our dataset, TABE-51 provides highly accurate ground truth amodal segmentation masks without the need for human estimation or 3D reconstruction. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. We also introduce a specialised evaluation framework that isolates amodal completion performance, free from the influence of traditional visual segmentation metrics.

Title: Cross-Spectral Attention for Unsupervised RGB-IR Face Verification and Person Re-identification

Authors: Kshitij Nikhal, Cedric Nimpa Fondje, Benjamin S. Riggan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19215
Pdf URL: https://arxiv.org/pdf/2411.19215
Copy Paste: [[2411.19215]] Cross-Spectral Attention for Unsupervised RGB-IR Face Verification and Person Re-identification(https://arxiv.org/abs/2411.19215)
Keywords: robust, biometric
Abstract: Cross-spectral biometrics, such as matching imagery of faces or persons from visible (RGB) and infrared (IR) bands, have rapidly advanced over the last decade due to increasing sensitivity, size, quality, and ubiquity of IR focal plane arrays and enhanced analytics beyond the visible spectrum. Current techniques for mitigating large spectral disparities between RGB and IR imagery often include learning a discriminative common subspace by exploiting precisely curated data acquired from multiple spectra. Although there are challenges with determining robust architectures for extracting common information, a critical limitation for supervised methods is poor scalability in terms of acquiring labeled data. Therefore, we propose a novel unsupervised cross-spectral framework that combines (1) a new pseudo triplet loss with cross-spectral voting, (2) a new cross-spectral attention network leveraging multiple subspaces, and (3) structured sparsity to perform more discriminative cross-spectral clustering. We extensively compare our proposed RGB-IR biometric learning framework (and its individual components) with recent and previous state-of-the-art models on two challenging benchmark datasets: DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF) and RegDB person re-identification dataset, and, in some cases, achieve performance superior to completely supervised methods.

Title: Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection

Authors: Tsun-Hin Cheung, Ka-Chun Fung, Songjiang Lai, Kwan-Ho Lin, Vincent Ng, Kin-Man Lam
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2411.19220
Pdf URL: https://arxiv.org/pdf/2411.19220
Copy Paste: [[2411.19220]] Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection(https://arxiv.org/abs/2411.19220)
Keywords: large language model
Abstract: Identifying defects and anomalies in industrial products is a critical quality control task. Traditional manual inspection methods are slow, subjective, and error-prone. In this work, we propose a novel zero-shot training-free approach for automated industrial image anomaly detection using a multimodal machine learning pipeline, consisting of three foundation models. Our method first uses a large language model, i.e., GPT-3. generate text prompts describing the expected appearances of normal and abnormal products. We then use a grounding object detection model, called Grounding DINO, to locate the product in the image. Finally, we compare the cropped product image patches to the generated prompts using a zero-shot image-text matching model, called CLIP, to identify any anomalies. Our experiments on two datasets of industrial product images, namely MVTec-AD and VisA, demonstrate the effectiveness of this method, achieving high accuracy in detecting various types of defects and anomalies without the need for model training. Our proposed model enables efficient, scalable, and objective quality control in industrial manufacturing settings.

Title: Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Authors: Xinxu Wei, Kanhao Zhao, Yong Jiao, Nancy B. Carlisle, Hua Xie, Yu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19230
Pdf URL: https://arxiv.org/pdf/2411.19230
Copy Paste: [[2411.19230]] Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG(https://arxiv.org/abs/2411.19230)
Keywords: robust, generative
Abstract: Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this by framing it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled/labeled and high/low-density EEG data. To fully leverage the abundant unlabeled EEG data, we introduce a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates Graph Contrastive Pre-training and Graph Masked Autoencoder Pre-training. This approach synergistically combines contrastive and generative pre-training techniques by reconstructing contrastive samples and contrasting the reconstructions. For knowledge distillation from high-density to low-density EEG data, we propose a Graph Topology Distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data, effectively handling missing electrodes through contrastive distillation. To integrate transfer learning and distillation, we jointly pre-train the teacher and student models by contrasting their queries and keys during pre-training, enabling robust distillers for downstream tasks. We demonstrate the effectiveness of our method on four classification tasks across two clinical EEG datasets with abundant unlabeled data and limited labeled data. The experimental results show that our approach significantly outperforms contemporary methods in both efficiency and accuracy.

Title: Z-STAR+: A Zero-shot Style Transfer Method via Adjusting Style Distribution

Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19231
Pdf URL: https://arxiv.org/pdf/2411.19231
Copy Paste: [[2411.19231]] Z-STAR+: A Zero-shot Style Transfer Method via Adjusting Style Distribution(https://arxiv.org/abs/2411.19231)
Keywords: extraction, diffusion, generative
Abstract: Style transfer presents a significant challenge, primarily centered on identifying an appropriate style representation. Conventional methods employ style loss, derived from second-order statistics or contrastive learning, to constrain style representation in the stylized result. However, these pre-defined style representations often limit stylistic expression, leading to artifacts. In contrast to existing approaches, we have discovered that latent features in vanilla diffusion models inherently contain natural style and content distributions. This allows for direct extraction of style information and seamless integration of generative priors into the content image without necessitating retraining. Our method adopts dual denoising paths to represent content and style references in latent space, subsequently guiding the content image denoising process with style latent codes. We introduce a Cross-attention Reweighting module that utilizes local content features to query style image information best suited to the input patch, thereby aligning the style distribution of the stylized results with that of the style image. Furthermore, we design a scaled adaptive instance normalization to mitigate inconsistencies in color distribution between style and stylized images on a global scale. Through theoretical analysis and extensive experimentation, we demonstrate the effectiveness and superiority of our diffusion-based \uline{z}ero-shot \uline{s}tyle \uline{t}ransfer via \uline{a}djusting style dist\uline{r}ibution, termed Z-STAR+.

Title: Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes

Authors: Thomas Wimmer, Michael Oechsle, Michael Niemeyer, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19233
Pdf URL: https://arxiv.org/pdf/2411.19233
Copy Paste: [[2411.19233]] Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes(https://arxiv.org/abs/2411.19233)
Keywords: robust, diffusion, generative
Abstract: State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack "liveliness," a key component for creating engaging 3D experiences. Recently, novel video diffusion models generate realistic videos with complex motion and enable animations of 2D images, however they cannot naively be used to animate 3D scenes as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation, or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.

Title: SmartLLMSentry: A Comprehensive LLM Based Smart Contract Vulnerability Detection Framework

Authors: Oualid Zaazaa, Hanan El Bakkali
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19234
Pdf URL: https://arxiv.org/pdf/2411.19234
Copy Paste: [[2411.19234]] SmartLLMSentry: A Comprehensive LLM Based Smart Contract Vulnerability Detection Framework(https://arxiv.org/abs/2411.19234)
Keywords: security, large language model
Abstract: Smart contracts are essential for managing digital assets in blockchain networks, highlighting the need for effective security measures. This paper introduces SmartLLMSentry, a novel framework that leverages large language models (LLMs), specifically ChatGPT with in-context training, to advance smart contract vulnerability detection. Traditional rule-based frameworks have limitations in integrating new detection rules efficiently. In contrast, SmartLLMSentry utilizes LLMs to streamline this process. We created a specialized dataset of five randomly selected vulnerabilities for model training and evaluation. Our results show an exact match accuracy of 91.1% with sufficient data, although GPT-4 demonstrated reduced performance compared to GPT-3 in rule generation. This study illustrates that SmartLLMSentry significantly enhances the speed and accuracy of vulnerability detection through LLMdriven rule integration, offering a new approach to improving Blockchain security and addressing previously underexplored vulnerabilities in smart contracts.

Title: InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Authors: Haijie Li, Yanmin Wu, Jiarui Meng, Qiankun Gao, Zhiyao Zhang, Ronggang Wang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19235
Pdf URL: https://arxiv.org/pdf/2411.19235
Copy Paste: [[2411.19235]] InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception(https://arxiv.org/abs/2411.19235)
Keywords: segmentation
Abstract: 3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: this https URL

Title: Controlling Participation in Federated Learning with Feedback

Authors: Michael Cummins, Guner Dilsad Er, Michael Muehlebach
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2411.19242
Pdf URL: https://arxiv.org/pdf/2411.19242
Copy Paste: [[2411.19242]] Controlling Participation in Federated Learning with Feedback(https://arxiv.org/abs/2411.19242)
Keywords: federate
Abstract: We address the problem of client participation in federated learning, where traditional methods typically rely on a random selection of a small subset of clients for each training round. In contrast, we propose FedBack, a deterministic approach that leverages control-theoretic principles to manage client participation in ADMM-based federated learning. FedBack models client participation as a discrete-time dynamical system and employs an integral feedback controller to adjust each client's participation rate individually, based on the client's optimization dynamics. We provide global convergence guarantees for our approach by building on the recent federated learning research. Numerical experiments on federated image classification demonstrate that FedBack achieves up to 50\% improvement in communication and computational efficiency over algorithms that rely on a random selection of clients.

Title: Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation

Authors: Xuehao Cui, Guangyang Wu, Zhenghao Gan, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19246
Pdf URL: https://arxiv.org/pdf/2411.19246
Copy Paste: [[2411.19246]] Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation(https://arxiv.org/abs/2411.19246)
Keywords: robust, diffusion
Abstract: Existing methods to generate aesthetic QR codes, such as image and style transfer techniques, tend to compromise either the visual appeal or the scannability of QR codes when they incorporate human face identity. Addressing these imperfections, we present Face2QR-a novel pipeline specifically designed for generating personalized QR codes that harmoniously blend aesthetics, face identity, and scannability. Our pipeline introduces three innovative components. First, the ID-refined QR integration (IDQR) seamlessly intertwines the background styling with face ID, utilizing a unified Stable Diffusion (SD)-based framework with control networks. Second, the ID-aware QR ReShuffle (IDRS) effectively rectifies the conflicts between face IDs and QR patterns, rearranging QR modules to maintain the integrity of facial features without compromising scannability. Lastly, the ID-preserved Scannability Enhancement (IDSE) markedly boosts scanning robustness through latent code optimization, striking a delicate balance between face ID, aesthetic quality and QR functionality. In comprehensive experiments, Face2QR demonstrates remarkable performance, outperforming existing approaches, particularly in preserving facial recognition features within custom QR code designs. Codes are available at $\href{this https URL}{\text{this URL link}}$.

Title: Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention

Authors: Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, Huan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19261
Pdf URL: https://arxiv.org/pdf/2411.19261
Copy Paste: [[2411.19261]] Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention(https://arxiv.org/abs/2411.19261)
Keywords: diffusion
Abstract: Training-free diffusion models have achieved remarkable progress in generating multi-subject consistent images within open-domain scenarios. The key idea of these methods is to incorporate reference subject information within the attention layer. However, existing methods still obtain suboptimal performance when handling numerous subjects. This paper reveals the two primary issues contributing to this deficiency. Firstly, there is undesired interference among different subjects within the target image. Secondly, tokens tend to reference nearby tokens, which reduces the effectiveness of the attention mechanism when there is a significant positional difference between subjects in reference and target images. To address these challenges, we propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Specifically, Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating the subject fusion. On the other hand, Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images. This ensures that subjects in the target image can better reference those in the reference image, thereby maintaining better consistency. Extensive experiments demonstrate that the proposed methods significantly enhance multi-subject consistency, outperforming all existing methods in open-domain scenarios.

Title: AGS-Mesh: Adaptive Gaussian Splatting and Meshing with Geometric Priors for Indoor Room Reconstruction Using Smartphones

Authors: Xuqian Ren, Matias Turkulainen, Jiepeng Wang, Otto Seiskari, Iaroslav Melekhov, Juho Kannala, Esa Rahtu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19271
Pdf URL: https://arxiv.org/pdf/2411.19271
Copy Paste: [[2411.19271]] AGS-Mesh: Adaptive Gaussian Splatting and Meshing with Geometric Priors for Indoor Room Reconstruction Using Smartphones(https://arxiv.org/abs/2411.19271)
Keywords: extraction
Abstract: Geometric priors are often used to enhance 3D reconstruction. With many smartphones featuring low-resolution depth sensors and the prevalence of off-the-shelf monocular geometry estimators, incorporating geometric priors as regularization signals has become common in 3D vision tasks. However, the accuracy of depth estimates from mobile devices is typically poor for highly detailed geometry, and monocular estimators often suffer from poor multi-view consistency and precision. In this work, we propose an approach for joint surface depth and normal refinement of Gaussian Splatting methods for accurate 3D reconstruction of indoor scenes. We develop supervision strategies that adaptively filters low-quality depth and normal estimates by comparing the consistency of the priors during optimization. We mitigate regularization in regions where prior estimates have high uncertainty or ambiguities. Our filtering strategy and optimization design demonstrate significant improvements in both mesh estimation and novel-view synthesis for both 3D and 2D Gaussian Splatting-based methods on challenging indoor room datasets. Furthermore, we explore the use of alternative meshing strategies for finer geometry extraction. We develop a scale-aware meshing strategy inspired by TSDF and octree-based isosurface extraction, which recovers finer details from Gaussian models compared to other commonly used open-source meshing tools. Our code is released in this https URL.

Title: On-chip Hyperspectral Image Segmentation with Fully Convolutional Networks for Scene Understanding in Autonomous Driving

Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe, M. Victoria Martínez, Unai Martínez-Corral, Óscar Mata Carballeira, Inés del Campo
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.19274
Pdf URL: https://arxiv.org/pdf/2411.19274
Copy Paste: [[2411.19274]] On-chip Hyperspectral Image Segmentation with Fully Convolutional Networks for Scene Understanding in Autonomous Driving(https://arxiv.org/abs/2411.19274)
Keywords: segmentation
Abstract: Most of current computer vision-based advanced driver assistance systems (ADAS) perform detection and tracking of objects quite successfully under regular conditions. However, under adverse weather and changing lighting conditions, and in complex situations with many overlapping objects, these systems are not completely reliable. The spectral reflectance of the different objects in a driving scene beyond the visible spectrum can offer additional information to increase the reliability of these systems, especially under challenging driving conditions. Furthermore, this information may be significant enough to develop vision systems that allow for a better understanding and interpretation of the whole driving scene. In this work we explore the use of snapshot, video-rate hyperspectral imaging (HSI) cameras in ADAS on the assumption that the near infrared (NIR) spectral reflectance of different materials can help to better segment the objects in real driving scenarios. To do this, we have used the HSI-Drive 1.1 dataset to perform various experiments on spectral classification algorithms. However, the information retrieval of hyperspectral recordings in natural outdoor scenarios is challenging, mainly because of deficient colour constancy and other inherent shortcomings of current snapshot HSI technology, which poses some limitations to the development of pure spectral classifiers. In consequence, in this work we analyze to what extent the spatial features codified by standard, tiny fully convolutional network (FCN) models can improve the performance of HSI segmentation systems for ADAS applications. The abstract above is truncated due to submission limits. For the full abstract, please refer to the published article.

Title: OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration

Authors: Yiming Zuo, Willow Yang, Zeyu Ma, Jia Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19278
Pdf URL: https://arxiv.org/pdf/2411.19278
Copy Paste: [[2411.19278]] OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration(https://arxiv.org/abs/2411.19278)
Keywords: robust
Abstract: Depth completion (DC) aims to predict a dense depth map from an RGB image and sparse depth observations. Existing methods for DC generalize poorly on new datasets or unseen sparse depth patterns, limiting their practical applications. We propose OMNI-DC, a highly robust DC model that generalizes well across various scenarios. Our method incorporates a novel multi-resolution depth integration layer and a probability-based loss, enabling it to deal with sparse depth maps of varying densities. Moreover, we train OMNI-DC on a mixture of synthetic datasets with a scale normalization technique. To evaluate our model, we establish a new evaluation protocol named Robust-DC for zero-shot testing under various sparse depth patterns. Experimental results on Robust-DC and conventional benchmarks show that OMNI-DC significantly outperforms the previous state of the art. The checkpoints, training code, and evaluations are available at this https URL.

Title: GMS-VINS:Multi-category Dynamic Objects Semantic Segmentation for Enhanced Visual-Inertial Odometry Using a Promptable Foundation Model

Authors: Rui Zhou, Jingbin Liu, Junbin Xie, Jianyu Zhang, Yingze Hu, Jiele Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19289
Pdf URL: https://arxiv.org/pdf/2411.19289
Copy Paste: [[2411.19289]] GMS-VINS:Multi-category Dynamic Objects Semantic Segmentation for Enhanced Visual-Inertial Odometry Using a Promptable Foundation Model(https://arxiv.org/abs/2411.19289)
Keywords: robust, segmentation
Abstract: Visual-inertial odometry (VIO) is widely used in various fields, such as robots, drones, and autonomous vehicles, due to its low cost and complementary sensors. Most VIO methods presuppose that observed objects are static and time-invariant. However, real-world scenes often feature dynamic objects, compromising the accuracy of pose estimation. These moving entities include cars, trucks, buses, motorcycles, and pedestrians. The diversity and partial occlusion of these objects present a tough challenge for existing dynamic object removal techniques. To tackle this challenge, we introduce GMS-VINS, which integrates an enhanced SORT algorithm along with a robust multi-category segmentation framework into VIO, thereby improving pose estimation accuracy in environments with diverse dynamic objects and frequent occlusions. Leveraging the promptable foundation model, our solution efficiently tracks and segments a wide range of object categories. The enhanced SORT algorithm significantly improves the reliability of tracking multiple dynamic objects, especially in urban settings with partial occlusions or swift movements. We evaluated our proposed method using multiple public datasets representing various scenes, as well as in a real-world scenario involving diverse dynamic objects. The experimental results demonstrate that our proposed method performs impressively in multiple scenarios, outperforming other state-of-the-art methods. This highlights its remarkable generalization and adaptability in diverse dynamic environments, showcasing its potential to handle various dynamic objects in practical applications.

Title: SADG: Segment Any Dynamic Gaussian Without Object Trackers

Authors: Yun-Jin Li, Mariia Gladkova, Yan Xia, Daniel Cremers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19290
Pdf URL: https://arxiv.org/pdf/2411.19290
Copy Paste: [[2411.19290]] SADG: Segment Any Dynamic Gaussian Without Object Trackers(https://arxiv.org/abs/2411.19290)
Keywords: segmentation
Abstract: Understanding dynamic 3D scenes is fundamental for various applications, including extended reality (XR) and autonomous driving. Effectively integrating semantic information into 3D reconstruction enables holistic representation that opens opportunities for immersive and interactive applications. We introduce SADG, Segment Any Dynamic Gaussian Without Object Trackers, a novel approach that combines dynamic Gaussian Splatting representation and semantic information without reliance on object IDs. In contrast to existing works, we do not rely on supervision based on object identities to enable consistent segmentation of dynamic 3D objects. To this end, we propose to learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining. The learned Gaussian features can be effectively clustered without further post-processing. This enables fast computation for further object-level editing, such as object removal, composition, and style transfer by manipulating the Gaussians in the scene. We further extend several dynamic novel-view datasets with segmentation benchmarks to enable testing of learned feature fields from unseen viewpoints. We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes along with its effectiveness for further downstream editing tasks.

Title: Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows

Authors: Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19295
Pdf URL: https://arxiv.org/pdf/2411.19295
Copy Paste: [[2411.19295]] Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows(https://arxiv.org/abs/2411.19295)
Keywords: extraction
Abstract: Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.

Title: Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation

Authors: Son Thai Ly, Hien V. Nguyen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19297
Pdf URL: https://arxiv.org/pdf/2411.19297
Copy Paste: [[2411.19297]] Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation(https://arxiv.org/abs/2411.19297)
Keywords: transformer
Abstract: Adapting vision transformer foundation models through parameter-efficient fine-tuning (PEFT) methods has become increasingly popular. These methods optimize a limited subset of parameters, enabling efficient adaptation without the need to fine-tune the entire model while still achieving competitive performance. However, traditional PEFT methods may limit the model's capacity to capture complex patterns, especially those associated with high-frequency spectra. This limitation becomes particularly problematic as existing research indicates that high-frequency features are crucial for distinguishing subtle image structures. To address this issue, we introduce FreqFit, a novel Frequency Fine-tuning module between ViT blocks to enhance model adaptability. FreqFit is simple yet surprisingly effective, and can be integrated with all existing PEFT methods to boost their performance. By manipulating features in the frequency domain, our approach allows models to capture subtle patterns more effectively. Extensive experiments on 24 datasets, using both supervised and self-supervised foundational models with various state-of-the-art PEFT methods, reveal that FreqFit consistently improves performance over the original PEFT methods with performance gains ranging from 1% to 16%. For instance, FreqFit-LoRA surpasses the performances of state-of-the-art baselines on CIFAR100 by more than 10% even without applying regularization or strong augmentation. For reproducibility purposes, the source code is available at this https URL.

Title: SAMa: Material-aware 3D Selection and Segmentation

Authors: Michael Fischer, Iliyan Georgiev, Thibault Groueix, Vladimir G. Kim, Tobias Ritschel, Valentin Deschaintre
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19322
Pdf URL: https://arxiv.org/pdf/2411.19322
Copy Paste: [[2411.19322]] SAMa: Material-aware 3D Selection and Segmentation(https://arxiv.org/abs/2411.19322)
Keywords: segmentation
Abstract: Decomposing 3D assets into material parts is a common task for artists and creators, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for various 3D representations. Building on the recently introduced SAM2 video selection model, we extend its capabilities to the material domain. We leverage the model's cross-view consistency to create a 3D-consistent intermediate material-similarity representation in the form of a point cloud from a sparse set of views. Nearest-neighbour lookups in this similarity cloud allow us to efficiently reconstruct accurate continuous selection masks over objects' surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for contrastive learning or feature-field pre-processing, and performs optimization-free selection in seconds. Our approach works on arbitrary 3D representations and outperforms several strong baselines in terms of selection accuracy and multiview consistency. It enables several compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output, or selecting and editing materials on NeRFs and 3D-Gaussians.

Title: Trajectory Attention for Fine-grained Video Motion Control

Authors: Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19324
Pdf URL: https://arxiv.org/pdf/2411.19324
Copy Paste: [[2411.19324]] Trajectory Attention for Fine-grained Video Motion Control(https://arxiv.org/abs/2411.19324)
Keywords: diffusion
Abstract: Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when the trajectory is only partially available. Experiments on camera motion control for images and videos demonstrate significant improvements in precision and long-range consistency while maintaining high-quality generation. Furthermore, we show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing, where it excels in maintaining content consistency over large spatial and temporal ranges.

Title: Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.19331
Pdf URL: https://arxiv.org/pdf/2411.19331
Copy Paste: [[2411.19331]] Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation(https://arxiv.org/abs/2411.19331)
Keywords: transformer, segmentation
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: this https URL.

Title: PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning

Authors: Shenghui Li, Edith C.-H. Ngai, Fanghua Ye, Thiemo Voigt
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19335
Pdf URL: https://arxiv.org/pdf/2411.19335
Copy Paste: [[2411.19335]] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2411.19335)
Keywords: security, privacy, defense, attack, robust, federate
Abstract: Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising paradigm for privacy-preserving and efficient adaptation of Pre-trained Language Models (PLMs) in Federated Learning (FL) settings. It preserves data privacy by keeping the data decentralized and training the model on local devices, ensuring that raw data never leaves the user's device. Moreover, the integration of PEFT methods such as LoRA significantly reduces the number of trainable parameters compared to fine-tuning the entire model, thereby minimizing communication costs and computational overhead. Despite its potential, the security implications of FedPEFT remain underexplored. This paper introduces a novel security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA), which exposes how PEFT can be exploited as an attack vector to circumvent PLMs' safety alignment and generate harmful content in response to malicious prompts. Our evaluation of PaaA reveals that with less than 1% of the model's parameters set as trainable, and a small subset of clients acting maliciously, the attack achieves an approximate 80% attack success rate using representative PEFT methods such as LoRA. To mitigate this threat, we further investigate potential defense strategies, including Robust Aggregation Schemes (RASs) and Post-PEFT Safety Alignment (PPSA). However, our empirical analysis highlights the limitations of these defenses, i.e., even the most advanced RASs, such as DnC and ClippedClustering, struggle to defend against PaaA in scenarios with highly heterogeneous data distributions. Similarly, while PPSA can reduce attack success rates to below 10%, it severely degrades the model's accuracy on the target task. Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure security and maintain the performance of the FedPEFT paradigm.

Title: Towards a Mechanistic Explanation of Diffusion Model Generalization

Authors: Matthew Niedoba, Berend Zwartsenberg, Kevin Murphy, Frank Wood
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19339
Pdf URL: https://arxiv.org/pdf/2411.19339
Copy Paste: [[2411.19339]] Towards a Mechanistic Explanation of Diffusion Model Generalization(https://arxiv.org/abs/2411.19339)
Keywords: diffusion
Abstract: We propose a mechanism for diffusion generalization based on local denoising operations. Through analysis of network and empirical denoisers, we identify local inductive biases in diffusion models. We demonstrate that local denoising operations can be used to approximate the optimal diffusion denoiser. Using a collection of patch-based, local empirical denoisers, we construct a denoiser which approximates the generalization behaviour of diffusion model denoisers over forward and reverse diffusion processes.

Title: CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

Authors: Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19346
Pdf URL: https://arxiv.org/pdf/2411.19346
Copy Paste: [[2411.19346]] CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections(https://arxiv.org/abs/2411.19346)
Keywords: robust, large language model
Abstract: In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL models require an additional supervised linear probing step, which relies on fully labeled data which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to largely enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification compared to CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFter across 11 diverse image classification datasets.

Title: Characterizing JavaScript Security Code Smells

Authors: Vikas Kambhampati, Nehaz Hussain Mohammed, Amin Milani Fard
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2411.19358
Pdf URL: https://arxiv.org/pdf/2411.19358
Copy Paste: [[2411.19358]] Characterizing JavaScript Security Code Smells(https://arxiv.org/abs/2411.19358)
Keywords: security
Abstract: JavaScript has been consistently among the most popular programming languages in the past decade. However, its dynamic, weakly-typed, and asynchronous nature can make it challenging to write maintainable code for developers without in-depth knowledge of the language. Consequently, many JavaScript applications tend to contain code smells that adversely influence program comprehension, maintenance, and debugging. Due to the widespread usage of JavaScript, code security is an important matter. While JavaScript code smells and detection techniques have been studied in the past, current work on security smells for JavaScript is scarce. Security code smells are coding patterns indicative of potential vulnerabilities or security weaknesses. Identifying security code smells can help developers to focus on areas where additional security measures may be needed. We present a set of 24 JavaScript security code smells, map them to a possible security awareness defined by Common Weakness Enumeration (CWE), explain possible refactoring, and explain our detection mechanism. We implement our security code smell detection on top of an existing open source tool that was proposed to detect general code smells in JavaScript.

Title: Libra: Leveraging Temporal Images for Biomedical Radiology Analysis

Authors: Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19378
Pdf URL: https://arxiv.org/pdf/2411.19378
Copy Paste: [[2411.19378]] Libra: Leveraging Temporal Images for Biomedical Radiology Analysis(https://arxiv.org/abs/2411.19378)
Keywords: large language model
Abstract: Radiology report generation (RRG) is a challenging task, as it requires a thorough understanding of medical images, integration of multiple temporal inputs, and accurate report generation. Effective interpretation of medical images, such as chest X-rays (CXRs), demands sophisticated visual-language reasoning to map visual findings to structured reports. Recent studies have shown that multimodal large language models (MLLMs) can acquire multimodal capabilities by aligning with pre-trained vision encoders. However, current approaches predominantly focus on single-image analysis or utilise rule-based symbolic processing to handle multiple images, thereby overlooking the essential temporal information derived from comparing current images with prior ones. To overcome this critical limitation, we introduce Libra, a temporal-aware MLLM tailored for CXR report generation using temporal images. Libra integrates a radiology-specific image encoder with a MLLM and utilises a novel Temporal Alignment Connector to capture and synthesise temporal information of images across different time points with unprecedented precision. Extensive experiments show that Libra achieves new state-of-the-art performance among the same parameter scale MLLMs for RRG tasks on the MIMIC-CXR. Specifically, Libra improves the RadCliQ metric by 12.9% and makes substantial gains across all lexical metrics compared to previous models.

Title: Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

Authors: Gaurav Rai, Ojaswa Sharma
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19381
Pdf URL: https://arxiv.org/pdf/2411.19381
Copy Paste: [[2411.19381]] Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints(https://arxiv.org/abs/2411.19381)
Keywords: diffusion
Abstract: Animating hand-drawn sketches using traditional tools is challenging and complex. Sketches provide a visual basis for explanations, and animating these sketches offers an experience of real-time scenarios. We propose an approach for animating a given input sketch based on a descriptive text prompt. Our method utilizes a parametric representation of the sketch's strokes. Unlike previous methods, which struggle to estimate smooth and accurate motion and often fail to preserve the sketch's topology, we leverage a pre-trained text-to-video diffusion model with SDS loss to guide the motion of the sketch's strokes. We introduce length-area (LA) regularization to ensure temporal consistency by accurately estimating the smooth displacement of control points across the frame sequence. Additionally, to preserve shape and avoid topology changes, we apply a shape-preserving As-Rigid-As-Possible (ARAP) loss to maintain sketch rigidity. Our method surpasses state-of-the-art performance in both quantitative and qualitative evaluations.

Title: DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

Authors: Shwetha Ram, Tal Neiman, Qianli Feng, Andrew Stuart, Son Tran, Trishul Chilimbi
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19390
Pdf URL: https://arxiv.org/pdf/2411.19390
Copy Paste: [[2411.19390]] DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models(https://arxiv.org/abs/2411.19390)
Keywords: diffusion
Abstract: Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off is made between prompt fidelity, subject fidelity and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with low prompt fidelity and diversity but high subject fidelity. This inherent trade-off limits the prompt fidelity, subject fidelity and diversity of generated images. In this work, we propose DreamBlend to combine the prompt fidelity from earlier checkpoints and the subject fidelity from later checkpoints during inference. We perform a cross attention guided image synthesis from a later checkpoint, guided by an image generated by an earlier checkpoint, for the same prompt. This enables generation of images with better subject fidelity, prompt fidelity and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.

Title: On the effectiveness of discrete representations in sparse mixture of experts

Authors: Giang Do, Kha Pham, Hung Le, Truyen Tran
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19402
Pdf URL: https://arxiv.org/pdf/2411.19402
Copy Paste: [[2411.19402]] On the effectiveness of discrete representations in sparse mixture of experts(https://arxiv.org/abs/2411.19402)
Keywords: robust, large language model
Abstract: Sparse mixture of experts (SMoE) is an effective solution for scaling up model capacity without increasing the computational costs. A crucial component of SMoE is the router, responsible for directing the input to relevant experts; however, it also presents a major weakness, leading to routing inconsistencies and representation collapse issues. Instead of fixing the router like previous works, we propose an alternative that assigns experts to input via indirection, which employs the discrete representation of input that points to the expert. The discrete representations are learnt via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating the VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods, while maintaining strong performance in fine-tuning tasks.

Title: AMO Sampler: Enhancing Text Rendering with Overshooting

Authors: Xixi Hu, Keyang Xu, Bo Liu, Qiang Liu, Hongliang Fei
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19415
Pdf URL: https://arxiv.org/pdf/2411.19415
Copy Paste: [[2411.19415]] AMO Sampler: Enhancing Text Rendering with Overshooting(https://arxiv.org/abs/2411.19415)
Keywords: diffusion
Abstract: Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. Sate-of-the-art models like Stable Diffusion 3 (SD3), Flux, and AuraFlow still struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for pretrained rectified flow (RF) models, by alternating between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we propose an Attention Modulated Overshooting sampler (AMO), which adaptively controls the strength of overshooting for each image patch according to their attention score with the text content. AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux without compromising overall image quality or increasing inference cost.

Title: Any-Resolution AI-Generated Image Detection by Spectral Learning

Authors: Dimitrios Karageorgiou, Symeon Papadopoulos, Ioannis Kompatsiaris, Efstratios Gavves
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.19417
Pdf URL: https://arxiv.org/pdf/2411.19417
Copy Paste: [[2411.19417]] Any-Resolution AI-Generated Image Detection by Spectral Learning(https://arxiv.org/abs/2411.19417)
Keywords: robust, generative
Abstract: Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations.

Title: Gradient Inversion Attack on Graph Neural Networks

Authors: Divya Anand Sinha, Yezi Liu, Ruijie Du, Yanning Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19440
Pdf URL: https://arxiv.org/pdf/2411.19440
Copy Paste: [[2411.19440]] Gradient Inversion Attack on Graph Neural Networks(https://arxiv.org/abs/2411.19440)
Keywords: privacy, protect, attack, steal, federate
Abstract: Graph federated learning is of essential importance for training over large graph datasets while protecting data privacy, where each client stores a subset of local graph data, while the server collects the local gradients and broadcasts only the aggregated gradients. Recent studies reveal that a malicious attacker can steal private image data from gradient exchanging of neural networks during federated learning. However, none of the existing works have studied the vulnerability of graph data and graph neural networks under such attack. To answer this question, the present paper studies the problem of whether private data can be recovered from leaked gradients in both node classification and graph classification tasks and { proposes a novel attack named Graph Leakage from Gradients (GLG)}. Two widely-used GNN frameworks are analyzed, namely GCN and GraphSAGE. The effects of different model settings on recovery are extensively discussed. Through theoretical analysis and empirical validation, it is shown that parts of the graph data can be leaked from the gradients.

Title: Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

Authors: Tian Yu, Shaolei Zhang, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19443
Pdf URL: https://arxiv.org/pdf/2411.19443
Copy Paste: [[2411.19443]] Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models(https://arxiv.org/abs/2411.19443)
Keywords: interpretability, large language model
Abstract: Iterative retrieval refers to the process in which the model continuously queries the retriever during generation to enhance the relevance of the retrieved knowledge, thereby improving the performance of Retrieval-Augmented Generation (RAG). Existing work typically employs few-shot prompting or manually constructed rules to implement iterative retrieval. This introduces additional inference overhead and overlooks the remarkable reasoning capabilities of Large Language Models (LLMs). In this paper, we introduce Auto-RAG, an autonomous iterative retrieval model centered on the LLM's powerful decision-making capabilities. Auto-RAG engages in multi-turn dialogues with the retriever, systematically planning retrievals and refining queries to acquire valuable knowledge. This process continues until sufficient external information is gathered, at which point the results are presented to the user. To this end, we develop a method for autonomously synthesizing reasoning-based decision-making instructions in iterative retrieval and fine-tuned the latest open-source LLMs. The experimental results indicate that Auto-RAG is capable of autonomous iterative interaction with the retriever, effectively leveraging the remarkable reasoning and decision-making abilities of LLMs, which lead to outstanding performance across six benchmarks. Further analysis reveals that Auto-RAG can autonomously adjust the number of iterations based on the difficulty of the questions and the utility of the retrieved knowledge, without requiring any human intervention. Moreover, Auto-RAG expresses the iterative retrieval process in natural language, enhancing interpretability while providing users with a more intuitive experience\footnote{Code is available at \url{this https URL}.

Title: Adaptive Interactive Segmentation for Multimodal Medical Imaging via Selection Engine

Authors: Zhi Li, Kai Zhao, Yaqi Wang, Shuai Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19447
Pdf URL: https://arxiv.org/pdf/2411.19447
Copy Paste: [[2411.19447]] Adaptive Interactive Segmentation for Multimodal Medical Imaging via Selection Engine(https://arxiv.org/abs/2411.19447)
Keywords: robust, interpretability, segmentation
Abstract: In medical image analysis, achieving fast, efficient, and accurate segmentation is essential for automated diagnosis and treatment. Although recent advancements in deep learning have significantly improved segmentation accuracy, current models often face challenges in adaptability and generalization, particularly when processing multi-modal medical imaging data. These limitations stem from the substantial variations between imaging modalities and the inherent complexity of medical data. To address these challenges, we propose the Strategy-driven Interactive Segmentation Model (SISeg), built on SAM2, which enhances segmentation performance across various medical imaging modalities by integrating a selection engine. To mitigate memory bottlenecks and optimize prompt frame selection during the inference of 2D image sequences, we developed an automated system, the Adaptive Frame Selection Engine (AFSE). This system dynamically selects the optimal prompt frames without requiring extensive prior medical knowledge and enhances the interpretability of the model's inference process through an interactive feedback mechanism. We conducted extensive experiments on 10 datasets covering 7 representative medical imaging modalities, demonstrating the SISeg model's robust adaptability and generalization in multi-modal tasks. The project page and code will be available at: [URL].

Title: Learning Visual Abstract Reasoning through Dual-Stream Networks

Authors: Kai Zhao, Chang Xu, Bailu Si
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19451
Pdf URL: https://arxiv.org/pdf/2411.19451
Copy Paste: [[2411.19451]] Learning Visual Abstract Reasoning through Dual-Stream Networks(https://arxiv.org/abs/2411.19451)
Keywords: robust
Abstract: Visual abstract reasoning tasks present challenges for deep neural networks, exposing limitations in their capabilities. In this work, we present a neural network model that addresses the challenges posed by Raven's Progressive Matrices (RPM). Inspired by the two-stream hypothesis of visual processing, we introduce the Dual-stream Reasoning Network (DRNet), which utilizes two parallel branches to capture image features. On top of the two streams, a reasoning module first learns to merge the high-level features of the same image. Then, it employs a rule extractor to handle combinations involving the eight context images and each candidate image, extracting discrete abstract rules and utilizing an multilayer perceptron (MLP) to make predictions. Empirical results demonstrate that the proposed DRNet achieves state-of-the-art average performance across multiple RPM benchmarks. Furthermore, DRNet demonstrates robust generalization capabilities, even extending to various out-of-distribution scenarios. The dual streams within DRNet serve distinct functions by addressing local or spatial information. They are then integrated into the reasoning module, leveraging abstract rules to facilitate the execution of visual reasoning tasks. These findings indicate that the dual-stream architecture could play a crucial role in visual abstract reasoning.

Title: Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability

Authors: Yujin Han, Lei Xu, Sirui Chen, Difan Zou, Chaochao Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19456
Pdf URL: https://arxiv.org/pdf/2411.19456
Copy Paste: [[2411.19456]] Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability(https://arxiv.org/abs/2411.19456)
Keywords: large language model
Abstract: Large language models (LLMs) have shown remarkable capability in natural language tasks, yet debate persists on whether they truly comprehend deep structure (i.e., core semantics) or merely rely on surface structure (e.g., presentation format). Prior studies observe that LLMs' performance declines when intervening on surface structure, arguing their success relies on surface structure recognition. However, surface structure sensitivity does not prevent deep structure comprehension. Rigorously evaluating LLMs' capability requires analyzing both, yet deep structure is often overlooked. To this end, we assess LLMs' comprehension ability using causal mediation analysis, aiming to fully discover the capability of using both deep and surface structures. Specifically, we formulate the comprehension of deep structure as direct causal effect (DCE) and that of surface structure as indirect causal effect (ICE), respectively. To address the non-estimability of original DCE and ICE -- stemming from the infeasibility of isolating mutual influences of deep and surface structures, we develop the corresponding quantifiable surrogates, including approximated DCE (ADCE) and approximated ICE (AICE). We further apply the ADCE to evaluate a series of mainstream LLMs, showing that most of them exhibit deep structure comprehension ability, which grows along with the prediction accuracy. Comparing ADCE and AICE demonstrates closed-source LLMs rely more on deep structure, while open-source LLMs are more surface-sensitive, which decreases with model scale. Theoretically, ADCE is a bidirectional evaluation, which measures both the sufficiency and necessity of deep structure changes in causing output variations, thus offering a more comprehensive assessment than accuracy, a common evaluation in LLMs. Our work provides new insights into LLMs' deep structure comprehension and offers novel methods for LLMs evaluation.

Title: Multi-task CNN Behavioral Embedding Model For Transaction Fraud Detection

Authors: Bo Qu, Zhurong Wang, Minghao Gu, Daisuke Yagi, Yang Zhao, Yinan Shan, Frank Zahradnik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19457
Pdf URL: https://arxiv.org/pdf/2411.19457
Copy Paste: [[2411.19457]] Multi-task CNN Behavioral Embedding Model For Transaction Fraud Detection(https://arxiv.org/abs/2411.19457)
Keywords: transformer
Abstract: The burgeoning e-Commerce sector requires advanced solutions for the detection of transaction fraud. With an increasing risk of financial information theft and account takeovers, deep learning methods have become integral to the embedding of behavior sequence data in fraud detection. However, these methods often struggle to balance modeling capabilities and efficiency and incorporate domain knowledge. To address these issues, we introduce the multitask CNN behavioral Embedding Model for Transaction Fraud Detection. Our contributions include 1) introducing a single-layer CNN design featuring multirange kernels which outperform LSTM and Transformer models in terms of scalability and domain-focused inductive bias, and 2) the integration of positional encoding with CNN to introduce sequence-order signals enhancing overall performance, and 3) implementing multitask learning with randomly assigned label weights, thus removing the need for manual tuning. Testing on real-world data reveals our model's enhanced performance of downstream transaction models and comparable competitiveness with the Transformer Time Series (TST) model.

Title: Fleximo: Towards Flexible Text-to-Human Motion Video Generation

Authors: Yuhang Zhang, Yuan Zhou, Zeyu Liu, Yuxuan Cai, Qiuyue Wang, Aidong Men, Huan Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19459
Pdf URL: https://arxiv.org/pdf/2411.19459
Copy Paste: [[2411.19459]] Fleximo: Towards Flexible Text-to-Human Motion Video Generation(https://arxiv.org/abs/2411.19459)
Keywords: large language model
Abstract: Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons may not consistently match the scale of the reference image and may lack detailed information. To overcome these challenges, we introduce an anchor point based rescale method and design a skeleton adapter to fill in missing details and bridge the gap between text-to-motion and motion-to-video generation. We also propose a video refinement process to further enhance video quality. A large language model (LLM) is employed to decompose natural language into discrete motion sequences, enabling the generation of motion videos of any desired length. To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions. We also propose a new metric, MotionScore, to evaluate the accuracy of motion following. Both qualitative and quantitative results demonstrate that our method outperforms existing text-conditioned image-to-video generation methods. All code and model weights will be made publicly available.

Title: Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

Authors: Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19460
Pdf URL: https://arxiv.org/pdf/2411.19460
Copy Paste: [[2411.19460]] Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing(https://arxiv.org/abs/2411.19460)
Keywords: transformer
Abstract: With the growing scale and complexity of video data, efficiently processing long video sequences poses significant challenges due to the quadratic increase in memory and computational demands associated with existing transformer-based Large Multi-modal Models (LMMs). To address these issues, we introduce Video-Ma$^2$mba, a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework, replacing the attention mechanisms. This allows the LMMs to scale linearly in terms of time and memory requirements, making it feasible to handle long-duration video content. Furthermore, we enhance the memory efficiency introducing the Multi-Axis Gradient Checkpointing (MA-GC) method, which strategically manages memory by retaining only essential activations across multiple computational axes. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. Empirical analyses show that Video-Ma$^2$mba can process extensive video sequences-equivalent to millions of tokens or over two hours of continuous sequences at 1 FPS-on a single GPU. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks, demonstrating substantial advantages over existing frameworks.

Title: Robust Bayesian Scene Reconstruction by Leveraging Retrieval-Augmented Priors

Authors: Herbert Wright, Weiming Zhi, Matthew Johnson-Roberson, Tucker Hermans
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.19461
Pdf URL: https://arxiv.org/pdf/2411.19461
Copy Paste: [[2411.19461]] Robust Bayesian Scene Reconstruction by Leveraging Retrieval-Augmented Priors(https://arxiv.org/abs/2411.19461)
Keywords: robust
Abstract: Constructing 3D representations of object geometry is critical for many downstream manipulation tasks. These representations must be built from potentially noisy partial observations. In this work we focus on the problem of reconstructing a multi-object scene from a single RGBD image. Current deep learning approaches to this problem can be brittle to noisy real world observations and out-of-distribution objects. Other approaches that do not rely on training data cannot accurately infer the backside of objects. We propose BRRP, a reconstruction method that can leverage preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. In order to make our method more efficient, we introduce the concept of retrieval-augmented prior, where we retrieve relevant components of our prior distribution during inference. Our method produces a distribution over object shape that can be used for reconstruction or measuring uncertainty. We evaluate our method in both procedurally generated scenes and in real world scenes. We show our method is more robust than a deep learning approach while being more accurate than a method with an uninformative prior.

Title: ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

Authors: Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19466
Pdf URL: https://arxiv.org/pdf/2411.19466
Copy Paste: [[2411.19466]] ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection(https://arxiv.org/abs/2411.19466)
Keywords: robust, explainability, large language model, segmentation
Abstract: Multimodal large language models have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection remains unexplored. When directly applied to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, in this work, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating specific regions that are tampered with. Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt, which includes analysis and reasoning text to upgrade the image manipulation detection task. A data engine is also introduced to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.

Title: Random Feature Models with Learnable Activation Functions

Authors: Zailin Ma, Jiansheng Yang, Yaodong Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19468
Pdf URL: https://arxiv.org/pdf/2411.19468
Copy Paste: [[2411.19468]] Random Feature Models with Learnable Activation Functions(https://arxiv.org/abs/2411.19468)
Keywords: interpretability
Abstract: Current random feature models typically rely on fixed activation functions, limiting their ability to capture diverse patterns in data. To address this, we introduce the Random Feature model with Learnable Activation Functions (RFLAF), a novel model that significantly enhances the expressivity and interpretability of traditional random feature (RF) models. We begin by studying the RF model with a single radial basis function, where we discover a new kernel and provide the first theoretical analysis on it. By integrating the basis functions with learnable weights, we show that RFLAF can represent a broad class of random feature models whose activation functions belong in $C_c(\mathbb{R})$. Theoretically, we prove that the model requires only about twice the parameter number compared to a traditional RF model to achieve the significant leap in expressivity. Experimentally, RFLAF demonstrates two key advantages: (1) it performs better across various tasks compared to traditional RF model with the same number of parameters, and (2) the optimized weights offer interpretability, as the learned activation function can be directly inferred from these weights. Our model paves the way for developing more expressive and interpretable frameworks within random feature models.

Title: A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19477
Pdf URL: https://arxiv.org/pdf/2411.19477
Copy Paste: [[2411.19477]] A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models(https://arxiv.org/abs/2411.19477)
Keywords: large language model
Abstract: We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates $N$ candidate solutions, and then chooses the best one via a multiple-round knockout tournament where each pair of candidates are compared for $K$ times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM calls are needed for solving an input problem. Assuming that a generated candidate solution is correct with probability $p_{\text{gen}} > 0$ and a comparison between a pair of correct and incorrect solutions identifies the right winner with probability $p_{\text{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to $N$ and $K$: $$\mathbb{P}(\text{final output is incorrect}) \le (1 - p_{\text{gen}})^N + \lceil \log_2 N \rceil e^{-2 K (p_{\text{comp}} - 0.5)^2}.$$ Our empirical results with the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.

Title: FLARE: Towards Universal Dataset Purification against Backdoor Attacks

Authors: Linshan Hou, Wei Luo, Zhongyun Hua, Songhua Chen, Leo Yu Zhang, Yiming Li
Subjects: cs.CR, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19479
Pdf URL: https://arxiv.org/pdf/2411.19479
Copy Paste: [[2411.19479]] FLARE: Towards Universal Dataset Purification against Backdoor Attacks(https://arxiv.org/abs/2411.19479)
Keywords: defense, attack, robust
Abstract: Deep neural networks (DNNs) are susceptible to backdoor attacks, where adversaries poison datasets with adversary-specified triggers to implant hidden backdoors, enabling malicious manipulation of model predictions. Dataset purification serves as a proactive defense by removing malicious training samples to prevent backdoor injection at its source. We first reveal that the current advanced purification methods rely on a latent assumption that the backdoor connections between triggers and target labels in backdoor attacks are simpler to learn than the benign features. We demonstrate that this assumption, however, does not always hold, especially in all-to-all (A2A) and untargeted (UT) attacks. As a result, purification methods that analyze the separation between the poisoned and benign samples in the input-output space or the final hidden layer space are less effective. We observe that this separability is not confined to a single layer but varies across different hidden layers. Motivated by this understanding, we propose FLARE, a universal purification method to counter various backdoor attacks. FLARE aggregates abnormal activations from all hidden layers to construct representations for clustering. To enhance separation, FLARE develops an adaptive subspace selection algorithm to isolate the optimal space for dividing an entire dataset into two clusters. FLARE assesses the stability of each cluster and identifies the cluster with higher stability as poisoned. Extensive evaluations on benchmark datasets demonstrate the effectiveness of FLARE against 22 representative backdoor attacks, including all-to-one (A2O), all-to-all (A2A), and untargeted (UT) attacks, and its robustness to adaptive attacks.

Title: V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Authors: Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19486
Pdf URL: https://arxiv.org/pdf/2411.19486
Copy Paste: [[2411.19486]] V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow(https://arxiv.org/abs/2411.19486)
Keywords: transformer
Abstract: In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.

Title: Interleaved-Modal Chain-of-Thought

Authors: Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19488
Pdf URL: https://arxiv.org/pdf/2411.19488
Copy Paste: [[2411.19488]] Interleaved-Modal Chain-of-Thought(https://arxiv.org/abs/2411.19488)
Keywords: interpretability, large language model
Abstract: Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named \textbf{Interleaved-modal Chain-of-Thought (ICoT)}, which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose \textbf{Attention-driven Selection (ADS)} to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with ignorable additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations of three benchmarks have shown that ICoT prompting achieves substantial performance (up to 14\%) and interpretability improvements compared to existing multimodal CoT prompting methods.

Title: Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling

Authors: Qirui Wu, Denys Iliash, Daniel Ritchie, Manolis Savva, Angel X. Chang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19492
Pdf URL: https://arxiv.org/pdf/2411.19492
Copy Paste: [[2411.19492]] Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling(https://arxiv.org/abs/2411.19492)
Keywords: robust
Abstract: Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce robust, generalizable solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to internet images and the text-to-scene task.

Title: COLD: Causal reasOning in cLosed Daily activities

Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19500
Pdf URL: https://arxiv.org/pdf/2411.19500
Copy Paste: [[2411.19500]] COLD: Causal reasOning in cLosed Daily activities(https://arxiv.org/abs/2411.19500)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown state-of-the-art performance in a variety of tasks, including arithmetic and reasoning; however, to gauge the intellectual capabilities of LLMs, causal reasoning has become a reliable proxy for validating a general understanding of the mechanics and intricacies of the world similar to humans. Previous works in natural language processing (NLP) have either focused on open-ended causal reasoning via causal commonsense reasoning (CCR) or framed a symbolic representation-based question answering for theoretically backed-up analysis via a causal inference engine. The former adds an advantage of real-world grounding but lacks theoretically backed-up analysis/validation, whereas the latter is far from real-world grounding. In this work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed Daily activities) framework, which is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of enormous causal queries (~ 9 million) and comes close to the mini-turing test, simulating causal reasoning to evaluate the understanding of a daily real-world task. We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore (the causal reasoning abilities of LLMs) using the backdoor criterion to determine the causal strength between events.

Title: Knowledge-Data Fusion Based Source-Free Semi-Supervised Domain Adaptation for Seizure Subtype Classification

Authors: Ruimin Peng, Jiayu An, Dongrui Wu
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2411.19502
Pdf URL: https://arxiv.org/pdf/2411.19502
Copy Paste: [[2411.19502]] Knowledge-Data Fusion Based Source-Free Semi-Supervised Domain Adaptation for Seizure Subtype Classification(https://arxiv.org/abs/2411.19502)
Keywords: privacy, transformer
Abstract: Electroencephalogram (EEG)-based seizure subtype classification enhances clinical diagnosis efficiency. Source-free semi-supervised domain adaptation (SF-SSDA), which transfers a pre-trained model to a new dataset with no source data and limited labeled target data, can be used for privacy-preserving seizure subtype classification. This paper considers two challenges in SF-SSDA for EEG-based seizure subtype classification: 1) How to effectively fuse both raw EEG data and expert knowledge in classifier design? 2) How to align the source and target domain distributions for SF-SSDA? We propose a Knowledge-Data Fusion based SF-SSDA approach, KDF-MutualSHOT, for EEG-based seizure subtype classification. In source model training, KDF uses Jensen-Shannon Divergence to facilitate mutual learning between a feature-driven Decision Tree-based model and a data-driven Transformer-based model. To adapt KDF to a new target dataset, an SF-SSDA algorithm, MutualSHOT, is developed, which features a consistency-based pseudo-label selection strategy. Experiments on the public TUSZ and CHSZ datasets demonstrated that KDF-MutualSHOT outperformed other supervised and source-free domain adaptation approaches in cross-subject seizure subtype classification.

Title: Graph-Enhanced EEG Foundation Model

Authors: Limin Wang, Toyotaro Suzumura, Hiroki Kanezashi
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.19507
Pdf URL: https://arxiv.org/pdf/2411.19507
Copy Paste: [[2411.19507]] Graph-Enhanced EEG Foundation Model(https://arxiv.org/abs/2411.19507)
Keywords: robust
Abstract: Electroencephalography (EEG) signals provide critical insights for applications in disease diagnosis and healthcare. However, the scarcity of labeled EEG data poses a significant challenge. Foundation models offer a promising solution by leveraging large-scale unlabeled data through pre-training, enabling strong performance across diverse tasks. While both temporal dynamics and inter-channel relationships are vital for understanding EEG signals, existing EEG foundation models primarily focus on the former, overlooking the latter. To address this limitation, we propose a novel foundation model for EEG that integrates both temporal and inter-channel information. Our architecture combines Graph Neural Networks (GNNs), which effectively capture relational structures, with a masked autoencoder to enable efficient pre-training. We evaluated our approach using three downstream tasks and experimented with various GNN architectures. The results demonstrate that our proposed model, particularly when employing the GCN architecture with optimized configurations, consistently outperformed baseline methods across all tasks. These findings suggest that our model serves as a robust foundation model for EEG analysis.

Title: Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Authors: Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19509
Pdf URL: https://arxiv.org/pdf/2411.19509
Copy Paste: [[2411.19509]] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis(https://arxiv.org/abs/2411.19509)
Keywords: extraction, diffusion
Abstract: Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.

Title: Retrieval-guided Cross-view Image Synthesis

Authors: Hongji Yang, Yiru Li, Yingying Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19510
Pdf URL: https://arxiv.org/pdf/2411.19510
Copy Paste: [[2411.19510]] Retrieval-guided Cross-view Image Synthesis(https://arxiv.org/abs/2411.19510)
Keywords: segmentation
Abstract: Cross-view image synthesis involves generating new images of a scene from different viewpoints or perspectives, given one input image from other viewpoints. Despite recent advancements, there are several limitations in existing methods: 1) reliance on additional data such as semantic segmentation maps or preprocessing modules to bridge the domain gap; 2) insufficient focus on view-specific semantics, leading to compromised image quality and realism; and 3) a lack of diverse datasets representing complex urban environments. To tackle these challenges, we propose: 1) a novel retrieval-guided framework that employs a retrieval network as an embedder to address the domain gap; 2) an innovative generator that enhances semantic consistency and diversity specific to the target view to improve image quality and realism; and 3) a new dataset, VIGOR-GEN, providing diverse cross-view image pairs in urban settings to enrich dataset diversity. Extensive experiments on well-known CVUSA, CVACT, and new VIGOR-GEN datasets demonstrate that our method generates images of superior realism, significantly outperforming current leading approaches, particularly in SSIM and FID evaluations.

Title: RL-MILP Solver: A Reinforcement Learning Approach for Solving Mixed-Integer Linear Programs with Graph Neural Networks

Authors: Tae-Hoon Lee, Min-Soo Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19517
Pdf URL: https://arxiv.org/pdf/2411.19517
Copy Paste: [[2411.19517]] RL-MILP Solver: A Reinforcement Learning Approach for Solving Mixed-Integer Linear Programs with Graph Neural Networks(https://arxiv.org/abs/2411.19517)
Keywords: transformer
Abstract: Mixed-Integer Linear Programming (MILP) is an optimization technique widely used in various fields. Primal heuristics, which reduce the search space of MILP, have enabled traditional solvers (e.g., Gurobi) to efficiently find high-quality solutions. However, traditional primal heuristics rely on expert knowledge, motivating the advent of machine learning (ML)-based primal heuristics that learn repetitive patterns in MILP. Nonetheless, existing ML-based primal heuristics do not guarantee solution feasibility (i.e., satisfying all constraints) and primarily focus on prediction for binary decision variables. When addressing MILP involving non-binary integer variables using ML-based approaches, feasibility issues can become even more pronounced. Since finding an optimal solution requires satisfying all constraints, addressing feasibility is critical. To overcome these limitations, we propose a novel reinforcement learning (RL)-based solver that interacts with MILP to find feasible solutions, rather than delegating sub-problems to traditional solvers. We design reward functions tailored for MILP, which enables the RL agent to learn relationships between decision variables and constraints. Additionally, to effectively model complex relationships among decision variables, we leverage a Transformer encoder-based graph neural network (GNN). Our experimental results demonstrate that the proposed method can solve MILP problems and find near-optimal solutions without delegating the remainder to traditional solvers. The proposed method provides a meaningful step forward as an initial study in solving MILP problems end-to-end based solely on ML.

Title: DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Authors: Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19527
Pdf URL: https://arxiv.org/pdf/2411.19527
Copy Paste: [[2411.19527]] DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding(https://arxiv.org/abs/2411.19527)
Keywords: robust, generative
Abstract: Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Despite their dominance, discrete quantization methods, such as VQ-VAEs, suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter due to high-dimensional complexity and limited training data. To resolve this "discord" between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in the continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results solidify DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: this https URL.

Title: RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation

Authors: Xianfeng Tan, Yuhan Li, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Ran Lin, Bingbing Ni
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19528
Pdf URL: https://arxiv.org/pdf/2411.19528
Copy Paste: [[2411.19528]] RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation(https://arxiv.org/abs/2411.19528)
Keywords: diffusion, generative
Abstract: Standard clothing asset generation involves creating forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized sampling distributions and precise structural requirements in the generated images. Existing models have limited spatial perception and often exhibit structural hallucinations in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating external knowledge from LLM and databases. RAGDiffusion consists of two core processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a three-level alignment that ensures fidelity in structural, pattern, and decoding components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and detail-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.

Title: Quantized Delta Weight Is Safety Keeper

Authors: Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19530
Pdf URL: https://arxiv.org/pdf/2411.19530
Copy Paste: [[2411.19530]] Quantized Delta Weight Is Safety Keeper(https://arxiv.org/abs/2411.19530)
Keywords: secure, security, attack, robust
Abstract: Recent advancements in fine-tuning proprietary language models enable customized applications across various domains but also introduce two major challenges: high resource demands and security risks. Regarding resource demands, recent work proposes novel partial compression, such as BitDelta, to quantize the delta weights between the fine-tuned model and base model. Regarding the security risks, user-defined fine-tuning can introduce security vulnerabilities, such as alignment issues, backdoor attacks, and hallucinations. However, most of the current efforts in security assessment focus on the full-precision or full-compression models, it is not well-discussed how the partial compression methods affect security concerns. To bridge this gap, we evaluate the robustness of delta-weight quantization against these security threats. In this paper, we uncover a "free lunch" phenomenon: partial compression can enhance model security against fine-tuning-based attacks with bearable utility loss. Using Llama-2-7b-chat as a case study, we show that, with under 10% utility degradation, the partial compression mitigates alignment-breaking risks by up to 66.17%, harmful backdoor vulnerabilities by 64.46%, and targeted output manipulation risks by up to 90.53%. We further apply LogitLens to visualize internal state transformations during forward passes, suggesting mechanisms for both security failure and recovery in standard versus compressed fine-tuning. This work offers new insights into selecting effective delta compression methods for secure, resource-efficient multi-tenant services.

Title: QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19534
Pdf URL: https://arxiv.org/pdf/2411.19534
Copy Paste: [[2411.19534]] QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain(https://arxiv.org/abs/2411.19534)
Keywords: generative
Abstract: We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.

Title: Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook

Authors: Florinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru, Nicolae Catalin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19537
Pdf URL: https://arxiv.org/pdf/2411.19537
Copy Paste: [[2411.19537]] Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook(https://arxiv.org/abs/2411.19537)
Keywords: robust, diffusion, generative
Abstract: With the recent advancements in generative modeling, the realism of deepfake content has been increasing at a steady pace, even reaching the point where people often fail to detect manipulated media content online, thus being deceived into various kinds of scams. In this paper, we survey deepfake generation and detection techniques, including the most recent developments in the field, such as diffusion models and Neural Radiance Fields. Our literature review covers all deepfake media types, comprising image, video, audio and multimodal (audio-visual) content. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We further construct a taxonomy of deepfake generation and detection methods, illustrating the important groups of methods and the domains where these methods are applied. Next, we gather datasets used for deepfake detection and provide updated rankings of the best performing deepfake detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfake content generated by unseen deepfake generators. Finally, we propose future directions to obtain robust and powerful deepfake detectors. Our project page and new benchmark are available at this https URL.

Title: SkelMamba: A State Space Model for Efficient Skeleton Action Recognition of Neurological Disorders

Authors: Niki Martinel, Mariano Serrao, Christian Micheloni
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19544
Pdf URL: https://arxiv.org/pdf/2411.19544
Copy Paste: [[2411.19544]] SkelMamba: A State Space Model for Efficient Skeleton Action Recognition of Neurological Disorders(https://arxiv.org/abs/2411.19544)
Keywords: transformer
Abstract: We introduce a novel state-space model (SSM)-based framework for skeleton-based human action recognition, with an anatomically-guided architecture that improves state-of-the-art performance in both clinical diagnostics and general action recognition tasks. Our approach decomposes skeletal motion analysis into spatial, temporal, and spatio-temporal streams, using channel partitioning to capture distinct movement characteristics efficiently. By implementing a structured, multi-directional scanning strategy within SSMs, our model captures local joint interactions and global motion patterns across multiple anatomical body parts. This anatomically-aware decomposition enhances the ability to identify subtle motion patterns critical in medical diagnosis, such as gait anomalies associated with neurological conditions. On public action recognition benchmarks, i.e., NTU RGB+D, NTU RGB+D 120, and NW-UCLA, our model outperforms current state-of-the-art methods, achieving accuracy improvements up to $3.2\%$ with lower computational complexity than previous leading transformer-based models. We also introduce a novel medical dataset for motion-based patient neurological disorder analysis to validate our method's potential in automated disease diagnosis.

Title: Training Agents with Weakly Supervised Feedback from Large Language Models

Authors: Dihong Gong, Pu Lu, Zelong Wang, Meng Zhou, Xiuqiang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19547
Pdf URL: https://arxiv.org/pdf/2411.19547
Copy Paste: [[2411.19547]] Training Agents with Weakly Supervised Feedback from Large Language Models(https://arxiv.org/abs/2411.19547)
Keywords: large language model
Abstract: Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning which limits their application to specific scenarios like gaming or code generation. This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in iterative manner, where they initially generate trajectories through environmental interaction. Subsequently, a critic LLM selects a subset of good trajectories, which are then used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-bank dataset show consistent improvement in our agents' capabilities and comparable performance to GPT-4, despite using open-source models with much fewer parameters.

Title: Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding

Authors: Wenbo Zhang, Lu Zhang, Ping Hu, Liqian Ma, Yunzhi Zhuge, Huchuan Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19551
Pdf URL: https://arxiv.org/pdf/2411.19551
Copy Paste: [[2411.19551]] Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding(https://arxiv.org/abs/2411.19551)
Keywords: segmentation
Abstract: Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered significant attention. While current approaches typically distill 3D semantic features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel view segmentation and semantic understanding, their heavy reliance on 2D supervision can undermine cross-view semantic consistency and necessitate complex data preparation processes, therefore hindering view-consistent scene understanding. In this work, we present FreeGS, an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. Instead of directly learning semantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view-consistent instance indices for each Gaussian. We optimize IDSF with a two-step alternating strategy: semantics help to extract coherent instances in 3D space, while the resulting instances regularize the injection of stable semantics from 2D space. Additionally, we adopt a 2D-3D joint contrastive loss to enhance the complementarity between view-consistent 3D geometry and rich semantics during the bootstrapping process, enabling FreeGS to uniformly perform tasks such as novel-view semantic segmentation, object selection, and 3D object detection. Extensive experiments on LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.

Title: Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning

Authors: Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19557
Pdf URL: https://arxiv.org/pdf/2411.19557
Copy Paste: [[2411.19557]] Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning(https://arxiv.org/abs/2411.19557)
Keywords: large language model
Abstract: Low-rank adapters have become a standard approach for efficiently fine-tuning large language models (LLMs), but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B and A while keeping other matrices fixed, provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for hyperparameter tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical reasoning, commonsense reasoning, and language understanding tasks demonstrate that our approach exceeds the performance of standard LoRA while using 27-90x fewer parameters, and comprehensively outperforms LoRA-XS. Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces, and achieve significant efficiency gains without sacrificing performance. Our code is publicly available at this https URL.

Title: Ensemble Watermarks for Large Language Models

Authors: Georg Niess, Roman Kern
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19563
Pdf URL: https://arxiv.org/pdf/2411.19563
Copy Paste: [[2411.19563]] Ensemble Watermarks for Large Language Models(https://arxiv.org/abs/2411.19563)
Keywords: attack, watermark, large language model
Abstract: The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. While watermarks already exist for LLMs, they often lack flexibility, and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method for generating watermarks that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack the performance remains high with 95% detection rate. The red-green feature alone as baseline achieves a detection rate of 49%. The evaluation of all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several LLMs and watermark strength settings. Due to the flexibility of combining features in the ensemble, various requirements and trade-offs can be addressed. Additionally, for all ensemble configurations the same detection function can be used without adaptations. This method is particularly of interest to facilitate accountability and prevent societal harm.

Title: KV Shifting Attention Enhances Language Modeling

Authors: Mingyu Xu, Wei Cheng, Bingning Wang, Weipeng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19574
Pdf URL: https://arxiv.org/pdf/2411.19574
Copy Paste: [[2411.19574]] KV Shifting Attention Enhances Language Modeling(https://arxiv.org/abs/2411.19574)
Keywords: transformer, large language model
Abstract: The current large language models are mainly based on decode-only structure transformers, which have great in-context learning (ICL) capabilities. It is generally believed that the important foundation of its ICL capability is the induction heads mechanism, which requires at least two layers attention. In order to more efficiently implement the ability of the model's induction, we revisit the induction heads mechanism and proposed a KV shifting attention. We theoretically prove that the KV shifting attention reducing the model's requirements for the depth and width of the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and language modeling, which lead to better performance or faster convergence from toy models to the pre-training models with more than 10 B parameters.

Title: In-Context Learning with Noisy Labels

Authors: Junyong Kang, Donghyun Son, Hwanjun Song, Buru Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19581
Pdf URL: https://arxiv.org/pdf/2411.19581
Copy Paste: [[2411.19581]] In-Context Learning with Noisy Labels(https://arxiv.org/abs/2411.19581)
Keywords: large language model
Abstract: In-context learning refers to the emerging ability of large language models (LLMs) to perform a target task without additional training, utilizing demonstrations of the task. Recent studies aim to enhance in-context learning performance by selecting more useful demonstrations. However, they overlook the presence of inevitable noisy labels in task demonstrations that arise during the labeling process in the real-world. In this paper, we propose a new task, in-context learning with noisy labels, which aims to solve real-world problems for in-context learning where labels in task demonstrations would be corrupted. Moreover, we propose a new method and baseline methods for the new task, inspired by studies in learning with noisy labels. Through experiments, we demonstrate that our proposed method can serve as a safeguard against performance degradation in in-context learning caused by noisy labels.

Title: Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT

Authors: Hemal Mahmud, Hasan Mahmud
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19584
Pdf URL: https://arxiv.org/pdf/2411.19584
Copy Paste: [[2411.19584]] Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT(https://arxiv.org/abs/2411.19584)
Keywords: transformer
Abstract: Sentiment analysis (SA) is a process of identifying the emotional tone or polarity within a given text and aims to uncover the user's complex emotions and inner feelings. While sentiment analysis has been extensively studied for languages like English, research in Bengali, remains limited, particularly for fine-grained sentiment categorization. This work aims to connect this gap by developing a novel approach that integrates rule-based algorithms with pre-trained language models. We developed a dataset from scratch, comprising over 15,000 manually labeled reviews. Next, we constructed a Lexicon Data Dictionary, assigning polarity scores to the reviews. We developed a novel rule based algorithm Bangla Sentiment Polarity Score (BSPS), an approach capable of generating sentiment scores and classifying reviews into nine distinct sentiment categories. To assess the performance of this method, we evaluated the classified sentiments using BanglaBERT, a pre-trained transformer-based language model. We also performed sentiment classification directly with BanglaBERT on the original data and evaluated this model's results. Our analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the standalone BanglaBERT model, achieving higher accuracy, precision, and nuanced classification across the nine sentiment categories. The results of our study emphasize the value and effectiveness of combining rule-based and pre-trained language model approaches for enhanced sentiment analysis in Bengali and suggest pathways for future research and application in languages with similar linguistic complexities.

Title: LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention

Authors: Zewen Du, Zhenjiang Hu, Guiyu Zhao, Ying Jin, Hongbin Ma
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19585
Pdf URL: https://arxiv.org/pdf/2411.19585
Copy Paste: [[2411.19585]] LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention(https://arxiv.org/abs/2411.19585)
Keywords: segmentation
Abstract: Feature upsampling is an essential operation in constructing deep convolutional neural networks. However, existing upsamplers either lack specific feature guidance or necessitate the utilization of high-resolution feature maps, resulting in a loss of performance and flexibility. In this paper, we find that the local self-attention naturally has the feature guidance capability, and its computational paradigm aligns closely with the essence of feature upsampling (\ie feature reassembly of neighboring points). Therefore, we introduce local self-attention into the upsampling task and demonstrate that the majority of existing upsamplers can be regarded as special cases of upsamplers based on local self-attention. Considering the potential semantic gap between upsampled points and their neighboring points, we further introduce the deformation mechanism into the upsampler based on local self-attention, thereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU utilizes the feature of queries to guide the model in adaptively adjusting the position and aggregation weight of neighboring points, thereby meeting the upsampling requirements across various complex scenarios. In addition, LDA-AQU is lightweight and can be easily integrated into various model architectures. We evaluate the effectiveness of LDA-AQU across four dense prediction tasks: object detection, instance segmentation, panoptic segmentation, and semantic segmentation. LDA-AQU consistently outperforms previous state-of-the-art upsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and 2.5 mIoU compared to the baseline models in the aforementioned four tasks, respectively. Code is available at \url{this https URL}.

Title: Gaussian Splashing: Direct Volumetric Rendering Underwater

Authors: Nir Mualem, Roy Amoyal, Oren Freifeld, Derya Akkaynak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19588
Pdf URL: https://arxiv.org/pdf/2411.19588
Copy Paste: [[2411.19588]] Gaussian Splashing: Direct Volumetric Rendering Underwater(https://arxiv.org/abs/2411.19588)
Keywords: robust
Abstract: In underwater images, most useful features are occluded by water. The extent of the occlusion depends on imaging geometry and can vary even across a sequence of burst images. As a result, 3D reconstruction methods robust on in-air scenes, like Neural Radiance Field methods (NeRFs) or 3D Gaussian Splatting (3DGS), fail on underwater scenes. While a recent underwater adaptation of NeRFs achieved state-of-the-art results, it is impractically slow: reconstruction takes hours and its rendering rate, in frames per second (FPS), is less than 1. Here, we present a new method that takes only a few minutes for reconstruction and renders novel underwater scenes at 140 FPS. Named Gaussian Splashing, our method unifies the strengths and speed of 3DGS with an image formation model for capturing scattering, introducing innovations in the rendering and depth estimation procedures and in the 3DGS loss function. Despite the complexities of underwater adaptation, our method produces images at unparalleled speeds with superior details. Moreover, it reveals distant scene details with far greater clarity than other methods, dramatically improving reconstructed and rendered images. We demonstrate results on existing datasets and a new dataset we have collected. Additional visual results are available at: this https URL .

Title: Can Large Language Models Reason about the Region Connection Calculus?

Authors: Anthony G Cohn, Robert E Blackwell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19589
Pdf URL: https://arxiv.org/pdf/2411.19589
Copy Paste: [[2411.19589]] Can Large Language Models Reason about the Region Connection Calculus?(https://arxiv.org/abs/2411.19589)
Keywords: large language model
Abstract: Qualitative Spatial Reasoning is a well explored area of Knowledge Representation and Reasoning and has multiple applications ranging from Geographical Information Systems to Robotics and Computer Vision. Recently, many claims have been made for the reasoning capabilities of Large Language Models (LLMs). Here, we investigate the extent to which a set of representative LLMs can perform classical qualitative spatial reasoning tasks on the mereotopological Region Connection Calculus, RCC-8. We conduct three pairs of experiments (reconstruction of composition tables, alignment to human composition preferences, conceptual neighbourhood reconstruction) using state-of-the-art LLMs; in each pair one experiment uses eponymous relations and one, anonymous relations (to test the extent to which the LLM relies on knowledge about the relation names obtained during training). All instances are repeated 30 times to measure the stochasticity of the LLMs.

Title: Tortho-Gaussian: Splatting True Digital Orthophoto Maps

Authors: Xin Wang, Wendi Zhang, Hong Xie, Haibin Ai, Qiangqiang Yuan, Zongqian Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19594
Pdf URL: https://arxiv.org/pdf/2411.19594
Copy Paste: [[2411.19594]] Tortho-Gaussian: Splatting True Digital Orthophoto Maps(https://arxiv.org/abs/2411.19594)
Keywords: robust
Abstract: True Digital Orthophoto Maps (TDOMs) are essential products for digital twins and Geographic Information Systems (GIS). Traditionally, TDOM generation involves a complex set of traditional photogrammetric process, which may deteriorate due to various challenges, including inaccurate Digital Surface Model (DSM), degenerated occlusion detections, and visual artifacts in weak texture regions and reflective surfaces, etc. To address these challenges, we introduce TOrtho-Gaussian, a novel method inspired by 3D Gaussian Splatting (3DGS) that generates TDOMs through orthogonal splatting of optimized anisotropic Gaussian kernel. More specifically, we first simplify the orthophoto generation by orthographically splatting the Gaussian kernels onto 2D image planes, formulating a geometrically elegant solution that avoids the need for explicit DSM and occlusion detection. Second, to produce TDOM of large-scale area, a divide-and-conquer strategy is adopted to optimize memory usage and time efficiency of training and rendering for 3DGS. Lastly, we design a fully anisotropic Gaussian kernel that adapts to the varying characteristics of different regions, particularly improving the rendering quality of reflective surfaces and slender structures. Extensive experimental evaluations demonstrate that our method outperforms existing commercial software in several aspects, including the accuracy of building boundaries, the visual quality of low-texture regions and building facades. These results underscore the potential of our approach for large-scale urban scene reconstruction, offering a robust alternative for enhancing TDOM quality and scalability.

Title: FairDD: Fair Dataset Distillation via Synchronized Matching

Authors: Qihang Zhou, Shenhao Fang, Shibo He, Wenchao Meng, Jiming Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19623
Pdf URL: https://arxiv.org/pdf/2411.19623
Copy Paste: [[2411.19623]] FairDD: Fair Dataset Distillation via Synchronized Matching(https://arxiv.org/abs/2411.19623)
Keywords: protect, fair
Abstract: Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches, requiring no modifications to their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This synchronized matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DD methods, without sacrificing classification accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach.

Title: Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Authors: Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2411.19628
Pdf URL: https://arxiv.org/pdf/2411.19628
Copy Paste: [[2411.19628]] Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings(https://arxiv.org/abs/2411.19628)
Keywords: large language model
Abstract: The excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes to play. (iii) Multimodal reasoning} resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a bunch of benchmarks. The experiment results not only show the effectiveness of our VTE in improving MLLMs' efficiency, but also yield the general modeling patterns of MLLMs, well facilitating the in-depth understanding of MLLMs. Our code is anonymously released at this https URL.

Title: LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Authors: Taja Kuzman, Nikola Ljubešić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19638
Pdf URL: https://arxiv.org/pdf/2411.19638
Copy Paste: [[2411.19638]] LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification(https://arxiv.org/abs/2411.19638)
Keywords: transformer, generative, large language model
Abstract: With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.

Title: Learned Random Label Predictions as a Neural Network Complexity Metric

Authors: Marlon Becker, Benjamin Risse
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19640
Pdf URL: https://arxiv.org/pdf/2411.19640
Copy Paste: [[2411.19640]] Learned Random Label Predictions as a Neural Network Complexity Metric(https://arxiv.org/abs/2411.19640)
Keywords: extraction, fair
Abstract: We empirically investigate the impact of learning randomly generated labels in parallel to class labels in supervised learning on memorization, model complexity, and generalization in deep neural networks. To this end, we introduce a multi-head network architecture as an extension of standard CNN architectures. Inspired by methods used in fair AI, our approach allows for the unlearning of random labels, preventing the network from memorizing individual samples. Based on the concept of Rademacher complexity, we first use our proposed method as a complexity metric to analyze the effects of common regularization techniques and challenge the traditional understanding of feature extraction and classification in CNNs. Second, we propose a novel regularizer that effectively reduces sample memorization. However, contrary to the predictions of classical statistical learning theory, we do not observe improvements in generalization.

Title: CAdam: Confidence-Based Optimization for Online Learning

Authors: Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Jian Li
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.19647
Pdf URL: https://arxiv.org/pdf/2411.19647
Copy Paste: [[2411.19647]] CAdam: Confidence-Based Optimization for Online Learning(https://arxiv.org/abs/2411.19647)
Keywords: robust
Abstract: Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ($m_t$) and adaptive learning rate ($v_t$). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noises, poses significant challenges to Adam's standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistence between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam's original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish between the true distributional shifts and mere noise, and adapt more quickly to new data distributions. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers, including the original Adam, in efficiency and noise robustness. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).

Title: Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing

Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19652
Pdf URL: https://arxiv.org/pdf/2411.19652
Copy Paste: [[2411.19652]] Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing(https://arxiv.org/abs/2411.19652)
Keywords: robust, diffusion
Abstract: Text-guided image generation and editing using diffusion models have achieved remarkable advancements. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering simplicity and efficiency. However, existing tuning-free approaches often struggle with balancing fidelity and editing precision. Reconstruction errors in DDIM Inversion are partly attributed to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at this https URL.

Title: TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting

Authors: Bojun Xiong, Jialun Liu, Jiakui Hu, Chenming Wu, Jinbo Wu, Xing Liu, Chen Zhao, Errui Ding, Zhouhui Lian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19654
Pdf URL: https://arxiv.org/pdf/2411.19654
Copy Paste: [[2411.19654]] TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting(https://arxiv.org/abs/2411.19654)
Keywords: diffusion
Abstract: Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. Developing an effective and efficient algorithm that is capable of automatically generating high-quality PBR materials rather than RGB texture for 3D meshes can significantly streamline the 3D content creation. Most existing methods leverage pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and input 3D meshes. This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render the multiview images not only for the albedo map but also for roughness and metallic. Moreover, our model is trained in a regression manner instead of diffusion denoising, capable of generating the PBR material for a 3D mesh in a single feed-forward process. Extensive experiments on publicly available benchmarks demonstrate that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, which exhibit better consistency with the given geometry. Our code and trained models are available at this https URL.

Title: Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS

Authors: Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19655
Pdf URL: https://arxiv.org/pdf/2411.19655
Copy Paste: [[2411.19655]] Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS(https://arxiv.org/abs/2411.19655)
Keywords: large language model
Abstract: After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.

Title: ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Authors: Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19668
Pdf URL: https://arxiv.org/pdf/2411.19668
Copy Paste: [[2411.19668]] ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information(https://arxiv.org/abs/2411.19668)
Keywords: large language model
Abstract: During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicit noisy texts from raw contents. Second, the quality evaluation model, domain classifier, and toxicity evaluation model are well-designed to assess the remaining cleaned data respectively. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release the largest, high-quality and fine-grained Chinese text ChineseWebText2.0, which consists of 3.8TB and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, facilitating the LLM researchers to select data based on various types of fine-grained information. The data, codes and the tool-chain are available on this website this https URL

Title: Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

Authors: Siqing Zhang, Yuchen Ding, Wei Tang, Wei Sun, Yong Liao, Peng Yuan Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19678
Pdf URL: https://arxiv.org/pdf/2411.19678
Copy Paste: [[2411.19678]] Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation(https://arxiv.org/abs/2411.19678)
Keywords: secure, privacy, protect, federate, fair
Abstract: Under stringent privacy constraints, whether federated recommendation systems can achieve group fairness remains an inadequately explored question. Taking gender fairness as a representative issue, we identify three phenomena in federated recommendation systems: performance difference, data imbalance, and preference disparity. We discover that the state-of-the-art methods only focus on the first phenomenon. Consequently, their imposition of inappropriate fairness constraints detrimentally affects the model training. Moreover, due to insufficient sensitive attribute protection of existing works, we can infer the gender of all users with 99.90% accuracy even with the addition of maximal noise. In this work, we propose Privacy-Preserving Orthogonal Aggregation (PPOA), which employs the secure aggregation scheme and quantization technique, to prevent the suppression of minority groups by the majority and preserve the distinct preferences for better group fairness. PPOA can assist different groups in obtaining their respective model aggregation results through a designed orthogonal mapping while keeping their attributes private. Experimental results on three real-world datasets demonstrate that PPOA enhances recommendation effectiveness for both females and males by up to 8.25% and 6.36%, respectively, with a maximum overall improvement of 7.30%, and achieves optimal fairness in most cases. Extensive ablation experiments and visualizations indicate that PPOA successfully maintains preferences for different gender groups.

Title: SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks

Authors: Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19688
Pdf URL: https://arxiv.org/pdf/2411.19688
Copy Paste: [[2411.19688]] SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks(https://arxiv.org/abs/2411.19688)
Keywords: robust, interpretability, large language model
Abstract: Vision-Language Models (VLMs) have great potential in medical tasks, like Visual Question Answering (VQA), where they could act as interactive assistants for both patients and clinicians. Yet their robustness to distribution shifts on unseen data remains a critical concern for safe deployment. Evaluating such robustness requires a controlled experimental setup that allows for systematic insights into the model's behavior. However, we demonstrate that current setups fail to offer sufficiently thorough evaluations, limiting their ability to accurately assess model robustness. To address this gap, our work introduces a novel framework, called SURE-VQA, centered around three key requirements to overcome the current pitfalls and systematically analyze the robustness of VLMs: 1) Since robustness on synthetic shifts does not necessarily translate to real-world shifts, robustness should be measured on real-world shifts that are inherent to the VQA data; 2) Traditional token-matching metrics often fail to capture underlying semantics, necessitating the use of large language models (LLMs) for more accurate semantic evaluation; 3) Model performance often lacks interpretability due to missing sanity baselines, thus meaningful baselines should be reported that allow assessing the multimodal impact on the VLM. To demonstrate the relevance of this framework, we conduct a study on the robustness of various fine-tuning methods across three medical datasets with four different types of distribution shifts. Our study reveals several important findings: 1) Sanity baselines that do not utilize image data can perform surprisingly well; 2) We confirm LoRA as the best-performing PEFT method; 3) No PEFT method consistently outperforms others in terms of robustness to shifts. Code is provided at this https URL.

Title: MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks

Authors: John Francis, Saba Esnaashari, Anton Poletaev, Sukankana Chakraborty, Youmna Hashem, Jonathan Bright
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19689
Pdf URL: https://arxiv.org/pdf/2411.19689
Copy Paste: [[2411.19689]] MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks(https://arxiv.org/abs/2411.19689)
Keywords: extraction, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in text analysis tasks, yet their evaluation on complex, real-world applications remains challenging. We define a set of tasks, Multi-Insight Multi-Document Extraction (MIMDE) tasks, which involves extracting an optimal set of insights from a document corpus and mapping these insights back to their source documents. This task is fundamental to many practical applications, from analyzing survey responses to processing medical records, where identifying and tracing key insights across documents is crucial. We develop an evaluation framework for MIMDE and introduce a novel set of complementary human and synthetic datasets to examine the potential of synthetic data for LLM evaluation. After establishing optimal metrics for comparing extracted insights, we benchmark 20 state-of-the-art LLMs on both datasets. Our analysis reveals a strong correlation (0.71) between the ability of LLMs to extracts insights on our two datasets but synthetic data fails to capture the complexity of document-level analysis. These findings offer crucial guidance for the use of synthetic data in evaluating text analysis systems, highlighting both its potential and limitations.

Title: Explaining the Impact of Training on Vision Models via Activation Clustering

Authors: Ahcène Boubekki, Samuel G. Fadel, Sebastian Mair
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19700
Pdf URL: https://arxiv.org/pdf/2411.19700
Copy Paste: [[2411.19700]] Explaining the Impact of Training on Vision Models via Activation Clustering(https://arxiv.org/abs/2411.19700)
Keywords: watermark, transformer
Abstract: Recent developments in the field of explainable artificial intelligence (XAI) for vision models investigate the information extracted by their feature encoder. We contribute to this effort and propose Neuro-Activated Vision Explanations (NAVE), which extracts the information captured by the encoder by clustering the feature activations of the frozen network to be explained. The method does not aim to explain the model's prediction but to answer questions such as which parts of the image are processed similarly or which information is kept in deeper layers. Experimentally, we leverage NAVE to show that the training dataset and the level of supervision affect which concepts are captured. In addition, our method reveals the impact of registers on vision transformers (ViT) and the information saturation caused by the watermark Clever Hans effect in the training set.

Title: JetFormer: An Autoregressive Generative Model of Raw Images and Text

Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19722
Pdf URL: https://arxiv.org/pdf/2411.19722
Copy Paste: [[2411.19722]] JetFormer: An Autoregressive Generative Model of Raw Images and Text(https://arxiv.org/abs/2411.19722)
Keywords: robust, transformer, generative
Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds.

Title: Towards Santali Linguistic Inclusion: Building the First Santali-to-English Translation Model using mT5 Transformer and Data Augmentation

Authors: Syed Mohammed Mostaque Billah, Ateya Ahmed Subarna, Sudipta Nandi Sarna, Ahmad Shawkat Wasit, Anika Fariha, Asif Sushmit, Arig Yousuf Sadeque
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19726
Pdf URL: https://arxiv.org/pdf/2411.19726
Copy Paste: [[2411.19726]] Towards Santali Linguistic Inclusion: Building the First Santali-to-English Translation Model using mT5 Transformer and Data Augmentation(https://arxiv.org/abs/2411.19726)
Keywords: transformer
Abstract: Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak Santali, positioning it as nearly the third most commonly used Austroasiatic language. Despite its prominence among the Austroasiatic language family's Munda subfamily, Santali lacks global recognition. Currently, no translation models exist for the Santali language. Our paper aims to include Santali to the NPL spectrum. We aim to examine the feasibility of building Santali translation models based on available Santali corpora. The paper successfully addressed the low-resource problem and, with promising results, examined the possibility of creating a functional Santali machine translation model in a low-resource setup. Our study shows that Santali-English parallel corpus performs better when in transformers like mt5 as opposed to untrained transformers, proving that transfer learning can be a viable technique that works with Santali language. Besides the mT5 transformer, Santali-English performs better than Santali-Bangla parallel corpus as the mT5 has been trained in way more English data than Bangla data. Lastly, our study shows that with data augmentation, our model performs better.

Title: Risk-Averse Certification of Bayesian Neural Networks

Authors: Xiyue Zhang, Zifan Wang, Yulong Gao, Licio Romao, Alessandro Abate, Marta Kwiatkowska
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19729
Pdf URL: https://arxiv.org/pdf/2411.19729
Copy Paste: [[2411.19729]] Risk-Averse Certification of Bayesian Neural Networks(https://arxiv.org/abs/2411.19729)
Keywords: robust
Abstract: In light of the inherently complex and dynamic nature of real-world environments, incorporating risk measures is crucial for the robustness evaluation of deep learning models. In this work, we propose a Risk-Averse Certification framework for Bayesian neural networks called RAC-BNN. Our method leverages sampling and optimisation to compute a sound approximation of the output set of a BNN, represented using a set of template polytopes. To enhance robustness evaluation, we integrate a coherent distortion risk measure--Conditional Value at Risk (CVaR)--into the certification framework, providing probabilistic guarantees based on empirical distributions obtained through sampling. We validate RAC-BNN on a range of regression and classification benchmarks and compare its performance with a state-of-the-art method. The results show that RAC-BNN effectively quantifies robustness under worst-performing risky scenarios, and achieves tighter certified bounds and higher efficiency in complex tasks.

Title: Real-Time Anomaly Detection in Video Streams

Authors: Fabien Poirier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19731
Pdf URL: https://arxiv.org/pdf/2411.19731
Copy Paste: [[2411.19731]] Real-Time Anomaly Detection in Video Streams(https://arxiv.org/abs/2411.19731)
Keywords: interpretability
Abstract: This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural networkmodels have been tested, and three of them have been selected. You Only Looks Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neuronal Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.

Title: A Note on Small Percolating Sets on Hypercubes via Generative AI

Authors: Gergely Bérczi, Adam Zsolt Wagner
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2411.19734
Pdf URL: https://arxiv.org/pdf/2411.19734
Copy Paste: [[2411.19734]] A Note on Small Percolating Sets on Hypercubes via Generative AI(https://arxiv.org/abs/2411.19734)
Keywords: generative
Abstract: We apply a generative AI pattern-recognition technique called PatternBoost to study bootstrap percolation on hypercubes. With this, we slightly improve the best existing upper bound for the size of percolating subsets of the hypercube.

Title: Graph Neural Networks for Heart Failure Prediction on an EHR-Based Patient Similarity Graph

Authors: Heloisa Oss Boll, Ali Amirahmadi, Amira Soliman, Stefan Byttner, Mariana Recamonde-Mendoza
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19742
Pdf URL: https://arxiv.org/pdf/2411.19742
Copy Paste: [[2411.19742]] Graph Neural Networks for Heart Failure Prediction on an EHR-Based Patient Similarity Graph(https://arxiv.org/abs/2411.19742)
Keywords: interpretability, transformer
Abstract: Objective: In modern healthcare, accurately predicting diseases is a crucial matter. This study introduces a novel approach using graph neural networks (GNNs) and a Graph Transformer (GT) to predict the incidence of heart failure (HF) on a patient similarity graph at the next hospital visit. Materials and Methods: We used electronic health records (EHR) from the MIMIC-III dataset and applied the K-Nearest Neighbors (KNN) algorithm to create a patient similarity graph using embeddings from diagnoses, procedures, and medications. Three models - GraphSAGE, Graph Attention Network (GAT), and Graph Transformer (GT) - were implemented to predict HF incidence. Model performance was evaluated using F1 score, AUROC, and AUPRC metrics, and results were compared against baseline algorithms. An interpretability analysis was performed to understand the model's decision-making process. Results: The GT model demonstrated the best performance (F1 score: 0.5361, AUROC: 0.7925, AUPRC: 0.5168). Although the Random Forest (RF) baseline achieved a similar AUPRC value, the GT model offered enhanced interpretability due to the use of patient relationships in the graph structure. A joint analysis of attention weights, graph connectivity, and clinical features provided insight into model predictions across different classification groups. Discussion and Conclusion: Graph-based approaches such as GNNs provide an effective framework for predicting HF. By leveraging a patient similarity graph, GNNs can capture complex relationships in EHR data, potentially improving prediction accuracy and clinical interpretability.

Title: HVAC-DPT: A Decision Pretrained Transformer for HVAC Control

Authors: Anaïs Berkes
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2411.19746
Pdf URL: https://arxiv.org/pdf/2411.19746
Copy Paste: [[2411.19746]] HVAC-DPT: A Decision Pretrained Transformer for HVAC Control(https://arxiv.org/abs/2411.19746)
Keywords: transformer
Abstract: Building operations consume approximately 40% of global energy, with Heating, Ventilation, and Air Conditioning (HVAC) systems responsible for up to 50% of this consumption. As HVAC energy demands are expected to rise, optimising system efficiency is crucial for reducing future energy use and mitigating climate change. Existing control strategies lack generalisation and require extensive training and data, limiting their rapid deployment across diverse buildings. This paper introduces HVAC-DPT, a Decision-Pretrained Transformer using in-context Reinforcement Learning (RL) for multi-zone HVAC control. HVAC-DPT frames HVAC control as a sequential prediction task, training a causal transformer on interaction histories generated by diverse RL agents. This approach enables HVAC-DPT to refine its policy in-context, without modifying network parameters, allowing for deployment across different buildings without the need for additional training or data collection. HVAC-DPT reduces energy consumption in unseen buildings by 45% compared to the baseline controller, offering a scalable and effective approach to mitigating the increasing environmental impact of HVAC systems.

Title: A Multi-Loss Strategy for Vehicle Trajectory Prediction: Combining Off-Road, Diversity, and Directional Consistency Losses

Authors: Ahmad Rahimi, Alexandre Alahi
Subjects: cs.CV, cs.AI, cs.LG, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2411.19747
Pdf URL: https://arxiv.org/pdf/2411.19747
Copy Paste: [[2411.19747]] A Multi-Loss Strategy for Vehicle Trajectory Prediction: Combining Off-Road, Diversity, and Directional Consistency Losses(https://arxiv.org/abs/2411.19747)
Keywords: attack, robust
Abstract: Trajectory prediction is essential for the safety and efficiency of planning in autonomous vehicles. However, current models often fail to fully capture complex traffic rules and the complete range of potential vehicle movements. Addressing these limitations, this study introduces three novel loss functions: Offroad Loss, Direction Consistency Error, and Diversity Loss. These functions are designed to keep predicted paths within driving area boundaries, aligned with traffic directions, and cover a wider variety of plausible driving scenarios. As all prediction modes should adhere to road rules and conditions, this work overcomes the shortcomings of traditional "winner takes all" training methods by applying the loss functions to all prediction modes. These loss functions not only improve model training but can also serve as metrics for evaluating the realism and diversity of trajectory predictions. Extensive validation on the nuScenes and Argoverse 2 datasets with leading baseline models demonstrates that our approach not only maintains accuracy but significantly improves safety and robustness, reducing offroad errors on average by 47% on original and by 37% on attacked scenes. This work sets a new benchmark for trajectory prediction in autonomous driving, offering substantial improvements in navigating complex environments. Our code is available at this https URL .

Title: A Comprehensive Content Verification System for ensuring Digital Integrity in the Age of Deep Fakes

Authors: RaviKanth Kaja
Subjects: cs.CR, cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2411.19750
Pdf URL: https://arxiv.org/pdf/2411.19750
Copy Paste: [[2411.19750]] A Comprehensive Content Verification System for ensuring Digital Integrity in the Age of Deep Fakes(https://arxiv.org/abs/2411.19750)
Keywords: robust
Abstract: In an era marked by the widespread sharing of digital content, the need for a robust content-integrity verification goes beyond the confines of individual social media platforms. While verified profiles (such as blue ticks on platforms like Instagram and X) have become synonymous with credibility, the content they share often traverses a complex network of interconnected platforms, by means of re-sharing, re-posting, etc., leaving a void in the authentication process of the content itself. With the advent of easily accessible AI tools (like DALL-E, Sora, and the tools that are explicitly built for generating deepfakes & face swaps), the risk of misinformation through social media platforms is growing exponentially. This paper discusses a solution, a Content Verification System, designed to authenticate images and videos shared as posts or stories across the digital landscape. Going beyond the limitations of blue ticks, this system empowers individuals and influencers to validate the authenticity of their digital footprint, safeguarding their reputation in an interconnected world.

Title: Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models

Authors: Kaican Li, Weiyan Xie, Yongxiang Huang, Didan Deng, Lanqing Hong, Zhenguo Li, Ricardo Silva, Nevin L. Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19757
Pdf URL: https://arxiv.org/pdf/2411.19757
Copy Paste: [[2411.19757]] Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models(https://arxiv.org/abs/2411.19757)
Keywords: robust
Abstract: Fine-tuning foundation models often compromises their robustness to distribution shifts. To remedy this, most robust fine-tuning methods aim to preserve the pre-trained features. However, not all pre-trained features are robust and those methods are largely indifferent to which ones to preserve. We propose dual risk minimization (DRM), which combines empirical risk minimization with worst-case risk minimization, to better preserve the core features of downstream tasks. In particular, we utilize core-feature descriptions generated by LLMs to induce core-based zero-shot predictions which then serve as proxies to estimate the worst-case risk. DRM balances two crucial aspects of model robustness: expected performance and worst-case performance, establishing a new state of the art on various real-world benchmarks. DRM significantly improves the out-of-distribution performance of CLIP ViT-L/14@336 on ImageNet (75.9 to 77.1), WILDS-iWildCam (47.1 to 51.8), and WILDS-FMoW (50.7 to 53.1); opening up new avenues for robust fine-tuning. Our code is available at this https URL .

Title: Evidence-Based Threat Modeling for ICS

Authors: Can Ozkan, Dave Singelee
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.19759
Pdf URL: https://arxiv.org/pdf/2411.19759
Copy Paste: [[2411.19759]] Evidence-Based Threat Modeling for ICS(https://arxiv.org/abs/2411.19759)
Keywords: attack
Abstract: ICS environments are vital to the operation of critical infrastructure such as power grids, water treatment facilities, and manufacturing plants. However, these systems are vulnerable to cyber attacks due to their reliance on interconnected devices and networks, which could lead to catastrophic failures. Therefore, securing these systems from cyber threats becomes paramount. In this context, threat modeling plays an essential role. Despite the advances in threat modeling, the fundamental gap in the state-of-the art is the lack of a systematic methodology for identifying threats in ICS comprehensively. Most threat models in the literature (i) rely on expert knowledge, (ii) only include generic threats such as spoofing, tampering, etc., and (iii) these threats are not comprehensive enough for the systems in question. To overcome these limitations, we propose a novel evidence-based methodology to systematically identify threats based on existing CVE entries of components and their associated fundamental weaknesses in the form of CWE entries - namely, CVE-CWE pairs - and thereby generate a comprehensive threat list. Furthermore, we have implemented our methodology as a ready-to-use tool and have applied it to a typical SCADA system to demonstrate that our methodology is practical and applicable in real-world settings.

Title: Riemannian Denoising Score Matching for Molecular Structure Optimization with Accurate Energy

Authors: Jeheon Woo, Seonghwan Kim, Jun Hyeong Kim, Woo Youn Kim
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2411.19769
Pdf URL: https://arxiv.org/pdf/2411.19769
Copy Paste: [[2411.19769]] Riemannian Denoising Score Matching for Molecular Structure Optimization with Accurate Energy(https://arxiv.org/abs/2411.19769)
Keywords: robust, diffusion
Abstract: This study introduces a modified score matching method aimed at generating molecular structures with high energy accuracy. The denoising process of score matching or diffusion models mirrors molecular structure optimization, where scores act like physical force fields that guide particles toward equilibrium states. To achieve energetically accurate structures, it can be advantageous to have the score closely approximate the gradient of the actual potential energy surface. Unlike conventional methods that simply design the target score based on structural differences in Euclidean space, we propose a Riemannian score matching approach. This method represents molecular structures on a manifold defined by physics-informed internal coordinates to efficiently mimic the energy landscape, and performs noising and denoising within this space. Our method has been evaluated by refining several types of starting structures on the QM9 and GEOM datasets, demonstrating that the proposed Riemannian score matching method significantly improves the accuracy of the generated molecular structures, attaining chemical accuracy. The implications of this study extend to various applications in computational chemistry, offering a robust tool for accurate molecular structure prediction.

Title: LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

Authors: Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2411.19772
Pdf URL: https://arxiv.org/pdf/2411.19772
Copy Paste: [[2411.19772]] LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos(https://arxiv.org/abs/2411.19772)
Keywords: large language model
Abstract: Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.

Title: PerLA: Perceptive 3D Language Assistant

Authors: Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, Yiming Wang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19774
Pdf URL: https://arxiv.org/pdf/2411.19774
Copy Paste: [[2411.19774]] PerLA: Perceptive 3D Language Assistant(https://arxiv.org/abs/2411.19774)
Keywords: large language model
Abstract: Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CiDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning.\url{this https URL}

Title: MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks

Authors: Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19786
Pdf URL: https://arxiv.org/pdf/2411.19786
Copy Paste: [[2411.19786]] MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks(https://arxiv.org/abs/2411.19786)
Keywords: diffusion, generative, large language model
Abstract: Recently, human motion analysis has experienced great improvement due to inspiring generative models such as the denoising diffusion model and large language model. While the existing approaches mainly focus on generating motions with textual descriptions and overlook the reciprocal task. In this paper, we present~\textbf{MoTe}, a unified multi-modal model that could handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe enables us to handle the paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and Moti-on-Text Diffusion Model (MTDM). In particular, MED and TED are trained for extracting latent embeddings, and subsequently reconstructing the motion sequences and textual descriptions from the extracted embeddings, respectively. MTDM, on the other hand, performs an iterative denoising process on the input context to handle diverse tasks. Experimental results on the benchmark datasets demonstrate the superior performance of our proposed method on text-to-motion generation and competitive performance on motion captioning.

Title: Rethinking the initialization of Momentum in Federated Learning with Heterogeneous Data

Authors: Chenguang Xiao, Shuo Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19798
Pdf URL: https://arxiv.org/pdf/2411.19798
Copy Paste: [[2411.19798]] Rethinking the initialization of Momentum in Federated Learning with Heterogeneous Data(https://arxiv.org/abs/2411.19798)
Keywords: federate
Abstract: Data Heterogeneity is a major challenge of Federated Learning performance. Recently, momentum based optimization techniques have beed proved to be effective in mitigating the heterogeneity issue. Along with the model updates, the momentum updates are transmitted to the server side and aggregated. Therefore, the local training initialized with a global momentum is guided by the global history of the gradients. However, we spot a problem in the traditional cumulation of the momentum which is suboptimal in the Federated Learning systems. The momentum used to weight less on the historical gradients and more on the recent gradients. This however, will engage more biased local gradients in the end of the local training. In this work, we propose a new way to calculate the estimated momentum used in local initialization. The proposed method is named as Reversed Momentum Federated Learning (RMFL). The key idea is to assign exponentially decayed weights to the gradients with the time going forward, which is on the contrary to the traditional momentum cumulation. The effectiveness of RMFL is evaluated on three popular benchmark datasets with different heterogeneity levels.

Title: INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19799
Pdf URL: https://arxiv.org/pdf/2411.19799
Copy Paste: [[2411.19799]] INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge(https://arxiv.org/abs/2411.19799)
Keywords: generative, large language model
Abstract: The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.

Title: SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition

Authors: Fangze Fu, Wei Ai, Fan Yang, Yuntao Shou, Tao Meng, Keqin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19822
Pdf URL: https://arxiv.org/pdf/2411.19822
Copy Paste: [[2411.19822]] SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition(https://arxiv.org/abs/2411.19822)
Keywords: extraction
Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to classify utterance emotions using textual, auditory, and visual modal features. Most existing MERC methods assume each utterance has complete modalities, overlooking the common issue of incomplete modalities in real-world scenarios. Recently, graph neural networks (GNNs) have achieved notable results in Incomplete Multimodal Emotion Recognition in Conversations (IMERC). However, traditional GNNs focus on binary relationships between nodes, limiting their ability to capture more complex, higher-order information. Moreover, repeated message passing can cause over-smoothing, reducing their capacity to preserve essential high-frequency details. To address these issues, we propose a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete multimodal learning in conversational emotion recognition. SDR-GNN constructs an utterance semantic interaction graph using a sliding window based on both speaker and context relationships to model emotional dependencies. To capture higher-order and high-frequency information, SDR-GNN utilizes weighted relationship aggregation, ensuring consistent semantic feature extraction across utterances. Additionally, it performs multi-frequency aggregation in the spectral domain, enabling efficient recovery of incomplete modalities by extracting both high- and low-frequency information. Finally, multi-head attention is applied to fuse and optimize features for emotion recognition. Extensive experiments on various real-world datasets demonstrate that our approach is effective in incomplete multimodal learning and outperforms current state-of-the-art methods.

Title: Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation

Authors: Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19832
Pdf URL: https://arxiv.org/pdf/2411.19832
Copy Paste: [[2411.19832]] Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation(https://arxiv.org/abs/2411.19832)
Keywords: privacy, large language model
Abstract: The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limitations in customisation, accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in detecting other sensitive categories such as substance abuse or self-harm. In this paper, we put forward a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous focalised research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which underperform by 10-15% overall. This limitation is even more pronounced on popular moderation APIs, which cannot be easily tailored to specific sensitive content categories, among others.

Title: Towards Class-wise Robustness Analysis

Authors: Tejaswini Medi, Julia Grabinski, Margret Keuper
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19853
Pdf URL: https://arxiv.org/pdf/2411.19853
Copy Paste: [[2411.19853]] Towards Class-wise Robustness Analysis(https://arxiv.org/abs/2411.19853)
Keywords: attack, robust, fair
Abstract: While being very successful in solving many downstream tasks, the application of deep neural networks is limited in real-life scenarios because of their susceptibility to domain shifts such as common corruptions, and adversarial attacks. The existence of adversarial examples and data corruption significantly reduces the performance of deep classification models. Researchers have made strides in developing robust neural architectures to bolster decisions of deep classifiers. However, most of these works rely on effective adversarial training methods, and predominantly focus on overall model robustness, disregarding class-wise differences in robustness, which are critical. Exploiting weakly robust classes is a potential avenue for attackers to fool the image recognition models. Therefore, this study investigates class-to-class biases across adversarially trained robust classification models to understand their latent space structures and analyze their strong and weak class-wise properties. We further assess the robustness of classes against common corruptions and adversarial attacks, recognizing that class vulnerability extends beyond the number of correct classifications for a specific class. We find that the number of false positives of classes as specific target classes significantly impacts their vulnerability to attacks. Through our analysis on the Class False Positive Score, we assess a fair evaluation of how susceptible each class is to misclassification.

Title: What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric review

Authors: Mohammed Q. Shormani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.19858
Pdf URL: https://arxiv.org/pdf/2411.19858
Copy Paste: [[2411.19858]] What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric review(https://arxiv.org/abs/2411.19858)
Keywords: robust
Abstract: There is a strong correlation between linguistics and artificial intelligence (AI), best manifested by deep learning language models. This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production during 51 years, from 1974 to 2024. It involves 5750 Web of Science-indexed articles published in 2124 journals, which are written by 20835 authors belonging to 13773 research centers in 794 countries. Two powerful software, viz., CiteSpace and VOSviewer, were used to generate mapping visualizations of the intellectual landscape, trending issues and (re)emerging hotspots. The results indicate that in the 1980s and 1990s, linguistics and AI research was not robust, characterized by unstable publication over time. It has, however, witnessed a remarkable increase of publication since then, reaching 1478 articles in 2023, and 546 articles in January-March timespan in 2024, involving emerging issues and hotspots, addressing new horizons, new topics, and launching new applications and powerful deep learning language models including ChatGPT.

Title: SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection

Authors: Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Felix Fent, Gerhard Rigoll
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19860
Pdf URL: https://arxiv.org/pdf/2411.19860
Copy Paste: [[2411.19860]] SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection(https://arxiv.org/abs/2411.19860)
Keywords: transformer
Abstract: In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at this https URL.

Title: Reverse Thinking Makes LLMs Stronger Reasoners

Authors: Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19865
Pdf URL: https://arxiv.org/pdf/2411.19865
Copy Paste: [[2411.19865]] Reverse Thinking Makes LLMs Stronger Reasoners(https://arxiv.org/abs/2411.19865)
Keywords: large language model
Abstract: Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.

Title: AIDetx: a compression-based method for identification of machine-learning generated text

Authors: Leonardo Almeida, Pedro Rodrigues, Diogo Magalhães, Armando J. Pinho, Diogo Pratas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19869
Pdf URL: https://arxiv.org/pdf/2411.19869
Copy Paste: [[2411.19869]] AIDetx: a compression-based method for identification of machine-learning generated text(https://arxiv.org/abs/2411.19869)
Keywords: interpretability, large language model
Abstract: This paper introduces AIDetx, a novel method for detecting machine-generated text using data compression techniques. Traditional approaches, such as deep learning classifiers, often suffer from high computational costs and limited interpretability. To address these limitations, we propose a compression-based classification framework that leverages finite-context models (FCMs). AIDetx constructs distinct compression models for human-written and AI-generated text, classifying new inputs based on which model achieves a higher compression ratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores exceeding 97% and 99%, respectively, highlighting its high accuracy. Compared to current methods, such as large language models (LLMs), AIDetx offers a more interpretable and computationally efficient solution, significantly reducing both training time and hardware requirements (e.g., no GPUs needed). The full implementation is publicly available at this https URL.

Title: LUMIA: Linear probing for Unimodal and MultiModal Membership Inference A!acks leveraging internal LLM states

Authors: Luis Ibanez-Lissen, Lorena Gonzalez-Manzano, Jose Maria de Fuentes, Nicolas Anciaux, Joaquin Garcia-Alfaro
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19876
Pdf URL: https://arxiv.org/pdf/2411.19876
Copy Paste: [[2411.19876]] LUMIA: Linear probing for Unimodal and MultiModal Membership Inference A!acks leveraging internal LLM states(https://arxiv.org/abs/2411.19876)
Keywords: attack, membership infer, large language model
Abstract: Large Language Models (LLMs) are increasingly used in a variety of applications, but concerns around membership inference have grown in parallel. Previous efforts focus on black-to-grey-box models, thus neglecting the potential benefit from internal LLM information. To address this, we propose the use of Linear Probes (LPs) as a method to detect Membership Inference Attacks (MIAs) by examining internal activations of LLMs. Our approach, dubbed LUMIA, applies LPs layer-by-layer to get fine-grained data on the model inner workings. We test this method across several model architectures, sizes and datasets, including unimodal and multimodal tasks. In unimodal MIA, LUMIA achieves an average gain of 15.71 % in Area Under the Curve (AUC) over previous techniques. Remarkably, LUMIA reaches AUC>60% in 65.33% of cases -- an increment of 46.80% against the state of the art. Furthermore, our approach reveals key insights, such as the model layers where MIAs are most detectable. In multimodal models, LPs indicate that visual inputs can significantly contribute to detect MIAs -- AUC>60% is reached in 85.90% of experiments.

Title: Open source Differentiable ODE Solving Infrastructure

Authors: Rakshit Kr. Singh, Aaron Rock Menezes, Rida Irfan, Bharath Ramsundar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.19882
Pdf URL: https://arxiv.org/pdf/2411.19882
Copy Paste: [[2411.19882]] Open source Differentiable ODE Solving Infrastructure(https://arxiv.org/abs/2411.19882)
Keywords: diffusion
Abstract: Ordinary Differential Equations (ODEs) are widely used in physics, chemistry, and biology to model dynamic systems, including reaction kinetics, population dynamics, and biological processes. In this work, we integrate GPU-accelerated ODE solvers into the open-source DeepChem framework, making these tools easily accessible. These solvers support multiple numerical methods and are fully differentiable, enabling easy integration into more complex differentiable programs. We demonstrate the capabilities of our implementation through experiments on Lotka-Volterra predator-prey dynamics, pharmacokinetic compartment models, neural ODEs, and solving PDEs using reaction-diffusion equations. Our solvers achieved high accuracy with mean squared errors ranging from $10^{-4}$ to $10^{-6}$ and showed scalability in solving large systems with up to 100 compartments.

Title: FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For Anomaly Segmentation

Authors: Chang Won Lee, Selina Leveugle, Svetlana Stolpner, Chris Langley, Paul Grouchy, Jonathan Kelly, Steven L. Waslander
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19888
Pdf URL: https://arxiv.org/pdf/2411.19888
Copy Paste: [[2411.19888]] FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For Anomaly Segmentation(https://arxiv.org/abs/2411.19888)
Keywords: segmentation
Abstract: Anomaly segmentation is a valuable computer vision task for safety-critical applications that need to be aware of unexpected events. Current state-of-the-art (SOTA) scene-level anomaly segmentation approaches rely on diverse inlier class labels during training, limiting their ability to leverage vast unlabeled datasets and pre-trained vision encoders. These methods may underperform in domains with reduced color diversity and limited object classes. Conversely, existing unsupervised methods struggle with anomaly segmentation with the diverse scenes of less restricted domains. To address these challenges, we introduce FlowCLAS, a novel self-supervised framework that utilizes vision foundation models to extract rich features and employs a normalizing flow network to learn their density distribution. We enhance the model's discriminative power by incorporating Outlier Exposure and contrastive learning in the latent space. FlowCLAS significantly outperforms all existing methods on the ALLO anomaly segmentation benchmark for space robotics and demonstrates competitive results on multiple road anomaly segmentation benchmarks for autonomous driving, including Fishyscapes Lost&Found and Road Anomaly. These results highlight FlowCLAS's effectiveness in addressing the unique challenges of space anomaly segmentation while retaining SOTA performance in the autonomous driving domain without reliance on inlier segmentation labels.

Title: GuardSplat: Robust and Efficient Watermarking for 3D Gaussian Splatting

Authors: Zixuan Chen, Guangcong Wang, Jiahao Zhu, Jianhuang Lai, Xiaohua Xie
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2411.19895
Pdf URL: https://arxiv.org/pdf/2411.19895
Copy Paste: [[2411.19895]] GuardSplat: Robust and Efficient Watermarking for 3D Gaussian Splatting(https://arxiv.org/abs/2411.19895)
Keywords: security, protect, robust, extraction, watermark
Abstract: 3D Gaussian Splatting (3DGS) has recently created impressive assets for various applications. However, the copyright of these assets is not well protected as existing watermarking methods are not suited for 3DGS considering security, capacity, and invisibility. Besides, these methods often require hours or even days for optimization, limiting the application scenarios. In this paper, we propose GuardSplat, an innovative and efficient framework that effectively protects the copyright of 3DGS assets. Specifically, 1) We first propose a CLIP-guided Message Decoupling Optimization module for training the message decoder, leveraging CLIP's aligning capability and rich representations to achieve a high extraction accuracy with minimal optimization costs, presenting exceptional capability and efficiency. 2) Then, we propose a Spherical-harmonic-aware (SH-aware) Message Embedding module tailored for 3DGS, which employs a set of SH offsets to seamlessly embed the message into the SH features of each 3D Gaussian while maintaining the original 3D structure. It enables the 3DGS assets to be watermarked with minimal fidelity trade-offs and prevents malicious users from removing the messages from the model files, meeting the demands for invisibility and security. 3) We further propose an Anti-distortion Message Extraction module to improve robustness against various visual distortions. Extensive experiments demonstrate that GuardSplat outperforms the state-of-the-art methods and achieves fast optimization speed.

Title: $C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields

Authors: Prajwal Singh, Ashish Tiwari, Gautam Vashishtha, Shanmuganathan Raman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19903
Pdf URL: https://arxiv.org/pdf/2411.19903
Copy Paste: [[2411.19903]] $C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields(https://arxiv.org/abs/2411.19903)
Keywords: generative
Abstract: Neural radiance fields (NeRF) have exhibited highly photorealistic rendering of novel views through per-scene optimization over a single 3D scene. With the growing popularity of NeRF and its variants, they have become ubiquitous and have been identified as efficient 3D resources. However, they are still far from being scalable since a separate model needs to be stored for each scene, and the training time increases linearly with every newly added scene. Surprisingly, the idea of encoding multiple 3D scenes into a single NeRF model is heavily under-explored. In this work, we propose a novel conditional-cum-continual framework, called $C^{3}$-NeRF, to accommodate multiple scenes into the parameters of a single neural radiance field. Unlike conventional approaches that leverage feature extractors and pre-trained priors for scene conditioning, we use simple pseudo-scene labels to model multiple scenes in NeRF. Interestingly, we observe the framework is also inherently continual (via generative replay) with minimal, if not no, forgetting of the previously learned scenes. Consequently, the proposed framework adapts to multiple new scenes without necessarily accessing the old data. Through extensive qualitative and quantitative evaluation using synthetic and real datasets, we demonstrate the inherent capacity of the NeRF model to accommodate multiple scenes with high-quality novel-view renderings without adding additional parameters. We provide implementation details and dynamic visualizations of our results in the supplementary file.

Title: Quantifying the synthetic and real domain gap in aerial scene understanding

Authors: Alina Marcu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19913
Pdf URL: https://arxiv.org/pdf/2411.19913
Copy Paste: [[2411.19913]] Quantifying the synthetic and real domain gap in aerial scene understanding(https://arxiv.org/abs/2411.19913)
Keywords: robust, transformer
Abstract: Quantifying the gap between synthetic and real-world imagery is essential for improving both transformer-based models - that rely on large volumes of data - and datasets, especially in underexplored domains like aerial scene understanding where the potential impact is significant. This paper introduces a novel methodology for scene complexity assessment using Multi-Model Consensus Metric (MMCM) and depth-based structural metrics, enabling a robust evaluation of perceptual and structural disparities between domains. Our experimental analysis, utilizing real-world (Dronescapes) and synthetic (Skyscenes) datasets, demonstrates that real-world scenes generally exhibit higher consensus among state-of-the-art vision transformers, while synthetic scenes show greater variability and challenge model adaptability. The results underline the inherent complexities and domain gaps, emphasizing the need for enhanced simulation fidelity and model generalization. This work provides critical insights into the interplay between domain characteristics and model performance, offering a pathway for improved domain adaptation strategies in aerial scene understanding.

Title: SIMS: Simulating Human-Scene Interactions with Real World Script Planning

Authors: Wenjia Wang, Liang Pan, Zhiyang Dou, Zhouyingcheng Liao, Yuke Lou, Lei Yang, Jingbo Wang, Taku Komura
Subjects: cs.CV, cs.AI, cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19921
Pdf URL: https://arxiv.org/pdf/2411.19921
Copy Paste: [[2411.19921]] SIMS: Simulating Human-Scene Interactions with Real World Script Planning(https://arxiv.org/abs/2411.19921)
Keywords: large language model
Abstract: Simulating long-term human-scene interaction is a challenging yet fascinating task. Previous works have not effectively addressed the generation of long-term human scene interactions with detailed narratives for physics-based animation. This paper introduces a novel framework for the planning and controlling of long-horizon physical plausible human-scene interaction. On the one hand, films and shows with stylish human locomotions or interactions with scenes are abundantly available on the internet, providing a rich source of data for script planning. On the other hand, Large Language Models (LLMs) can understand and generate logical storylines. This motivates us to marry the two by using an LLM-based pipeline to extract scripts from videos, and then employ LLMs to imitate and create new scripts, capturing complex, time-series human behaviors and interactions with environments. By leveraging this, we utilize a dual-aware policy that achieves both language comprehension and scene understanding to guide character motions within contextual and spatial constraints. To facilitate training and evaluation, we contribute a comprehensive planning dataset containing diverse motion sequences extracted from real-world videos and expand them with large language models. We also collect and re-annotate motion clips from existing kinematic datasets to enable our policy learn diverse skills. Extensive experiments demonstrate the effectiveness of our framework in versatile task execution and its generalization ability to various scenarios, showing remarkably enhanced performance compared with existing methods. Our code and data will be publicly available soon.

Title: Scalable Out-of-distribution Robustness in the Presence of Unobserved Confounders

Authors: Parjanya Prashant, Seyedeh Baharan Khatami, Bruno Ribeiro, Babak Salimi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.19923
Pdf URL: https://arxiv.org/pdf/2411.19923
Copy Paste: [[2411.19923]] Scalable Out-of-distribution Robustness in the Presence of Unobserved Confounders(https://arxiv.org/abs/2411.19923)
Keywords: robust
Abstract: We consider the task of out-of-distribution (OOD) generalization, where the distribution shift is due to an unobserved confounder ($Z$) affecting both the covariates ($X$) and the labels ($Y$). In this setting, traditional assumptions of covariate and label shift are unsuitable due to the confounding, which introduces heterogeneity in the predictor, i.e., $\hat{Y} = f_Z(X)$. OOD generalization differs from traditional domain adaptation by not assuming access to the covariate distribution ($X^\text{te}$) of the test samples during training. These conditions create a challenging scenario for OOD robustness: (a) $Z^\text{tr}$ is an unobserved confounder during training, (b) $P^\text{te}{Z} \neq P^\text{tr}{Z}$, (c) $X^\text{te}$ is unavailable during training, and (d) the posterior predictive distribution depends on $P^\text{te}(Z)$, i.e., $\hat{Y} = E_{P^\text{te}(Z)}[f_Z(X)]$. In general, accurate predictions are unattainable in this scenario, and existing literature has proposed complex predictors based on identifiability assumptions that require multiple additional variables. Our work investigates a set of identifiability assumptions that tremendously simplify the predictor, whose resulting elegant simplicity outperforms existing approaches.

Title: On Domain-Specific Post-Training for Multimodal Large Language Models

Authors: Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19930
Pdf URL: https://arxiv.org/pdf/2411.19930
Copy Paste: [[2411.19930]] On Domain-Specific Post-Training for Multimodal Large Language Models(https://arxiv.org/abs/2411.19930)
Keywords: large language model
Abstract: Recent years have witnessed the rapid development of general multimodal large language models (MLLMs). However, adapting general MLLMs to specific domains, such as scientific fields and industrial applications, remains less explored. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs. (2) Training Pipeline: While the two-stage training--initially on image-caption pairs followed by visual instruction tasks--is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks. To support further research in MLLM domain adaptation, we will open-source our implementations.

Title: VLSBench: Unveiling Visual Leakage in Multimodal Safety

Authors: Xuhao Hu, Dongrui Liu, Hao Li, Xuanjing Huang, Jing Shao
Subjects: cs.CR, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19939
Pdf URL: https://arxiv.org/pdf/2411.19939
Copy Paste: [[2411.19939]] VLSBench: Unveiling Visual Leakage in Multimodal Safety(https://arxiv.org/abs/2411.19939)
Keywords: large language model
Abstract: Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. To explain such a counter-intuitive phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. In this way, MLLMs can easily refuse these sensitive text-image queries according to textual queries. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct multimodal visual leakless safety benchmark (VLSBench) preventing visual safety leakage from image to textual query with 2.4k image-text pairs. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL. Please see our code and data at: this http URL

Title: Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability

Authors: Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19943
Pdf URL: https://arxiv.org/pdf/2411.19943
Copy Paste: [[2411.19943]] Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability(https://arxiv.org/abs/2411.19943)
Keywords: large language model
Abstract: Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of ``critical tokens'' that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach - cDPO - designed to automatically recognize and conduct token-level rewards for the critical tokens during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens. It is achieved by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories, consequently, they are capable of identifying identify critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithms to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative model as important weight for token-level DPO this http URL results on GSM8K and MATH500 benchmarks with two-widely used models Llama-3 (8B and 70B) and deepseek-math (7B) demonstrate the effectiveness of the propsoed approach cDPO.

Title: T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Authors: Shukang Yin, Chaoyou Fu, Sirui Zhao, Yunhang Shen, Chunjiang Ge, Yan Yang, Zuwei Long, Yuhan Dai, Tong Xu, Xing Sun, Ran He, Caifeng Shan, Enhong Chen
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19951
Pdf URL: https://arxiv.org/pdf/2411.19951
Copy Paste: [[2411.19951]] T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs(https://arxiv.org/abs/2411.19951)
Keywords: large language model
Abstract: The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending the success to the video understanding realms. Apart from training from scratch, an efficient way is to utilize the pre-trained image-LLMs, leading to two mainstream approaches, i.e. zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches harvests an effective data augmentation method. We first make a deeper inspection of the zero-shot inference way and identify two limitations, i.e. limited generalization and lack of temporal understanding capabilities. Thus, we further investigate the fine-tuning approach and find a low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets by training with just 15% the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training with long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and curation of high-quality data. The code is released at this https URL.