2024-11-28

Title: UVCG: Leveraging Temporal Consistency for Universal Video Protection

Authors: KaiZhou Li, Jindong Gu, Xinchun Yu, Junjie Cao, Yansong Tang, Xiao-Ping Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17746
Pdf URL: https://arxiv.org/pdf/2411.17746
Copy Paste: [[2411.17746]] UVCG: Leveraging Temporal Consistency for Universal Video Protection(https://arxiv.org/abs/2411.17746)
Keywords: security, protect, diffusion
Abstract: The security risks of AI-driven video editing have garnered significant attention. Although recent studies indicate that adding perturbations to images can protect them from malicious edits, directly applying image-based methods to perturb each frame in a video becomes ineffective, as video editing techniques leverage the consistency of inter-frame information to restore individually perturbed content. To address this challenge, we leverage the temporal consistency of video content to propose a straightforward and efficient, yet highly effective and broadly applicable approach, Universal Video Consistency Guard (UVCG). UVCG embeds the content of another video (the target video) within a protected video by introducing continuous, imperceptible perturbations that force the encoder of editing models to map continuous inputs to misaligned continuous outputs, thereby inhibiting the generation of videos consistent with the intended textual prompts. Additionally, leveraging the similarity in perturbations between adjacent frames, we improve the computational efficiency of perturbation generation by employing a perturbation-reuse strategy. We applied UVCG across various versions of Latent Diffusion Models (LDM) and assessed its effectiveness and generalizability across multiple LDM-based editing pipelines. The results confirm the effectiveness, transferability, and efficiency of our approach in safeguarding video content from unauthorized modifications.

Title: Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Authors: Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17760
Pdf URL: https://arxiv.org/pdf/2411.17760
Copy Paste: [[2411.17760]] Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach(https://arxiv.org/abs/2411.17760)
Keywords: robust, large language model
Abstract: Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls like reward hacking and model collapse. This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate and reverse pairs when necessary. Evaluations across public benchmarks and our newly introduced IC dataset designed to challenge hallucination control demonstrate that our model outperforms conventional techniques. We achieve superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.
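
The pair-reversal step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_score` is a hypothetical stand-in for a contrastive language-image encoder's image-text similarity, and the names `fix_pairs` and `margin` are assumptions introduced only for this example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (preferred response, rejected response)


def toy_score(caption: str) -> float:
    """Hypothetical scorer: fraction of caption words that match image tags.

    A real pipeline would use an image-text similarity from a contrastive
    encoder; this word-overlap score only keeps the sketch runnable.
    """
    image_tags = {"dog", "grass", "ball"}  # assumed content of the image
    words = set(caption.lower().split())
    return len(words & image_tags) / max(len(words), 1)


def fix_pairs(
    pairs: List[Pair],
    score: Callable[[str], float],
    margin: float = 0.0,
) -> List[Pair]:
    """Swap a pair when the 'rejected' response scores higher than the 'preferred' one."""
    fixed = []
    for preferred, rejected in pairs:
        if score(rejected) > score(preferred) + margin:
            preferred, rejected = rejected, preferred
        fixed.append((preferred, rejected))
    return fixed


pairs = [
    ("a dog on grass", "a cat on sofa"),   # already ordered correctly
    ("a cat on sofa", "a dog with ball"),  # mislabeled, should be reversed
]
cleaned = fix_pairs(pairs, toy_score)
```

Here the second pair is reversed because its "rejected" caption matches the assumed image content better than its "preferred" one. The `margin` argument is one possible way to avoid flipping pairs whose scores are nearly tied.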

Title: OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Authors: Zhongyu Xia, Jishuo Li, Zhiwei Lin, Xinhao Wang, Yongtao Wang, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17761
Pdf URL: https://arxiv.org/pdf/2411.17761
Copy Paste: [[2411.17761]] OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection(https://arxiv.org/abs/2411.17761)
Keywords: large language model
Abstract: Open-world autonomous driving encompasses domain generalization and open vocabulary. Domain generalization refers to the capabilities of autonomous driving systems across different scenarios and sensor parameter configurations. Open vocabulary pertains to the ability to recognize various semantic categories not encountered during training. In this paper, we introduce OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection. OpenAD is built on a corner case discovery and annotation pipeline that integrates a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various 2D and 3D open-world and specialized models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. Annotations, toolkit code, and all evaluation codes will be released.

Title: Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Authors: Xiang Li, Zixuan Huang, Anh Thai, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17763
Pdf URL: https://arxiv.org/pdf/2411.17763
Copy Paste: [[2411.17763]] Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation(https://arxiv.org/abs/2411.17763)
Keywords: robust, diffusion, transformer, generative
Abstract: Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.

Title: Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

Authors: Xinyu Hou, Zongsheng Yue, Xiaoming Li, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17769
Pdf URL: https://arxiv.org/pdf/2411.17769
Copy Paste: [[2411.17769]] Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis(https://arxiv.org/abs/2411.17769)
Keywords: diffusion
Abstract: In this work, we introduce a single parameter, $\omega$, to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model's reverse process. Our approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet enables precise control over the level of detail in the generated outputs. Moreover, spatial masks or denoising schedules with varying $\omega$ values can be applied to achieve region-specific or timestep-specific granularity control. Prior knowledge of image composition from control signals or reference images further facilitates the creation of precise $\omega$ masks for granularity control on specific objects. To highlight the parameter's role in controlling subtle detail variations, the technique is named Omegance, combining "omega" and "nuance". Our method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code is available at this https URL.

Title: MTS-UNMixers: Multivariate Time Series Forecasting via Channel-Time Dual Unmixing

Authors: Xuanbing Zhu, Dunbin Shen, Zhongwen Rao, Huiyi Ma, Yingguang Hao, Hongyu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17770
Pdf URL: https://arxiv.org/pdf/2411.17770
Copy Paste: [[2411.17770]] MTS-UNMixers: Multivariate Time Series Forecasting via Channel-Time Dual Unmixing(https://arxiv.org/abs/2411.17770)
Keywords: robust, interpretability
Abstract: Multivariate time series data provide a robust framework for future predictions by leveraging information across multiple dimensions, ensuring broad applicability in practical scenarios. However, their high dimensionality and mixing patterns pose significant challenges in establishing an interpretable and explicit mapping between historical and future series, as well as extracting long-range feature dependencies. To address these challenges, we propose a channel-time dual unmixing network for multivariate time series forecasting (named MTS-UNMixers), which decomposes the entire series into critical bases and coefficients across both the time and channel dimensions. This approach establishes a robust sharing mechanism between historical and future series, enabling accurate representation and enhancing physical interpretability. Specifically, MTS-UNMixers represent sequences over time as a mixture of multiple trends and cycles, with the time-correlated representation coefficients shared across both historical and future time periods. In contrast, sequences over channels can be decomposed into multiple tick-wise bases, which characterize the channel correlations and are shared across the whole series. To estimate the shared time-dependent coefficients, a vanilla Mamba network is employed, leveraging its alignment with directional causality. Conversely, a bidirectional Mamba network is utilized to model the shared channel-correlated bases, accommodating noncausal relationships. Experimental results show that MTS-UNMixers significantly outperform existing methods on multiple benchmark datasets. The code is available at this https URL.

Title: MVBoost: Boost 3D Reconstruction with Multi-View Refinement

Authors: Xiangyu Liu, Xiaomei Zhang, Zhiyuan Ma, Xiangyu Zhu, Zhen Lei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17772
Pdf URL: https://arxiv.org/pdf/2411.17772
Copy Paste: [[2411.17772]] MVBoost: Boost 3D Reconstruction with Multi-View Refinement(https://arxiv.org/abs/2411.17772)
Keywords: robust, diffusion
Abstract: Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.

Title: Efficient Multi-modal Large Language Models via Visual Token Grouping

Authors: Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17773
Pdf URL: https://arxiv.org/pdf/2411.17773
Copy Paste: [[2411.17773]] Efficient Multi-modal Large Language Models via Visual Token Grouping(https://arxiv.org/abs/2411.17773)
Keywords: large language model, segmentation
Abstract: The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. Existing methods, however, conduct token reduction in the feature alignment phase. In this paper, we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic segments after the linear projection layer before feeding into the vision encoder. Besides, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens utilizing the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, maintaining 98.1% of the original performance while reducing inference time by over 27%.

Title: Network Inversion and Its Applications

Authors: Pirzada Suhail, Hao Tang, Amit Sethi
Subjects: cs.LG, cs.CV, cs.LO
Abstract URL: https://arxiv.org/abs/2411.17777
Pdf URL: https://arxiv.org/pdf/2411.17777
Copy Paste: [[2411.17777]] Network Inversion and Its Applications(https://arxiv.org/abs/2411.17777)
Keywords: interpretability
Abstract: Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as "black boxes." This opacity raises concerns about their interpretability and reliability, especially in safety-critical scenarios. Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes and thereby providing valuable insights into how neural networks arrive at their conclusions, making them more interpretable and trustworthy. This paper presents a simple yet effective approach to network inversion using a meticulously conditioned generator that learns the data distribution in the input space of the trained neural network, enabling the reconstruction of inputs that would most likely lead to the desired outputs. To capture the diversity in the input space for a given output, instead of simply revealing the conditioning labels to the generator, we encode the conditioning label information into vectors and intermediate matrices and further minimize the cosine similarity between features of the generated images. Additionally, we incorporate feature orthogonality as a regularization term to boost image diversity, which penalises the deviations of the Gram matrix of the features from the identity matrix, ensuring orthogonality and promoting distinct, non-redundant representations for each label. The paper concludes by exploring immediate applications of the proposed network inversion approach in interpretability, out-of-distribution detection, and training data reconstruction.
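
The feature-orthogonality regularizer mentioned above (penalising deviations of the Gram matrix from the identity) can be sketched as follows. This is an illustrative reading, not the paper's code: L2-normalizing the features first and averaging the squared deviations are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def orthogonality_penalty(features: Sequence[Sequence[float]]) -> float:
    """Mean squared deviation of the Gram matrix G = F F^T from the identity.

    Assumption: rows of F are unit-normalized, so G[i][j] is the cosine
    similarity between features i and j and G[i][i] is 1.
    """
    unit = [normalize(f) for f in features]
    n = len(unit)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gram_ij = sum(a * b for a, b in zip(unit[i], unit[j]))
            target = 1.0 if i == j else 0.0
            total += (gram_ij - target) ** 2
    return total / (n * n)


diverse = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])    # orthogonal features
redundant = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]])  # identical features
```

With unit-norm features the diagonal terms vanish, so the penalty reduces to the mean squared pairwise cosine similarity: it is zero for mutually orthogonal features and grows as features become redundant.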

Title: Diffusion Autoencoders for Few-shot Image Generation in Hyperbolic Space

Authors: Lingxiao Li, Kaixuan Fan, Boqing Gong, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17784
Pdf URL: https://arxiv.org/pdf/2411.17784
Copy Paste: [[2411.17784]] Diffusion Autoencoders for Few-shot Image Generation in Hyperbolic Space(https://arxiv.org/abs/2411.17784)
Keywords: diffusion
Abstract: Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images and texts from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying semantic codes or guided by textual instructions. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a superior balance between quality and diversity with limited data and offers a highly controllable and interpretable generation process.

Title: DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Authors: Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, Enrico Magli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17786
Pdf URL: https://arxiv.org/pdf/2411.17786
Copy Paste: [[2411.17786]] DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching(https://arxiv.org/abs/2411.17786)
Keywords: diffusion, generative
Abstract: Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

Title: Geometric Point Attention Transformer for 3D Shape Reassembly

Authors: Jiahan Li, Chaoran Cheng, Jianzhu Ma, Ge Liu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17788
Pdf URL: https://arxiv.org/pdf/2411.17788
Copy Paste: [[2411.17788]] Geometric Point Attention Transformer for 3D Shape Reassembly(https://arxiv.org/abs/2411.17788)
Keywords: transformer
Abstract: Shape assembly, which aims to reassemble separate parts into a complete object, has gained significant interest in recent years. Existing methods primarily rely on networks to predict the poses of individual parts, but often fail to effectively capture the geometric interactions between the parts and their poses. In this paper, we present the Geometric Point Attention Transformer (GPAT), a network specifically designed to address the challenges of reasoning about geometric relationships. In the geometric point attention module, we integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part. To enable iterative updates and dynamic reasoning, we introduce a geometric recycling scheme, where each prediction is fed into the next iteration for refinement. We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation, achieving accurate pose predictions and high alignment accuracy.

Title: Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

Authors: Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17790
Pdf URL: https://arxiv.org/pdf/2411.17790
Copy Paste: [[2411.17790]] Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors(https://arxiv.org/abs/2411.17790)
Keywords: robust, generative
Abstract: Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.

Title: $H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17792
Pdf URL: https://arxiv.org/pdf/2411.17792
Copy Paste: [[2411.17792]] $H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs(https://arxiv.org/abs/2411.17792)
Keywords: robust
Abstract: Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. A growing number of alignment-based fine-tuning algorithms and benchmarks emerged recently, fueling the efforts on effective alignments of pre-trained LLMs to ensure helpful, harmless, and honest answers from both open-source and closed-source LLMs. This paper tackles this problem by developing an alignment fusion approach, coined as $H^3$Fusion, with three unique characteristics. First, $H^3$Fusion ensembles multiple individually aligned LLMs to create a final fine-tuned alignment model with enhanced capabilities beyond those of individual models, delivering robust alignment through promoting helpful, harmless, honest fusion. Second, $H^3$Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We first freeze the multi-head attention weights of each individual model while tuning the FFN layer during alignment fusion. Then we merge the aligned model weights with an expert router according to the type of input instruction and dynamically select a subset of experts that are best suited for producing the output response. Finally, we boost the performance of the resulting $H^3$Fusion model by introducing gating loss and regularization terms. The former penalizes the selection errors of the expert-router, and the latter mediates the expert weights drifting during fine-tuning and dynamically adjusts the fusion behavior of the resulting model by canalizing the activations on the experts. Extensive evaluations on three benchmark datasets show that $H^3$Fusion is more helpful, less harmful, and more honest from two aspects: it outperforms each individually aligned model by $11.37\%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77\%$. Code is available at this http URL.
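
The expert-router step (dynamically selecting a subset of experts) can be illustrated with a generic top-k mixture-of-experts gate. This is a textbook-style sketch, not the $H^3$Fusion implementation: the softmax gate, the renormalization over selected experts, and the names `route_top_k` and `mix` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


def mix(expert_outputs: Sequence[Sequence[float]],
        routing: Sequence[Tuple[int, float]]) -> List[float]:
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * expert_outputs[i][d] for i, w in routing) for d in range(dim)]


# Three hypothetical experts; the router favors experts 0 and 2 equally.
routing = route_top_k([2.0, 0.0, 2.0], k=2)
combined = mix([[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]], routing)
```

Only the selected experts contribute to the output, so the unselected expert's values never enter the mix. A gating loss of the kind the abstract mentions would be computed on the router probabilities; it is omitted here because the abstract does not specify its form.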

Title: NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

Authors: Jiaxuan Li, Junwen Mo, MinhDuc Vo, Akihiro Sugimoto, Hideki Nakayama
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17794
Pdf URL: https://arxiv.org/pdf/2411.17794
Copy Paste: [[2411.17794]] NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?(https://arxiv.org/abs/2411.17794)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have made notable advances in visual understanding, yet their abilities to recognize objects modified by specific attributes remain an open question. To address this, we explore MLLMs' reasoning capabilities in object recognition, ranging from commonsense to beyond-commonsense scenarios. We introduce a novel benchmark, NEMO, which comprises 900 images of origiNal fruits and their corresponding attributE-MOdified ones, along with a set of 2,700 questions of open, multiple-choice, and unsolvable types. We assess 26 recent open-sourced and commercial models using our benchmark. The findings highlight pronounced performance gaps in recognizing objects in NEMO and reveal distinct answer preferences across different models. Although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders. Interestingly, scaling up the model size does not consistently yield better outcomes, as deeper analysis reveals that larger LLMs can weaken vision encoders during fine-tuning. These insights shed light on critical limitations in current MLLMs and suggest potential pathways toward developing more versatile and resilient multimodal models.

Title: Scalable iterative pruning of large language and vision models using block coordinate descent

Authors: Gili Rosenberg, J. Kyle Brubaker, Martin J. A. Schuetz, Elton Yechao Zhu, Serdar Kadıoğlu, Sima E. Borujeni, Helmut G. Katzgraber
Subjects: cs.LG, math.OC, quant-ph
Abstract URL: https://arxiv.org/abs/2411.17796
Pdf URL: https://arxiv.org/pdf/2411.17796
Copy Paste: [[2411.17796]] Scalable iterative pruning of large language and vision models using block coordinate descent(https://arxiv.org/abs/2411.17796)
Keywords: large language model
Abstract: Pruning neural networks, which involves removing a fraction of their weights, can often maintain high accuracy while significantly reducing model complexity, at least up to a certain limit. We present a neural network pruning technique that builds upon the Combinatorial Brain Surgeon, but solves an optimization problem over a subset of the network weights in an iterative, block-wise manner using block coordinate descent. The iterative, block-based nature of this pruning technique, which we dub "iterative Combinatorial Brain Surgeon" (iCBS), allows for scalability to very large models, including large language models (LLMs), that may not be feasible with a one-shot combinatorial optimization approach. When applied to large models like Mistral and DeiT, iCBS achieves higher performance metrics at the same density levels compared to existing pruning methods such as Wanda. This demonstrates the effectiveness of this iterative, block-wise pruning method in compressing and optimizing the performance of large deep learning models, even while optimizing over only a small fraction of the weights. Moreover, our approach allows for a quality-time (or cost) tradeoff that is not available when using a one-shot pruning technique alone. The block-wise formulation of the optimization problem enables the use of hardware accelerators, potentially offsetting the increased computational costs compared to one-shot pruning methods like Wanda. In particular, the optimization problem solved for each block is quantum-amenable in that it could, in principle, be solved by a quantum computer.

Title: Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17799
Pdf URL: https://arxiv.org/pdf/2411.17799
Copy Paste: [[2411.17799]] Signs as Tokens: An Autoregressive Multilingual Sign Language Generator(https://arxiv.org/abs/2411.17799)
Keywords: diffusion
Abstract: Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at this https URL.

Title: STAR: Synthesis of Tailored Architectures

Authors: Armin W. Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, Michael Poli
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2411.17800
Pdf URL: https://arxiv.org/pdf/2411.17800
Copy Paste: [[2411.17800]] STAR: Synthesis of Tailored Architectures(https://arxiv.org/abs/2411.17800)
Keywords: transformer
Abstract: Iterative improvement of model architectures is fundamental to deep learning: Transformers first enabled scaling, and recent advances in model hybridization have pushed the quality-efficiency frontier. However, optimizing architectures remains challenging and expensive. Current automated or manual approaches fall short, largely due to limited progress in the design of search spaces and due to the simplicity of resulting patterns and heuristics. In this work, we propose a new approach for the synthesis of tailored architectures (STAR). Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.

Title: From memorization to generalization: a theoretical framework for diffusion-based generative models

Authors: Indranil Halder
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17807
Pdf URL: https://arxiv.org/pdf/2411.17807
Copy Paste: [[2411.17807]] From memorization to generalization: a theoretical framework for diffusion-based generative models(https://arxiv.org/abs/2411.17807)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models demonstrate a transition from memorizing the training dataset to a non-memorization regime as the size of the training set increases. Here, we begin by introducing a mathematically precise definition of this transition in terms of a relative distance: the model is said to be in the non-memorization/'generalization' regime if the generated distribution is almost surely far from the probability distribution associated with a Gaussian kernel approximation to the training dataset, relative to the sampling distribution. Then, we develop an analytically tractable diffusion model and establish a lower bound on the Kullback-Leibler divergence between the generated and sampling distributions. The model also features the transition, according to our definition in terms of the relative distance, when the training data is sampled from an isotropic Gaussian distribution. Further, our study reveals that this transition occurs when the individual distance between the generated and underlying sampling distributions begins to decrease with the addition of more training samples. This is to be contrasted with an alternative scenario, where the model's memorization performance degrades, but generalization performance doesn't improve. We also provide empirical evidence indicating that realistic diffusion models exhibit the same alignment of scales.

Title: Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation

Authors: Sudarshan Rajagopalan, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17814
Pdf URL: https://arxiv.org/pdf/2411.17814
Copy Paste: [[2411.17814]] Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation(https://arxiv.org/abs/2411.17814)
Keywords: segmentation
Abstract: All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them to often struggle with out-of-distribution (OoD) samples or unseen degradations which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation.
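
The low-rank update that the abstract builds on can be sketched as follows. This is a minimal NumPy illustration of generic LoRA (a frozen weight plus a scaled product of two small matrices), not the authors' LoRA-A method. The function names, the matrix sizes, and the `top_singular_alignment` diagnostic are assumptions made for this example; the diagnostic only shows what comparing singular vectors of pre-trained and adapted weights could look like.

```python
import numpy as np


def lora_weight(W, A, B, alpha=1.0):
    """Return the adapted weight W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)


def top_singular_alignment(W_old, W_new):
    """Hypothetical diagnostic: |cosine| between the leading left singular vectors."""
    u_old = np.linalg.svd(W_old)[0][:, 0]
    u_new = np.linalg.svd(W_new)[0][:, 0]
    return float(abs(u_old @ u_new))


rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in))           # trainable down-projection
B = np.zeros((d_out, r))                 # zero init: the adapter starts as a no-op

W_init = lora_weight(W, A, B)            # equals W because B is zero
B = 0.1 * rng.normal(size=(d_out, r))    # stand-in for a trained B (assumption)
W_adapted = lora_weight(W, A, B)
update_rank = np.linalg.matrix_rank(B @ A, tol=1e-10)   # at most r
alignment = top_singular_alignment(W, W_adapted)        # value in [0, 1]
```

Only `A` and `B` would be trained in such a setup, which is why the number of tunable parameters scales with the rank `r` rather than with the full weight matrix.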

Title: CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Authors: Xinhao Liu, Jintong Li, Yichen Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.17820
Pdf URL: https://arxiv.org/pdf/2411.17820
Copy Paste: [[2411.17820]] CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos(https://arxiv.org/abs/2411.17820)
Keywords: robust
Abstract: Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. this https URL

Title: Rapid Distributed Fine-tuning of a Segmentation Model Onboard Satellites

Authors: Meghan Plumridge, Rasmus Maråk, Chiara Ceccobello, Pablo Gómez, Gabriele Meoni, Filip Svoboda, Nicholas D. Lane
Subjects: cs.LG, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2411.17831
Pdf URL: https://arxiv.org/pdf/2411.17831
Copy Paste: [[2411.17831]] Rapid Distributed Fine-tuning of a Segmentation Model Onboard Satellites(https://arxiv.org/abs/2411.17831)
Keywords: segmentation
Abstract: Segmentation of Earth observation (EO) satellite data is critical for natural hazard analysis and disaster response. However, processing EO data at ground stations introduces delays due to data transmission bottlenecks and communication windows. Using segmentation models capable of near-real-time data analysis onboard satellites can therefore improve response times. This study presents a proof-of-concept using MobileSAM, a lightweight, pre-trained segmentation model, onboard Unibap iX10-100 satellite hardware. We demonstrate the segmentation of water bodies from Sentinel-2 satellite imagery and integrate MobileSAM with PASEOS, an open-source Python module that simulates satellite operations. This integration allows us to evaluate MobileSAM's performance under simulated conditions of a satellite constellation. Our research investigates the potential of fine-tuning MobileSAM in a decentralised way onboard multiple satellites in rapid response to a disaster. Our findings show that MobileSAM can be rapidly fine-tuned and benefits from decentralised learning, considering the constraints imposed by the simulated orbital environment. We observe improvements in segmentation performance with minimal training data and fast fine-tuning when satellites frequently communicate model updates. This study contributes to the field of onboard AI by emphasising the benefits of decentralised learning and fine-tuning pre-trained models for rapid response scenarios. Our work builds on recent related research at a critical time; as extreme weather events increase in frequency and magnitude, rapid response with onboard data analysis is essential.

Title: Adaptive Client Selection with Personalization for Communication Efficient Federated Learning

Authors: Allan M. de Souza, Filipe Maciel, Joahannes B. D. da Costa, Luiz F. Bittencourt, Eduardo Cerqueira, Antonio A. F. Loureiro, Leandro A. Villas
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2411.17833
Pdf URL: https://arxiv.org/pdf/2411.17833
Copy Paste: [[2411.17833]] Adaptive Client Selection with Personalization for Communication Efficient Federated Learning(https://arxiv.org/abs/2411.17833)
Keywords: federate
Abstract: Federated Learning (FL) is a distributed approach to collaboratively training machine learning models. FL requires a high level of communication between the devices and a central server, thus imposing several challenges, including communication bottlenecks and network scalability. This article introduces ACSP-FL (this https URL), a solution to reduce the overall communication and computation costs for training a model in FL environments. ACSP-FL employs a client selection strategy that dynamically adapts the number of devices training the model and the number of rounds required to achieve convergence. Moreover, ACSP-FL enables model personalization to improve clients' performance. A use case based on human activity recognition datasets aims to show the impact and benefits of ACSP-FL when compared to state-of-the-art approaches. Experimental evaluations show that ACSP-FL minimizes the overall communication and computation overheads to train a model and that the system converges efficiently. In particular, ACSP-FL reduces communication by up to 95% compared to approaches from the literature while providing good convergence even in scenarios where data is distributed in a non-independent and identically distributed (non-IID) way across client devices.

Title: Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

Authors: Mohamed Rashad
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17835
Pdf URL: https://arxiv.org/pdf/2411.17835
Copy Paste: [[2411.17835]] Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction(https://arxiv.org/abs/2411.17835)
Keywords: extraction, transformer
Abstract: We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at this https URL.

Title: OracleSage: Towards Unified Visual-Linguistic Understanding of Oracle Bone Scripts through Cross-Modal Knowledge Fusion

Authors: Hanqi Jiang, Yi Pan, Junhao Chen, Zhengliang Liu, Yifan Zhou, Peng Shu, Yiwei Li, Huaqin Zhao, Stephen Mihm, Lewis C Howe, Tianming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17837
Pdf URL: https://arxiv.org/pdf/2411.17837
Copy Paste: [[2411.17837]] OracleSage: Towards Unified Visual-Linguistic Understanding of Oracle Bone Scripts through Cross-Modal Knowledge Fusion(https://arxiv.org/abs/2411.17837)
Keywords: extraction
Abstract: Oracle bone script (OBS), as China's earliest mature writing system, presents significant challenges in automatic recognition due to its complex pictographic structures and divergence from modern Chinese characters. We introduce OracleSage, a novel cross-modal framework that integrates hierarchical visual understanding with graph-based semantic reasoning. Specifically, we propose (1) a Hierarchical Visual-Semantic Understanding module that enables multi-granularity feature extraction through progressive fine-tuning of LLaVA's visual backbone, (2) a Graph-based Semantic Reasoning Framework that captures relationships between visual components and semantic concepts through dynamic message passing, and (3) OracleSem, a semantically enriched OBS dataset with comprehensive pictographic and semantic annotations. Experimental results demonstrate that OracleSage significantly outperforms state-of-the-art vision-language models. This research establishes a new paradigm for ancient text interpretation while providing valuable technical support for archaeological studies.

Title: Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering

Authors: Christopher Holder, Anthony Bagnall
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17838
Pdf URL: https://arxiv.org/pdf/2411.17838
Copy Paste: [[2411.17838]] Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering(https://arxiv.org/abs/2411.17838)
Keywords: segmentation
Abstract: Time series data has become increasingly prevalent across numerous domains, driving a growing demand for time series machine learning techniques. Among these, time series clustering (TSCL) stands out as one of the most popular machine learning tasks. TSCL serves as a powerful exploratory analysis tool and is also employed as a preprocessing step or subroutine for various tasks, including anomaly detection, segmentation, and classification. The most popular TSCL algorithms are either fast (in terms of run time) but perform poorly on benchmark problems, or perform well on benchmarks but scale poorly. We present a new TSCL algorithm, the $k$-means (K) accelerated (A) Stochastic subgradient (S) Barycentre (B) Average (A) (KASBA) clustering algorithm. KASBA is a $k$-means clustering algorithm that uses the Move-Split-Merge (MSM) elastic distance at all stages of clustering, applies a randomised stochastic subgradient descent to find barycentre centroids, links each stage of clustering to accelerate convergence, and exploits the metric property of MSM distance to avoid a large proportion of distance calculations. It is a versatile and scalable clusterer designed for real-world TSCL applications. It allows practitioners to balance run time and clustering performance. We demonstrate through extensive experimentation that KASBA produces significantly better clustering than the faster state-of-the-art clusterers and offers orders of magnitude improvement in run time over the most performant $k$-means alternatives.

Title: LongKey: Keyphrase Extraction for Long Documents

Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17863
Pdf URL: https://arxiv.org/pdf/2411.17863
Copy Paste: [[2411.17863]] LongKey: Keyphrase Extraction for Long Documents(https://arxiv.org/abs/2411.17863)
Keywords: extraction
Abstract: In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.

Title: Generative Image Layer Decomposition with Visual Effects

Authors: Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17864
Pdf URL: https://arxiv.org/pdf/2411.17864
Copy Paste: [[2411.17864]] Generative Image Layer Decomposition with Visual Effects(https://arxiv.org/abs/2411.17864)
Keywords: diffusion, generative
Abstract: Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose images into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose $\textbf{LayerDecomp}$, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which forces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing. The project page is this https URL.

Title: Distributed Sign Momentum with Local Steps for Training Transformers

Authors: Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi Zhang, Xin Liu, Soummya Kar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17866
Pdf URL: https://arxiv.org/pdf/2411.17866
Copy Paste: [[2411.17866]] Distributed Sign Momentum with Local Steps for Training Transformers(https://arxiv.org/abs/2411.17866)
Keywords: federate, transformer
Abstract: Pre-training Transformer models is resource-intensive, and recent studies have shown that sign momentum is an efficient technique for training large-scale deep learning models, particularly Transformers. However, its application in distributed training or federated learning remains underexplored. This paper investigates a novel communication-efficient distributed sign momentum method with local updates. Our proposed method allows for a broad class of base optimizers for local updates, and uses sign momentum in global updates, where momentum is generated from differences accumulated during local steps. We evaluate our method on the pre-training of various GPT-2 models, and the empirical results show significant improvement compared to other distributed methods with local updates. Furthermore, by approximating the sign operator with a randomized version that acts as a continuous analog in expectation, we present an $O(1/\sqrt{T})$ convergence for one instance of the proposed method for nonconvex smooth functions.
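
The structure described in the abstract (local updates on each worker, then a global sign-momentum step driven by the accumulated differences) can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: the base optimizer is plain SGD, the objective is a quadratic, and the hyperparameters `beta`, `eta`, `local_lr`, and `tau` as well as all function names are chosen only for this example.

```python
import numpy as np


def local_sgd(x, target, lr, steps):
    """Run plain SGD on f(x) = 0.5 * ||x - target||^2, starting from x."""
    x = x.copy()
    for _ in range(steps):
        x -= lr * (x - target)           # gradient of the quadratic
    return x


def loss(x, target):
    """Global objective value at x."""
    return 0.5 * float(np.sum((x - target) ** 2))


target = np.array([1.0, -2.0, 3.0, -4.0])
offsets = np.array([[0.5, 0.0, -0.5, 0.0], [-0.5, 0.0, 0.5, 0.0]])
worker_targets = target + offsets        # heterogeneous local objectives

x = np.zeros(4)                          # global model
m = np.zeros(4)                          # global momentum buffer
beta, eta, local_lr, tau = 0.9, 0.05, 0.1, 5
initial_loss = loss(x, target)

for _ in range(200):
    # Each worker takes tau local steps, then reports how far it moved.
    diffs = [x - local_sgd(x, t, local_lr, tau) for t in worker_targets]
    d = np.mean(diffs, axis=0)           # averaged accumulated difference
    m = beta * m + (1.0 - beta) * d      # momentum built from the differences
    x = x - eta * np.sign(m)             # sign-based global update

final_loss = loss(x, target)
```

Workers communicate once per `tau` local steps rather than once per gradient step, which is where the communication saving in this kind of scheme comes from.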

Title: Leveraging Large Language Models and Topic Modeling for Toxicity Classification

Authors: Haniyeh Ehsani Oskouie, Christina Chance, Claire Huang, Margaret Capetz, Elizabeth Eyeson, Majid Sarrafzadeh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17876
Pdf URL: https://arxiv.org/pdf/2411.17876
Copy Paste: [[2411.17876]] Leveraging Large Language Models and Topic Modeling for Toxicity Classification(https://arxiv.org/abs/2411.17876)
Keywords: large language model
Abstract: Content moderation and toxicity classification represent critical tasks with significant social implications. However, studies have shown that major classification models exhibit tendencies to magnify or reduce biases and potentially overlook or disadvantage certain marginalized groups within their classification processes. Researchers suggest that the positionality of annotators influences the gold standard labels, so that models trained on those labels propagate annotators' bias. To further investigate the impact of annotator positionality, we delve into fine-tuning BERTweet and HateBERT on the dataset while using topic-modeling strategies for content moderation. The results indicate that fine-tuning the models on specific topics results in a notable improvement in the F1 score of the models when compared to the predictions generated by other prominent classification models such as GPT-4, PerspectiveAPI, and RewireAPI. These findings further reveal that the state-of-the-art large language models exhibit significant limitations in accurately detecting and interpreting text toxicity compared with earlier methodologies. Code is available at this https URL.

Title: Multimodal Crash Likelihood Prediction: A Complexity-Infused Approach Integrating Semantic, Contextual, and Driving Features

Authors: Meng Wang, Zach Noonan, Pnina Gershon, Shannon C. Roberts
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17886
Pdf URL: https://arxiv.org/pdf/2411.17886
Copy Paste: [[2411.17886]] Multimodal Crash Likelihood Prediction: A Complexity-Infused Approach Integrating Semantic, Contextual, and Driving Features(https://arxiv.org/abs/2411.17886)
Keywords: large language model
Abstract: Predicting crash likelihood in complex driving environments is essential for improving traffic safety and advancing autonomous driving. Previous studies have used statistical models and deep learning to predict crashes based on semantic, contextual, or driving features, but none have examined the combined influence of these factors, termed roadway complexity in this study. This paper introduces a two-stage framework that integrates roadway complexity features for crash prediction. In the first stage, an encoder extracts hidden contextual information from these features, generating complexity-infused features. The second stage uses both original and complexity-infused features to predict crash likelihood, achieving an accuracy of 87.98% with original features alone and 90.15% with the added complexity-infused features. Ablation studies confirm that a combination of semantic, driving, and contextual features yields the best results, which emphasize their role in capturing roadway complexity. Additionally, complexity index annotations generated by Large Language Models outperform those by Amazon Mechanical Turk, highlighting the potential of automated tools for accurate, scalable crash prediction systems.

Title: HOPPR Medical-Grade Platform for Medical Imaging AI

Authors: Kalina P. Slavkova, Melanie Traughber, Oliver Chen, Robert Bakos, Shayna Goldstein, Dan Harms, Bradley J. Erickson, Khan M. Siddiqui
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17891
Pdf URL: https://arxiv.org/pdf/2411.17891
Copy Paste: [[2411.17891]] HOPPR Medical-Grade Platform for Medical Imaging AI(https://arxiv.org/abs/2411.17891)
Keywords: secure, robust
Abstract: Technological advances in artificial intelligence (AI) have enabled the development of large vision language models (LVLMs) that are trained on millions of paired image and text samples. Subsequent research efforts have demonstrated great potential of LVLMs to achieve high performance in medical imaging use cases (e.g., radiology report generation), but there remain barriers that hinder the ability to deploy these solutions broadly. These include the cost of extensive computational requirements for developing large scale models, expertise in the development of sophisticated AI models, and the difficulty in accessing substantially large, high-quality datasets that adequately represent the population in which the LVLM solution is to be deployed. The HOPPR Medical-Grade Platform addresses these barriers by providing powerful computational infrastructure, a suite of foundation models on top of which developers can fine-tune for their specific use cases, and a robust quality management system that sets a standard for evaluating fine-tuned models for deployment in clinical settings. The HOPPR Platform has access to millions of imaging studies and text reports sourced from hundreds of imaging centers from diverse populations to pretrain foundation models and enable use case-specific cohorts for fine-tuning. All data are deidentified and securely stored for HIPAA compliance. Additionally, developers can securely host models on the HOPPR platform and access them via an API to make inferences using these models within established clinical workflows. With the Medical-Grade Platform, HOPPR's mission is to expedite the deployment of LVLM solutions for medical imaging and ultimately optimize radiologists' workflows and meet the growing demands of the field.

Title: Automating grapevine LAI features estimation with UAV imagery and machine learning

Authors: Muhammad Waseem Akram, Marco Vannucci, Giorgio Buttazzo, Valentina Colla, Stefano Roccella, Andrea Vannini, Giovanni Caruso, Simone Nesi, Alessandra Francini, Luca Sebastiani
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17897
Pdf URL: https://arxiv.org/pdf/2411.17897
Copy Paste: [[2411.17897]] Automating grapevine LAI features estimation with UAV imagery and machine learning(https://arxiv.org/abs/2411.17897)
Keywords: extraction
Abstract: The leaf area index determines crop health and growth. Traditional methods for calculating it are time-consuming, destructive, costly, and limited in scale. In this study, we automate the index estimation method using drone image data of grapevine plants and a machine learning model. Traditional feature extraction and deep learning methods are used to obtain helpful information from the data and enhance the performance of the different machine learning models employed for the leaf area index prediction. The results showed that deep learning based feature extraction is more effective than traditional methods. The new approach is a significant improvement over old methods, offering a faster, non-destructive, and cost-effective leaf area index calculation, which enhances precision agriculture practices.

Title: Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2411.17911
Pdf URL: https://arxiv.org/pdf/2411.17911
Copy Paste: [[2411.17911]] Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey(https://arxiv.org/abs/2411.17911)
Keywords: security, robust, interpretability, generative
Abstract: In recent years, deepfakes (DFs) have been utilized for malicious purposes, such as individual impersonation, misinformation spreading, and artists' style imitation, raising ethical and security concerns. However, existing surveys have focused on the accuracy of passive DF detection approaches for single modalities, such as image, video or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to include generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmark, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.

Title: DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

Authors: Boqi Li, Haojie Zhu, Henry X. Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.17917
Pdf URL: https://arxiv.org/pdf/2411.17917
Copy Paste: [[2411.17917]] DECODE: Domain-aware Continual Domain Expansion for Motion Prediction(https://arxiv.org/abs/2411.17917)
Keywords: robust
Abstract: Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
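
The abstract reports minADE without defining it. The sketch below uses the definition that is common in motion prediction (the average per-timestep Euclidean error of the best of K predicted trajectories); whether the paper uses exactly this variant is an assumption, and the function name and toy trajectories are illustrative only.

```python
import numpy as np


def min_ade(predictions, ground_truth):
    """Minimum average displacement error over K candidate trajectories.

    predictions: array of shape (K, T, 2), K predicted trajectories of T steps.
    ground_truth: array of shape (T, 2).
    """
    # Per-timestep Euclidean distance for every candidate: shape (K, T).
    dists = np.linalg.norm(predictions - ground_truth[None, :, :], axis=-1)
    ade_per_mode = dists.mean(axis=1)    # average over time for each candidate
    return float(ade_per_mode.min())     # keep the best candidate


gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
preds = np.stack([
    gt + np.array([0.0, 1.0]),           # constant 1.0 m lateral offset
    gt + np.array([0.0, 0.5]),           # constant 0.5 m lateral offset
])
score = min_ade(preds, gt)               # best candidate is the second one
```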

Title: Exploring Superpixel Segmentation Methods in the Context of Citizen Science and Deforestation Detection

Authors: Hugo Resende, Isabela Borlido, Victor Sundermann, Eduardo B. Neto, Silvio Jamil F. Guimarães, Fabio Faria, Alvaro Luiz Fazenda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17922
Pdf URL: https://arxiv.org/pdf/2411.17922
Copy Paste: [[2411.17922]] Exploring Superpixel Segmentation Methods in the Context of Citizen Science and Deforestation Detection(https://arxiv.org/abs/2411.17922)
Keywords: segmentation
Abstract: Tropical forests play an essential role in the planet's ecosystem, making the conservation of these biomes a worldwide priority. However, ongoing deforestation and degradation pose a significant threat to their existence, necessitating effective monitoring and the proposal of actions to mitigate the damage caused by these processes. In this regard, initiatives range from government and private sector monitoring programs to solutions based on citizen science campaigns, for example. Particularly in the context of citizen science campaigns, the segmentation of remote sensing images to identify deforested areas and subsequently submit them to analysis by non-specialized volunteers is necessary. Thus, segmentation using superpixel-based techniques proves to be a viable solution for this important task. Therefore, this paper presents an analysis of 22 superpixel-based segmentation methods applied to remote sensing images, aiming to identify which of them are more suitable for generating segments for citizen science campaigns. The results reveal that seven of the segmentation methods outperformed the baseline method (SLIC) currently employed in the ForestEyes citizen science project, indicating an opportunity for improvement in this important stage of campaign development.

Title: A Practical Approach to Formal Methods: An Eclipse Integrated Development Environment (IDE) for Security Protocols

Authors: Rémi Garcia, Paolo Modesti
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2411.17926
Pdf URL: https://arxiv.org/pdf/2411.17926
Copy Paste: [[2411.17926]] A Practical Approach to Formal Methods: An Eclipse Integrated Development Environment (IDE) for Security Protocols(https://arxiv.org/abs/2411.17926)
Keywords: security
Abstract: To develop trustworthy distributed systems, verification techniques and formal methods, including lightweight and practical approaches, have been employed to certify the design or implementation of security protocols. Lightweight formal methods offer a more accessible alternative to traditional fully formalised techniques by focusing on simplified models and tool support, making them more applicable in practical settings. The technical advantages of formal verification over manual testing are increasingly recognised in the cybersecurity community. However, for practitioners, formal modelling and verification are often too complex and unfamiliar to be used routinely. In this paper, we present an Eclipse IDE for the design, verification, and implementation of security protocols and evaluate its effectiveness, including feedback from users in educational settings. It offers user-friendly assistance in the formalisation process as part of a Model-Driven Development approach. This IDE centres around the Alice & Bob (AnB) notation, the AnBx Compiler and Code Generator, the OFMC model checker, and the ProVerif cryptographic protocol verifier. For the evaluation, we identify the six most prominent limiting factors for formal method adoption, based on relevant literature in this field, and we consider the IDE's effectiveness against those criteria. Additionally, we conducted a structured survey to collect feedback from university students who have used the toolkit for their projects. The findings demonstrate that this contribution is valuable as a workflow aid and helps users grasp essential cybersecurity concepts, even for those with limited knowledge of formal methods or cryptography. Crucially, users reported that the IDE has been an important component in completing their projects and that they would use it again in the future, given the opportunity.

Title: Combining Threat Intelligence with IoT Scanning to Predict Cyber Attack

Authors: Jubin Abhishek Soni
Subjects: cs.CR, cs.AI, cs.CY, cs.NI
Abstract URL: https://arxiv.org/abs/2411.17931
Pdf URL: https://arxiv.org/pdf/2411.17931
Copy Paste: [[2411.17931]] Combining Threat Intelligence with IoT Scanning to Predict Cyber Attack(https://arxiv.org/abs/2411.17931)
Keywords: security, attack
Abstract: While the Web has become a worldwide platform for communication, hackers and hacktivists share their ideology and communicate with members on the "Dark Web" - the reverse of the Web. Currently, the problems of information overload and the difficulty of obtaining a comprehensive picture of hackers and cyber-attackers hinder effective analysis and prediction of their activities on the Web. Also, there are currently more objects connected to the internet than there are people in the world, and this gap will continue to grow as more and more objects gain the ability to directly interface with the Internet. Many technical communities are vigorously pursuing research topics that contribute to the Internet of Things (IoT). In this paper, we propose a novel methodology for collecting and analyzing Dark Web information to identify websites of hackers from the Web sea, and show how this information can help predict IoT vulnerabilities. This methodology incorporates information collection, analysis, visualization techniques, and exploits some of the IoT devices. Through this research we want to contribute to the existing literature on cyber-security that could potentially guide both policy-making and intelligence research.

Title: Neural Networks Use Distance Metrics

Authors: Alan Oursland
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17932
Pdf URL: https://arxiv.org/pdf/2411.17932
Copy Paste: [[2411.17932]] Neural Networks Use Distance Metrics(https://arxiv.org/abs/2411.17932)
Keywords: robust
Abstract: We present empirical evidence that neural networks with ReLU and Absolute Value activations learn distance-based representations. We independently manipulate both distance and intensity properties of internal activations in trained models, finding that both architectures are highly sensitive to small distance-based perturbations while maintaining robust performance under large intensity-based perturbations. These findings challenge the prevailing intensity-based interpretation of neural network activations and offer new insights into their learning and decision-making processes.

Title: Stealthy Multi-Task Adversarial Attacks

Authors: Jiacheng Guo, Tianyun Zhang, Lei Li, Haochen Yang, Hongkai Yu, Minghai Qin
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17936
Pdf URL: https://arxiv.org/pdf/2411.17936
Copy Paste: [[2411.17936]] Stealthy Multi-Task Adversarial Attacks(https://arxiv.org/abs/2411.17936)
Keywords: security, attack, steal
Abstract: Deep Neural Networks exhibit inherent vulnerabilities to adversarial attacks, which can significantly compromise their outputs and reliability. While existing research primarily focuses on attacking single-task scenarios or indiscriminately targeting all tasks in multi-task environments, we investigate selectively targeting one task while preserving performance in others within a multi-task framework. This approach is motivated by varying security priorities among tasks in real-world applications, such as autonomous driving, where misinterpreting critical objects (e.g., signs, traffic lights) poses a greater security risk than minor depth miscalculations. Consequently, attackers may aim to target security-sensitive tasks while leaving non-critical tasks uncompromised, thus evading detection before compromising crucial functions. In this paper, we propose a stealthy multi-task attack framework that utilizes multiple algorithms to inject imperceptible noise into the input. This novel method demonstrates remarkable efficacy in compromising the target task while simultaneously maintaining or even enhancing performance across non-targeted tasks - a criterion hitherto unexplored in the field. Additionally, we introduce an automated approach for searching the weighting factors in the loss function, further enhancing attack efficiency. Experimental results validate our framework's ability to successfully attack the target task while preserving the performance of non-targeted tasks. The automated loss function weight searching method demonstrates comparable efficacy to manual tuning, establishing a state-of-the-art multi-task attack framework.

Title: Spatio-temporal Causal Learning for Streamflow Forecasting

Authors: Shu Wan, Reepal Shah, Qi Deng, John Sabo, Huan Liu, K. Selçuk
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17937
Pdf URL: https://arxiv.org/pdf/2411.17937
Copy Paste: [[2411.17937]] Spatio-temporal Causal Learning for Streamflow Forecasting(https://arxiv.org/abs/2411.17937)
Keywords: robust
Abstract: Streamflow plays an essential role in the sustainable planning and management of national water resources. Traditional hydrologic modeling approaches simulate streamflow by establishing connections across multiple physical processes, such as rainfall and runoff. These data, inherently connected both spatially and temporally, possess intrinsic causal relations that can be leveraged for robust and accurate forecasting. Recently, spatio-temporal graph neural networks (STGNNs) have been adopted, excelling in various domains, such as urban traffic management, weather forecasting, and pandemic control, and they also promise advances in streamflow management. However, learning causal relationships directly from vast observational data is theoretically and computationally challenging. In this study, we employ a river flow graph as prior knowledge to facilitate the learning of the causal structure and then use the learned causal graph to predict streamflow at targeted sites. The proposed model, Causal Streamflow Forecasting (CSF), is tested in a real-world study in the Brazos River basin in Texas. Our results demonstrate that our method outperforms regular spatio-temporal graph neural networks and achieves higher computational efficiency compared to traditional simulation methods. By effectively integrating river flow graphs with STGNNs, this research offers a novel approach to streamflow prediction, showcasing the potential of combining advanced neural network techniques with domain-specific knowledge for enhanced performance in hydrologic modeling.

Title: Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches

Authors: Saman Sarraf
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17943
Pdf URL: https://arxiv.org/pdf/2411.17943
Copy Paste: [[2411.17943]] Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches(https://arxiv.org/abs/2411.17943)
Keywords: robust, generative
Abstract: Generative AI (GenAI) has revolutionized content generation, offering transformative capabilities for improving language coherence, readability, and overall quality. This manuscript explores the application of qualitative, quantitative, and mixed-methods research approaches to evaluate the performance of GenAI models in enhancing scientific writing. Using a hypothetical use case involving a collaborative medical imaging manuscript, we demonstrate how each method provides unique insights into the impact of GenAI. Qualitative methods gather in-depth feedback from expert reviewers, analyzing their responses using thematic analysis tools to capture nuanced improvements and identify limitations. Quantitative approaches employ automated metrics such as BLEU, ROUGE, and readability scores, as well as user surveys, to objectively measure improvements in coherence, fluency, and structure. Mixed-methods research integrates these strengths, combining statistical evaluations with detailed qualitative insights to provide a comprehensive assessment. These research methods enable quantifying improvement levels in GenAI-generated content, addressing critical aspects of linguistic quality and technical accuracy. They also offer a robust framework for benchmarking GenAI tools against traditional editing processes, ensuring the reliability and effectiveness of these technologies. By leveraging these methodologies, researchers can evaluate the performance boost driven by GenAI, refine its applications, and guide its responsible adoption in high-stakes domains like healthcare and scientific research. This work underscores the importance of rigorous evaluation frameworks for advancing trust and innovation in GenAI.

Title: MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Authors: Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17945
Pdf URL: https://arxiv.org/pdf/2411.17945
Copy Paste: [[2411.17945]] MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation(https://arxiv.org/abs/2411.17945)
Keywords: diffusion
Abstract: Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information in our annotation and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators.

Title: ROICtrl: Boosting Instance Control for Visual Generation

Authors: Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17949
Pdf URL: https://arxiv.org/pdf/2411.17949
Copy Paste: [[2411.17949]] ROICtrl: Boosting Instance Control for Visual Generation(https://arxiv.org/abs/2411.17949)
Keywords: diffusion
Abstract: Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.

Title: Optimization-Free Image Immunization Against Diffusion-Based Editing

Authors: Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17957
Pdf URL: https://arxiv.org/pdf/2411.17957
Copy Paste: [[2411.17957]] Optimization-Free Image Immunization Against Diffusion-Based Editing(https://arxiv.org/abs/2411.17957)
Keywords: protect, defense, attack, robust, diffusion
Abstract: Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image, taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds, a 250,000x speedup. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code is provided on our project webpage.

Title: Adversarial Training in Low-Label Regimes with Margin-Based Interpolation

Authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17959
Pdf URL: https://arxiv.org/pdf/2411.17959
Copy Paste: [[2411.17959]] Adversarial Training in Low-Label Regimes with Margin-Based Interpolation(https://arxiv.org/abs/2411.17959)
Keywords: attack, robust
Abstract: Adversarial training has emerged as an effective approach to train robust neural network models that are resistant to adversarial attacks, even in low-label regimes where labeled data is scarce. In this paper, we introduce a novel semi-supervised adversarial training approach that enhances both robustness and natural accuracy by generating effective adversarial examples. Our method begins by applying linear interpolation between clean and adversarial examples to create interpolated adversarial examples that cross decision boundaries by a controlled margin. This sample-aware strategy tailors adversarial examples to the characteristics of each data point, enabling the model to learn from the most informative perturbations. Additionally, we propose a global epsilon scheduling strategy that progressively adjusts the upper bound of perturbation strengths during training. The combination of these strategies allows the model to develop increasingly complex decision boundaries with better robustness and natural accuracy. Empirical evaluations show that our approach effectively enhances performance against various adversarial attacks, such as PGD and AutoAttack.
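
Illustrative sketch (not the authors' code): the abstract describes interpolating linearly between a clean and an adversarial example so that the result crosses the decision boundary by a controlled margin. The snippet below shows that idea for a hypothetical linear binary classifier, where the interpolation coefficient can be solved in closed form. The classifier, the function names, and the margin value are all assumptions made for illustration; the paper's actual procedure for neural networks may differ.

```python
def score(w, b, x):
    """Signed score of a linear classifier; positive means the clean class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def interpolate(x_clean, x_adv, t):
    """Linear interpolation: t = 0 gives the clean point, t = 1 the adversarial one."""
    return [(1 - t) * c + t * a for c, a in zip(x_clean, x_adv)]


def margin_interpolation(w, b, x_clean, x_adv, margin):
    """Find t so that the interpolated point has score -margin.

    Because the score is linear in x, the score along the segment is
    (1 - t) * s_clean + t * s_adv, which can be solved for t directly.
    """
    s_clean = score(w, b, x_clean)
    s_adv = score(w, b, x_adv)
    if s_clean <= 0 or s_adv >= 0:
        raise ValueError("expected a clean point with positive score and an "
                         "adversarial point with negative score")
    t = (s_clean + margin) / (s_clean - s_adv)
    t = min(1.0, max(0.0, t))  # never step past the adversarial example
    return t, interpolate(x_clean, x_adv, t)


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
w, b = [1.0, -1.0], 0.0
x_clean, x_adv = [2.0, 0.0], [0.0, 2.0]
t, x_mix = margin_interpolation(w, b, x_clean, x_adv, margin=0.5)
print(t, x_mix, score(w, b, x_mix))
```

A larger margin pushes the interpolated point further past the boundary, up to the adversarial example itself.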

Title: Optimized Tradeoffs for Private Prediction with Majority Ensembling

Authors: Shuli Jiang, Qiuyi (Richard) Zhang, Gauri Joshi
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.17965
Pdf URL: https://arxiv.org/pdf/2411.17965
Copy Paste: [[2411.17965]] Optimized Tradeoffs for Private Prediction with Majority Ensembling(https://arxiv.org/abs/2411.17965)
Keywords: privacy
Abstract: We study a classical problem in private prediction, the problem of computing an $(m\epsilon, \delta)$-differentially private majority of $K$ $(\epsilon, \Delta)$-differentially private algorithms for $1 \leq m \leq K$ and $1 > \delta \geq \Delta \geq 0$. Standard methods such as subsampling or randomized response are widely used, but do they provide optimal privacy-utility tradeoffs? To answer this, we introduce the Data-dependent Randomized Response Majority (DaRRM) algorithm. It is parameterized by a data-dependent noise function $\gamma$, and enables efficient utility optimization over the class of all private algorithms, encompassing those standard methods. We show that maximizing the utility of an $(m\epsilon, \delta)$-private majority algorithm can be computed tractably through an optimization problem for any $m \leq K$ by a novel structural result that reduces the infinitely many privacy constraints into a polynomial set. In some settings, we show that DaRRM provably enjoys a privacy gain of a factor of 2 over common baselines, with fixed utility. Lastly, we demonstrate the strong empirical effectiveness of our first-of-its-kind privacy-constrained utility optimization for ensembling labels for private prediction from private teachers in image classification. Notably, our DaRRM framework with an optimized $\gamma$ exhibits substantial utility gains when compared against several baselines.
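
Illustrative sketch (not the authors' algorithm): the abstract does not spell out the mechanism, so the following is only a generic randomized-response majority under stated assumptions. With probability gamma(s), where s is the number of 1-votes, it returns the true majority; otherwise it returns a fair coin flip. The function names, the tie-breaking rule, and the example gamma are hypothetical, and no privacy guarantee is claimed for any particular choice of gamma.

```python
import random


def majority(votes):
    """Majority of binary votes; ties are broken toward 1 (an arbitrary choice)."""
    return 1 if 2 * sum(votes) >= len(votes) else 0


def rr_majority(votes, gamma, rng):
    """Randomized-response majority with a data-dependent noise function.

    gamma maps the number of 1-votes to the probability of answering
    truthfully; with the remaining probability a uniform random bit is returned.
    """
    p_truth = gamma(sum(votes))
    if not 0.0 <= p_truth <= 1.0:
        raise ValueError("gamma must return a probability in [0, 1]")
    if rng.random() < p_truth:
        return majority(votes)
    return rng.randint(0, 1)


# Hypothetical gamma: more truthful when the vote is lopsided.
votes = [1, 1, 1, 0, 1]
k = len(votes)
gamma = lambda s: abs(2 * s - k) / k
rng = random.Random(0)
samples = [rr_majority(votes, gamma, rng) for _ in range(1000)]
print(sum(samples) / len(samples))
```

Setting gamma to a constant recovers plain randomized response applied to the majority vote, which is the kind of baseline the abstract compares against.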

Title: QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions

Authors: Ramez Kouzy, Roxanna Attar-Olyaee, Michael K. Rooney, Comron J. Hassanzadeh, Junyi Jessy Li, Osama Mohamad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17967
Pdf URL: https://arxiv.org/pdf/2411.17967
Copy Paste: [[2411.17967]] QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions(https://arxiv.org/abs/2411.17967)
Keywords: extraction, large language model
Abstract: Health-related discussions on social media like Reddit offer valuable insights, but extracting quantitative data from unstructured text is challenging. In this work, we present an adapted framework from QuaLLM into QuaLLM-Health for extracting clinically relevant quantitative data from Reddit discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large language models (LLMs). We collected 410k posts and comments from five GLP-1-related communities using the Reddit API in July 2024. After filtering for cancer-related discussions, 2,059 unique entries remained. We developed annotation guidelines to manually extract variables such as cancer survivorship, family cancer history, cancer types mentioned, risk perceptions, and discussions with physicians. Two domain experts independently annotated a random sample of 100 entries to create a gold-standard dataset. We then employed iterative prompt engineering with OpenAI's "GPT-4o-mini" on the gold-standard dataset to build an optimized pipeline that allowed us to extract variables from the large dataset. The optimized LLM achieved accuracies above 0.85 for all variables, with macro-averaged precision, recall, and F1 score above 0.90, indicating balanced performance. Stability testing showed a 95% match rate across runs, confirming consistency. Applying the framework to the full dataset enabled efficient extraction of variables necessary for downstream analysis, costing under $3 and completing in approximately one hour. QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract clinically relevant quantitative data from unstructured social media content. Incorporating human expertise and iterative prompt refinement ensures accuracy and reliability. This methodology can be adapted for large-scale analysis of patient-generated data across various health domains, facilitating valuable insights for healthcare research.
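
Illustrative sketch (not the authors' evaluation code): the abstract reports macro-averaged precision, recall, and F1. The snippet below shows one common way to compute them, assuming macro F1 is the unweighted mean of per-class F1 scores. The labels are made-up toy data, not results from the paper.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (zero denominator) are reported as 0.0 here.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy gold-standard vs. extracted labels for a single binary variable.
gold = ["yes", "yes", "no", "no"]
extracted = ["yes", "no", "no", "no"]
p, r, f = macro_scores(gold, extracted)
print(round(p, 3), round(r, 3), round(f, 3))
```

Macro averaging weights every class equally, so a rare class that is extracted poorly lowers the score as much as a common one would.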

Title: Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery

Authors: Zhenyu Yu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17973
Pdf URL: https://arxiv.org/pdf/2411.17973
Copy Paste: [[2411.17973]] Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery(https://arxiv.org/abs/2411.17973)
Keywords: robust, extraction, diffusion, generative
Abstract: The forest serves as the most significant terrestrial carbon stock mechanism, effectively reducing atmospheric CO$_2$ concentrations and mitigating climate change. Remote sensing provides high data accuracy and enables large-scale observations. Optical images facilitate long-term monitoring, which is crucial for future carbon stock estimation studies. This study focuses on Huize County, Qujing City, Yunnan Province, China, utilizing GF-1 WFV satellite imagery. The KD-VGG and KD-UNet modules were introduced for initial feature extraction, and the improved implicit diffusion model (IIDM) was proposed. The results showed: (1) The VGG module improved initial feature extraction, increasing accuracy and reducing inference time with optimized model parameters. (2) The Cross-attention + MLPs module enabled effective feature fusion, establishing critical relationships between global and local features, achieving high-accuracy estimation. (3) The IIDM model, a novel contribution, demonstrated the highest estimation accuracy with an RMSE of 12.17%, a significant improvement of 41.69% to 42.33% over the regression model. In carbon stock estimation, the generative model excelled in extracting deeper features, significantly outperforming other models, demonstrating the feasibility of AI-generated content in quantitative remote sensing. The 16-meter resolution estimates provide a robust basis for tailoring forest carbon sink regulations, enhancing regional carbon stock management.

Title: RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

Authors: Huiyang Hu, Peijin Wang, Hanbo Bi, Boyuan Tong, Zhaozhi Wang, Wenhui Diao, Hao Chang, Yingchao Feng, Ziqi Zhang, Qixiang Ye, Kun Fu, Xian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17984
Pdf URL: https://arxiv.org/pdf/2411.17984
Copy Paste: [[2411.17984]] RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model(https://arxiv.org/abs/2411.17984)
Keywords: interpretability, diffusion
Abstract: Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with high-resolution remote sensing images. To overcome these, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, we reduce memory consumption by 84%, decrease FLOPs by 24%, and improve throughput by 2.7 times.

Title: Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery

Authors: Xiaoxuan Li, Yao Liu, Ruoyu Wang, Lina Yao
Subjects: cs.LG, cs.AI, stat.ME
Abstract URL: https://arxiv.org/abs/2411.17989
Pdf URL: https://arxiv.org/pdf/2411.17989
Copy Paste: [[2411.17989]] Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery(https://arxiv.org/abs/2411.17989)
Keywords: large language model
Abstract: As the significance of understanding the cause-and-effect relationships among variables increases in the development of modern systems and algorithms, learning causality from observational data has become a preferred and efficient approach over conducting randomized control trials. However, purely observational data could be insufficient to reconstruct the true causal graph. Consequently, many researchers have tried to utilise some form of prior knowledge to improve the causal discovery process. In this context, the impressive capabilities of large language models (LLMs) have emerged as a promising alternative to the costly acquisition of prior expert knowledge. In this work, we further explore the potential of using LLMs to enhance causal discovery approaches, particularly focusing on score-based methods, and we propose a general framework to utilise the capacity of not only one but multiple LLMs to augment the discovery process.

Title: VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Authors: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17991
Pdf URL: https://arxiv.org/pdf/2411.17991
Copy Paste: [[2411.17991]] VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format(https://arxiv.org/abs/2411.17991)
Keywords: large language model
Abstract: Recent research on video large language models (VideoLLM) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension where videos do not end and responses are required in a real-time manner, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training efforts, and also enables VideoLLMs to reply in a real-time manner as the video plays. Code, data and demo are available at: this https URL.

Title: New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

Authors: Andreas Madsen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17992
Pdf URL: https://arxiv.org/pdf/2411.17992
Copy Paste: [[2411.17992]] New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing(https://arxiv.org/abs/2411.17992)
Keywords: interpretability, large language model
Abstract: As machine learning becomes more widespread and is used in more critical applications, it's important to provide explanations for these models, to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?" The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from this investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea in self-explanations is to have large language models explain themselves; we find that current models are not capable of doing this consistently. However, we suggest how this could be achieved. The idea of FMMs is to create models that are designed such that measuring faithfulness is cheap and precise. This makes it possible to optimize an explanation towards maximum faithfulness, which makes FMMs designed to be explained. We find that FMMs yield explanations that are near the theoretical optimum in terms of faithfulness. Overall, from all investigations of faithfulness, results show that post-hoc and intrinsic explanations are by default model and task-dependent. However, this was not the case when using FMMs, even with the same post-hoc explanation methods. This shows that even simple modifications to the model, such as randomly masking the training dataset, as was done in FMMs, can drastically change the situation and result in consistently faithful explanations. This answers the question of how to provide and ensure faithful explanations.

Title: DRS: Deep Question Reformulation With Structured Output

Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.17993
Pdf URL: https://arxiv.org/pdf/2411.17993
Copy Paste: [[2411.17993]] DRS: Deep Question Reformulation With Structured Output(https://arxiv.org/abs/2411.17993)
Keywords: large language model
Abstract: Question answering is a fundamental capability of large language models (LLMs). However, when people encounter completely new knowledge texts, they often ask questions that the text cannot answer due to a lack of understanding of the knowledge. Recent research shows that large language models can identify the unanswerability of questions, but they lack the ability to help people reformulate their questions. Even powerful models like GPT-3.5 perform poorly in this regard. To enhance the ability of LLMs to assist humans in reformulating questions to extract relevant knowledge from new documents, we propose a zero-shot method called DRS: Deep Question Reformulation With Structured Output. Our proposed method leverages large language models and a DFS-based algorithm to iteratively search for possible entity combinations and constrain the output with certain entities, effectively improving the capabilities of large language models in this area. Extensive experimental results show that our zero-shot DRS method significantly improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42% and effectively improves the score of open-source large language models, such as Gemma2-9B, from 26.35% to 56.75%.
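
Illustration (not from the paper): the abstract mentions a DFS-based search over entity combinations. A minimal sketch of such a search is given below, assuming a caller-supplied acceptance check. The names `dfs_entity_search`, `is_acceptable`, and `max_size` are hypothetical, and the toy predicate merely stands in for whatever LLM-based check the authors actually use.

```python
def dfs_entity_search(entities, is_acceptable, max_size=3):
    """Depth-first search over combinations of entities.

    Returns the first combination accepted by `is_acceptable`, or None
    if no combination of at most `max_size` entities is accepted.
    """
    def visit(start, chosen):
        # Test the current combination before going deeper.
        if chosen and is_acceptable(chosen):
            return list(chosen)
        if len(chosen) == max_size:
            return None
        for i in range(start, len(entities)):
            chosen.append(entities[i])
            found = visit(i + 1, chosen)
            chosen.pop()  # backtrack
            if found is not None:
                return found
        return None

    return visit(0, [])


# Toy stand-in for the real check (an assumption): a combination is
# accepted only if it contains both of these entities.
required = {"insulin", "pancreas"}
entities = ["glucose", "insulin", "liver", "pancreas"]
combo = dfs_entity_search(entities, lambda c: required <= set(c))
```

In this sketch the search stops at the first accepted combination; a real system could instead collect several candidates and rank them.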

Title: Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Authors: Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, Yujun Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18000
Pdf URL: https://arxiv.org/pdf/2411.18000
Copy Paste: [[2411.18000]] Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models(https://arxiv.org/abs/2411.18000)
Keywords: security, defense, attack, robust
Abstract: Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming existing methods by margins of 34.37% and 12.77% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs' safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.

Title: AI-Driven Smartphone Solution for Digitizing Rapid Diagnostic Test Kits and Enhancing Accessibility for the Visually Impaired

Authors: R. B. Dastagir, J. T. Jami, S. Chanda, F. Hafiz, M. Rahman, K. Dey, M. M. Rahman, M. Qureshi, M. M. Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18007
Pdf URL: https://arxiv.org/pdf/2411.18007
Copy Paste: [[2411.18007]] AI-Driven Smartphone Solution for Digitizing Rapid Diagnostic Test Kits and Enhancing Accessibility for the Visually Impaired(https://arxiv.org/abs/2411.18007)
Keywords: robust, extraction
Abstract: Rapid diagnostic tests are crucial for timely disease detection and management, yet accurate interpretation of test results remains challenging. In this study, we propose a novel approach to enhance the accuracy and reliability of rapid diagnostic test result interpretation by integrating artificial intelligence (AI) algorithms, including convolutional neural networks (CNN), within a smartphone-based application. The app enables users to take pictures of their test kits, which YOLOv8 then processes to precisely crop and extract the membrane region, even if the test kit is not centered in the frame or is positioned at the very edge of the image. This capability offers greater accessibility, allowing even visually impaired individuals to capture test images without needing perfect alignment, thus promoting user independence and inclusivity. The extracted image is analyzed by an additional CNN classifier that determines if the results are positive, negative, or invalid, providing users with the results and a confidence level. Through validation experiments with commonly used rapid test kits across various diagnostic applications, our results demonstrate that the synergistic integration of AI significantly improves sensitivity and specificity in test result interpretation. This improvement can be attributed to the extraction of the membrane zones from the test kit images using the state-of-the-art YOLO algorithm. Additionally, we performed SHapley Additive exPlanations (SHAP) analysis to investigate the factors influencing the model's decisions, identifying reasons behind both correct and incorrect classifications. By facilitating the differentiation of genuine test lines from background noise and providing valuable insights into test line intensity and uniformity, our approach offers a robust solution to challenges in rapid test interpretation.

Title: Causal and Local Correlations Based Network for Multivariate Time Series Classification

Authors: Mingsen Du, Yanxuan Wei, Xiangwei Zheng, Cun Ji
Subjects: cs.LG, cs.AI, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2411.18008
Pdf URL: https://arxiv.org/pdf/2411.18008
Copy Paste: [[2411.18008]] Causal and Local Correlations Based Network for Multivariate Time Series Classification(https://arxiv.org/abs/2411.18008)
Keywords: extraction
Abstract: Recently, time series classification has attracted the attention of a large number of researchers, and hundreds of methods have been proposed. However, these methods often ignore the spatial correlations among dimensions and the local correlations among features. To address this issue, the causal and local correlations based network (CaLoNet) is proposed in this study for multivariate time series classification. First, pairwise spatial correlations between dimensions are modeled using causality modeling to obtain the graph structure. Then, a relationship extraction network is used to fuse local correlations to obtain long-term dependency features. Finally, the graph structure and long-term dependency features are integrated into the graph neural network. Experiments on the UEA datasets show that CaLoNet can obtain competitive performance compared with state-of-the-art methods.

Title: Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

Authors: Jiahao Zhang, Anoop Cherian, Cristian Rodriguez, Weijian Deng, Stephen Gould
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18011
Pdf URL: https://arxiv.org/pdf/2411.18011
Copy Paste: [[2411.18011]] Manual-PA: Learning 3D Part Assembly from Instruction Diagrams(https://arxiv.org/abs/2411.18011)
Keywords: transformer
Abstract: Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, thus making learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone towards predicting the assembly order and infers the 6D pose of each part via relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.

Title: Can bidirectional encoder become the ultimate winner for downstream applications of foundation models?

Authors: Lewen Yang, Xuanyu Zhou, Juao Fan, Xinyi Xie, Shengxin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18021
Pdf URL: https://arxiv.org/pdf/2411.18021
Copy Paste: [[2411.18021]] Can bidirectional encoder become the ultimate winner for downstream applications of foundation models?(https://arxiv.org/abs/2411.18021)
Keywords: extraction, transformer, generative
Abstract: Over the past few decades, Artificial Intelligence (AI) has progressed from the initial machine learning stage to the deep learning stage, and now to the stage of foundational models. Foundational models have the characteristics of pre-training, transfer learning, and self-supervised learning, and pre-trained models can be fine-tuned and applied to various downstream tasks. Under the framework of foundational models, models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) have greatly advanced the development of natural language processing (NLP), especially the emergence of many models based on BERT. BERT broke through the limitation of only using one-way methods for language modeling in pre-training by using a masked language model. It can capture bidirectional context information to predict the masked words in the sequence, which can improve the feature extraction ability of the model. This makes the model very useful for downstream tasks, especially for specialized applications. The model using the bidirectional encoder can better understand the domain knowledge and be better applied to these downstream tasks. So we hope to help understand how this technology has evolved and improved model performance in various natural language processing tasks under the background of foundational models and reveal its importance in capturing context information and improving the model's performance on downstream tasks. This article analyzes one-way and bidirectional models based on GPT and BERT and compares their differences based on the purpose of the model. It also briefly analyzes BERT and the improvements of some models based on BERT. The model's performance on the Stanford Question Answering Dataset (SQuAD) and General Language Understanding Evaluation (GLUE) was compared.

Title: Leveraging A New GAN-based Transformer with ECDH Crypto-system for Enhancing Energy Theft Detection in Smart Grid

Authors: Yang Yang, Xun Yuan, Arwa Alromih, Aryan Mohammadi Pasikhani, Prosanta Gope, Biplab Sikdar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.18023
Pdf URL: https://arxiv.org/pdf/2411.18023
Copy Paste: [[2411.18023]] Leveraging A New GAN-based Transformer with ECDH Crypto-system for Enhancing Energy Theft Detection in Smart Grid(https://arxiv.org/abs/2411.18023)
Keywords: secure, privacy, protect, attack, transformer
Abstract: Detecting energy theft is vital for effectively managing power grids, as it ensures precise billing and prevents financial losses. Split-learning emerges as a promising decentralized machine learning technique for identifying energy theft while preserving user data confidentiality. Nevertheless, traditional split learning approaches are vulnerable to privacy leakage attacks, which significantly threaten data confidentiality. To address this challenge, we propose a novel GAN-Transformer-based split learning framework in this paper. This framework leverages the strengths of the transformer architecture, which is known for its capability to process long-range dependencies in energy consumption data. Thus, it enhances the accuracy of energy theft detection without compromising user privacy. A distinctive feature of our approach is the deployment of a novel mask-based method, marking a first in its field to effectively combat privacy leakage in split learning scenarios targeted at AI-enabled adversaries. This method protects sensitive information during the model's training phase. Our experimental evaluations indicate that the proposed framework not only achieves accuracy levels comparable to conventional methods but also significantly enhances privacy protection. The results underscore the potential of the GAN-Transformer split learning framework as an effective and secure tool in the domain of energy theft detection.

Title: Privacy-preserving Robotic-based Multi-factor Authentication Scheme for Secure Automated Delivery System

Authors: Yang Yang, Aryan Mohammadi Pasikhani, Prosanta Gope, Biplab Sikdar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.18027
Pdf URL: https://arxiv.org/pdf/2411.18027
Copy Paste: [[2411.18027]] Privacy-preserving Robotic-based Multi-factor Authentication Scheme for Secure Automated Delivery System(https://arxiv.org/abs/2411.18027)
Keywords: secure, security, privacy, attack, transformer
Abstract: Package delivery is a critical aspect of various industries, but it often incurs high financial costs and inefficiencies when relying solely on human resources. The last-mile transport problem, in particular, contributes significantly to the expenditure of human resources in major companies. Robot-based delivery systems have emerged as a potential solution for last-mile delivery to address this challenge. However, robotic delivery systems still face security and privacy issues, like impersonation, replay, man-in-the-middle attacks (MITM), unlinkability, and identity theft. In this context, we propose a privacy-preserving multi-factor authentication scheme specifically designed for robot delivery systems. Additionally, AI-assisted robotic delivery systems are susceptible to machine learning-based attacks (e.g. FGSM, PGD, etc.). We introduce the first transformer-based audio-visual fusion defender to tackle this issue, which effectively provides resilience against adversarial samples. Furthermore, we provide a rigorous formal analysis of the proposed protocol and also analyse the protocol security using two popular symbolic proof tools, ProVerif and Scyther. Finally, we present a real-world implementation of the proposed robotic system with the computation cost and energy consumption analysis. Code and pre-trained models are available at: this https URL

Title: RL for Mitigating Cascading Failures: Targeted Exploration via Sensitivity Factors

Authors: Anmol Dwivedi, Ali Tajer, Santiago Paternain, Nurali Virani
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2411.18050
Pdf URL: https://arxiv.org/pdf/2411.18050
Copy Paste: [[2411.18050]] RL for Mitigating Cascading Failures: Targeted Exploration via Sensitivity Factors(https://arxiv.org/abs/2411.18050)
Keywords: security
Abstract: The electricity grid's resiliency and climate change strongly impact one another due to an array of technical and policy-related decisions that impact both. This paper introduces a physics-informed machine learning-based framework to enhance the grid's resiliency. Specifically, when encountering disruptive events, this paper designs remedial control actions to prevent blackouts. The proposed Physics-Guided Reinforcement Learning (PG-RL) framework determines effective real-time remedial line-switching actions, considering their impact on power balance, system security, and grid reliability. To identify an effective blackout mitigation policy, PG-RL leverages power-flow sensitivity factors to guide the RL exploration during agent training. Comprehensive evaluations using the Grid2Op platform demonstrate that incorporating physical signals into RL significantly improves resource utilization within electric grids and achieves better blackout mitigation policies - both of which are critical in addressing climate change.
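
Illustration (not from the paper): one generic way to let a physical signal guide RL exploration is to replace the uniform random draw in epsilon-greedy action selection with a draw weighted by a per-action score. The sketch below assumes that shape. The function `select_action` and the idea of using sensitivity factors directly as sampling weights are assumptions made for illustration, not the PG-RL algorithm itself.

```python
import random


def select_action(q_values, sensitivity_scores, epsilon, rng):
    """Epsilon-greedy selection with score-weighted exploration.

    q_values: estimated value of each action (e.g. each line switch).
    sensitivity_scores: non-negative weights, assumed here to be derived
        from power-flow sensitivity factors (hypothetical input).
    """
    if rng.random() < epsilon:
        # Explore: sample actions in proportion to their scores
        # instead of uniformly at random.
        actions = list(range(len(q_values)))
        return rng.choices(actions, weights=sensitivity_scores, k=1)[0]
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
greedy = select_action([0.1, 0.9, 0.3], [1.0, 1.0, 1.0], epsilon=0.0, rng=rng)
guided = select_action([0.1, 0.9, 0.3], [0.0, 0.0, 1.0], epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the call always exploits, and with all exploration weight on one action the exploratory draw always returns that action.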

Title: ORIS: Online Active Learning Using Reinforcement Learning-based Inclusive Sampling for Robust Streaming Analytics System

Authors: Rahul Pandey, Ziwei Zhu, Hemant Purohit
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18060
Pdf URL: https://arxiv.org/pdf/2411.18060
Copy Paste: [[2411.18060]] ORIS: Online Active Learning Using Reinforcement Learning-based Inclusive Sampling for Robust Streaming Analytics System(https://arxiv.org/abs/2411.18060)
Keywords: robust
Abstract: Effective labeled data collection plays a critical role in developing and fine-tuning robust streaming analytics systems. However, continuously labeling documents to filter relevant information poses significant challenges like limited labeling budget or lack of high-quality labels. There is a need for efficient human-in-the-loop machine learning (HITL-ML) design to improve streaming analytics systems. One particular HITL-ML approach is online active learning, which involves iteratively selecting a small set of the most informative documents for labeling to enhance the ML model performance. The performance of such algorithms can be affected by human errors in labeling. To address these challenges, we propose ORIS, a method to perform Online active learning using Reinforcement learning-based Inclusive Sampling of documents for labeling. ORIS aims to create a novel Deep Q-Network-based strategy to sample incoming documents that minimize human errors in labeling and enhance the ML model performance. We evaluate the ORIS method on emotion recognition tasks, and it outperforms traditional baselines in terms of both human labeling performance and the ML model performance.

Title: Lightweight Gaze Estimation Model Via Fusion Global Information

Authors: Zhang Cheng, Yanxia Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18064
Pdf URL: https://arxiv.org/pdf/2411.18064
Copy Paste: [[2411.18064]] Lightweight Gaze Estimation Model Via Fusion Global Information(https://arxiv.org/abs/2411.18064)
Keywords: transformer
Abstract: Deep learning-based appearance gaze estimation methods are gaining popularity due to their high accuracy and fewer constraints from the environment. However, existing high-precision models often rely on deeper networks, leading to problems such as large parameter counts, long training time, and slow convergence. To address this issue, this paper proposes a novel lightweight gaze estimation model, FGI-Net (Fusion Global Information). The model fuses global information into the CNN, effectively compensating for the need of multi-layer convolution and pooling to indirectly capture global information, while reducing the complexity of the model and improving the model accuracy and convergence speed. To validate the performance of the model, a large number of experiments are conducted, comparing accuracy with existing classical models and lightweight models, comparing convergence speed with models of different architectures, and conducting ablation experiments. Experimental results show that compared with GazeCaps, the latest gaze estimation model, FGI-Net achieves a smaller angle error with 87.1% and 79.1% reduction in parameters and FLOPs, respectively (MPIIFaceGaze is 3.74°, EyeDiap is 5.15°, Gaze360 is 10.50° and RT-Gene is 6.02°). Moreover, compared with different architectural models such as CNN and Transformer, FGI-Net is able to quickly converge to a higher accuracy range with fewer iterations of training. When achieving optimal accuracy on the Gaze360 and EyeDiap datasets, the FGI-Net model has 25% and 37.5% fewer iterations of training compared to GazeTR, respectively.

Title: GLS: Geometry-aware 3D Language Gaussian Splatting

Authors: Jiaxiong Qiu, Liu Liu, Zhizhong Su, Tianwei Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18066
Pdf URL: https://arxiv.org/pdf/2411.18066
Copy Paste: [[2411.18066]] GLS: Geometry-aware 3D Language Gaussian Splatting(https://arxiv.org/abs/2411.18066)
Keywords: segmentation
Abstract: Recently, 3D Gaussian Splatting (3DGS) has achieved significant performance on indoor surface reconstruction and open-vocabulary segmentation. This paper presents GLS, a unified framework of surface reconstruction and open-vocabulary segmentation based on 3DGS. GLS extends two fields by exploring the correlation between them. For indoor surface reconstruction, we introduce surface normal prior as a geometric cue to guide the rendered normal, and use the normal error to optimize the rendered depth. For open-vocabulary segmentation, we employ 2D CLIP features to guide instance features and utilize DEVA masks to enhance their view consistency. Extensive experiments demonstrate the effectiveness of jointly optimizing surface reconstruction and open-vocabulary segmentation, where GLS surpasses state-of-the-art approaches of each task on MuSHRoom, ScanNet++, and LERF-OVS datasets. Code will be available at this https URL.

Title: PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion

Authors: Gwanghyun Kim, Suh Yoon Jeon, Seunggyu Lee, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18068
Pdf URL: https://arxiv.org/pdf/2411.18068
Copy Paste: [[2411.18068]] PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion(https://arxiv.org/abs/2411.18068)
Keywords: diffusion
Abstract: Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis. Project page: this https URL

Title: Large Scale Evaluation of Deep Learning-based Explainable Solar Flare Forecasting Models with Attribution-based Proximity Analysis

Authors: Temitope Adeyeha, Chetraj Pandey, Berkay Aydin
Subjects: cs.LG, astro-ph.SR, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2411.18070
Pdf URL: https://arxiv.org/pdf/2411.18070
Copy Paste: [[2411.18070]] Large Scale Evaluation of Deep Learning-based Explainable Solar Flare Forecasting Models with Attribution-based Proximity Analysis(https://arxiv.org/abs/2411.18070)
Keywords: interpretability
Abstract: Accurate and reliable predictions of solar flares are essential due to their potentially significant impact on Earth and space-based infrastructure. Although deep learning models have shown notable predictive capabilities in this domain, current evaluations often focus on accuracy while neglecting interpretability and reliability--factors that are especially critical in operational settings. To address this gap, we propose a novel proximity-based framework for analyzing post hoc explanations to assess the interpretability of deep learning models for solar flare prediction. Our study compares two models trained on full-disk line-of-sight (LoS) magnetogram images to predict ≥M-class solar flares within a 24-hour window. We employ the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) method to generate attribution maps from these models, which we then analyze to gain insights into their decision-making processes. To support the evaluation of explanations in operational systems, we introduce a proximity-based metric that quantitatively assesses the accuracy and relevance of local explanations when regions of interest are known. Our findings indicate that the models' predictions align with active region characteristics to varying degrees, offering valuable insights into their behavior. This framework enhances the evaluation of model interpretability in solar flare forecasting and supports the development of more transparent and reliable operational systems.

Title: Dual-Level Boost Network for Long-Tail Prohibited Items Detection in X-ray Security Inspection

Authors: Renshuai Tao, Haoyu Wang, Wei Wang, Yunchao Wei, Yao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18078
Pdf URL: https://arxiv.org/pdf/2411.18078
Copy Paste: [[2411.18078]] Dual-Level Boost Network for Long-Tail Prohibited Items Detection in X-ray Security Inspection(https://arxiv.org/abs/2411.18078)
Keywords: security
Abstract: The detection of prohibited items in X-ray security inspections is vital for ensuring public safety. However, the long-tail distribution of item categories, where certain prohibited items are far less common, poses a big challenge for detection models, as rare categories often lack sufficient training data. Existing methods struggle to classify these rare items accurately due to this imbalance. In this paper, we propose a Dual-level Boost Network (DBNet) specifically designed to overcome these challenges in X-ray security screening. Our approach introduces two key innovations: (1) a specific data augmentation strategy employing Poisson blending, inspired by the characteristics of X-ray images, to generate realistic synthetic instances of rare items which can effectively mitigate data imbalance; and (2) a context-aware feature enhancement module that captures the spatial and semantic interactions between objects and their surroundings, enhancing classification accuracy for underrepresented categories. Extensive experimental results demonstrate that DBNet improves detection performance for tail categories, outperforming state-of-the-art methods in X-ray security inspection scenarios by a large margin of 17.2%, thereby ensuring enhanced public safety.

Title: Training Noise Token Pruning

Authors: Mingxing Rao, Bohan Jiang, Daniel Moyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18092
Pdf URL: https://arxiv.org/pdf/2411.18092
Copy Paste: [[2411.18092]] Training Noise Token Pruning(https://arxiv.org/abs/2411.18092)
Keywords: transformer
Abstract: In this work we present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token dropping condition to continuous additive noise, providing smooth optimization in training, while retaining discrete dropping computational gains in deployment settings. We provide theoretical connections to Rate-Distortion literature, and empirical evaluations on the ImageNet dataset using ViT and DeiT architectures demonstrating TNT's advantages over previous pruning methods.

Title: Comprehensive Kernel Safety in the Spectre Era: Mitigations and Performance Evaluation (Extended Version)

Authors: Davide Davoli, Martin Avanzini, Tamara Rezk
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.18094
Pdf URL: https://arxiv.org/pdf/2411.18094
Copy Paste: [[2411.18094]] Comprehensive Kernel Safety in the Spectre Era: Mitigations and Performance Evaluation (Extended Version)(https://arxiv.org/abs/2411.18094)
Keywords: attack
Abstract: The efficacy of address space layout randomization has been formally demonstrated in a shared-memory model by Abadi et al., contingent on specific assumptions about victim programs. However, modern operating systems, implementing layout randomization in the kernel, diverge from these assumptions and operate on a separate memory model with communication through system calls. In this work, we relax Abadi et al.'s language assumptions while demonstrating that layout randomization offers a comparable safety guarantee in a system with memory separation. However, in practice, speculative execution and side-channels are recognized threats to layout randomization. We show that kernel safety cannot be restored for attackers capable of using side-channels and speculative execution, and introduce enforcement mechanisms that can guarantee speculative kernel safety for safe system calls in the Spectre era. We implement two suitable mechanisms and we use them to compile the Linux kernel in order to evaluate their performance overhead.

Title: Training and Evaluating Language Models with Template-based Data Generation

Authors: Yifan Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18104
Pdf URL: https://arxiv.org/pdf/2411.18104
Copy Paste: [[2411.18104]] Training and Evaluating Language Models with Template-based Data Generation(https://arxiv.org/abs/2411.18104)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly in mathematical problem-solving, due in part to the scarcity of large-scale, high-quality, domain-specific datasets necessary for training sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that leverages LLMs (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset comprising over 7 million synthetically generated grade school math problems--each accompanied by code-based and natural language solutions--with the potential to generate an effectively unlimited number more. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually infinite data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at this https URL. The code is available at this https URL.
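
Illustration (not from the paper): the core idea of a parameterized template that yields a problem together with a code-based and a natural-language solution can be sketched as follows. The template text, the parameter ranges, and the helper names `make_problem` and `run_code_solution` are all hypothetical. In the paper the meta-templates are generated by GPT-4, whereas this one is hand-written.

```python
import random


def make_problem(rng):
    """Instantiate one hand-written template with random parameters."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    item = rng.choice(["apples", "pencils", "stickers"])
    start = rng.randint(10, 50)
    given = rng.randint(1, start)  # never give away more than owned
    left = start - given
    question = (
        f"{name} has {start} {item} and gives away {given}. "
        f"How many {item} does {name} have left?"
    )
    # Code-based solution: a snippet that computes the answer when run.
    code_solution = f"result = {start} - {given}"
    # Natural-language solution built from the same parameters.
    text_solution = f"{start} - {given} = {left}, so {name} has {left} {item} left."
    return {
        "question": question,
        "code": code_solution,
        "text": text_solution,
        "answer": left,
    }


def run_code_solution(problem):
    """Execute the code-based solution and return its result."""
    scope = {}
    exec(problem["code"], {}, scope)
    return scope["result"]


rng = random.Random(0)
problems = [make_problem(rng) for _ in range(3)]
```

Because every solution is derived from the sampled parameters, each generated problem can be checked by running its code solution and comparing the result with the stored answer.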

Title: Training Data Synthesis with Difficulty Controlled Diffusion Model

Authors: Zerun Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18109
Pdf URL: https://arxiv.org/pdf/2411.18109
Copy Paste: [[2411.18109]] Training Data Synthesis with Difficulty Controlled Diffusion Model(https://arxiv.org/abs/2411.18109)
Keywords: diffusion, generative
Abstract: Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources with low costs. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. Therefore, it is becoming inevitable to include existing synthetic images in the unlabeled data for SSL. How this kind of contamination will affect SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images for SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and found that they struggle to benefit from unlabeled synthetic images and are sometimes even negatively affected. To this end, we propose RSMatch, a novel SSL method specifically designed to handle the challenges of RS-SSL. RSMatch effectively identifies unlabeled synthetic data and further utilizes them for improvement. Extensive experimental results show that RSMatch can turn synthetic unlabeled data from 'obstacles' into 'resources.' The effectiveness is further verified through ablation studies and visualization.

Title: When Large Vision-Language Models Meet Person Re-Identification

Authors: Qizao Wang, Bin Li, Xiangyang Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18111
Pdf URL: https://arxiv.org/pdf/2411.18111
Copy Paste: [[2411.18111]] When Large Vision-Language Models Meet Person Re-Identification(https://arxiv.org/abs/2411.18111)
Keywords: extraction, generative, large language model
Abstract: Large Vision-Language Models (LVLMs) that incorporate visual models and Large Language Models (LLMs) have achieved impressive results across various cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one pedestrian semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the pedestrian identity representation. Our framework integrates the semantic understanding and generation capabilities of LVLMs into end-to-end ReID training, allowing LVLMs to capture rich semantic cues from pedestrian images during both training and inference. Our method achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID and offering a promising direction for future research.

Title: Spectral-Spatial Transformer with Active Transfer Learning for Hyperspectral Image Classification

Authors: Muhammad Ahmad, Manuel Mazzara, Salvatore Distefano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18115
Pdf URL: https://arxiv.org/pdf/2411.18115
Copy Paste: [[2411.18115]] Spectral-Spatial Transformer with Active Transfer Learning for Hyperspectral Image Classification(https://arxiv.org/abs/2411.18115)
Keywords: robust, transformer
Abstract: The classification of hyperspectral images (HSI) is a challenging task due to the high spectral dimensionality and limited labeled data typically available for training. In this study, we propose a novel multi-stage active transfer learning (ATL) framework that integrates a Spatial-Spectral Transformer (SST) with an active learning process for efficient HSI classification. Our approach leverages a pre-trained (initially trained) SST model, fine-tuned iteratively on newly acquired labeled samples using an uncertainty-diversity (Spatial-Spectral Neighborhood Diversity) querying mechanism. This mechanism identifies the most informative and diverse samples, thereby optimizing the transfer learning process to reduce both labeling costs and model uncertainty. We further introduce a dynamic freezing strategy, selectively freezing layers of the SST model to minimize computational overhead while maintaining adaptability to spectral variations in new data. One of the key innovations in our work is the self-calibration of spectral and spatial attention weights, achieved through uncertainty-guided active learning. This not only enhances the model's robustness in handling dynamic and disjoint spectral profiles but also improves generalization across multiple HSI datasets. Additionally, we present a diversity-promoting sampling strategy that ensures the selected samples span distinct spectral regions, preventing overfitting to particular spectral classes. Experiments on benchmark HSI datasets demonstrate that the SST-ATL framework significantly outperforms existing CNN and SST-based methods, offering superior accuracy, efficiency, and computational performance. The source code can be accessed at this https URL.

Title: A Machine Learning-based Framework towards Assessment of Decision-Makers' Biases

Authors: Wanxue Dong, Maria De-arteaga, Maytal Saar-Tsechansky
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18122
Pdf URL: https://arxiv.org/pdf/2411.18122
Copy Paste: [[2411.18122]] A Machine Learning-based Framework towards Assessment of Decision-Makers' Biases(https://arxiv.org/abs/2411.18122)
Keywords: fair
Abstract: Biased human decisions have consequential impacts across various domains, yielding unfair treatment of individuals and resulting in suboptimal outcomes for organizations and society. In recognition of this fact, organizations regularly design and deploy interventions aimed at mitigating these biases. However, measuring human decision biases remains an important but elusive task. Organizations are frequently concerned with mistaken decisions disproportionately affecting one group. In practice, however, this is typically not possible to assess due to the scarcity of a gold standard: a label that indicates what the correct decision would have been. In this work, we propose a machine learning-based framework to assess bias in human-generated decisions when gold standard labels are scarce. We provide theoretical guarantees and empirical evidence demonstrating the superiority of our method over existing alternatives. This proposed methodology establishes a foundation for transparency in human decision-making, carrying substantial implications for managerial duties, and offering potential for alleviating algorithmic biases when human decisions are used as labels to train algorithms.

Title: Curriculum Demonstration Selection for In-Context Learning

Authors: Duc Anh Vu, Nguyen Tran Cong Duy, Xiaobao Wu, Hoang Minh Nhat, Du Mingzhe, Nguyen Thanh Thong, Anh Tuan Luu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18126
Pdf URL: https://arxiv.org/pdf/2411.18126
Copy Paste: [[2411.18126]] Curriculum Demonstration Selection for In-Context Learning(https://arxiv.org/abs/2411.18126)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown strong in-context learning (ICL) abilities with a few demonstrations. However, one critical challenge is how to select demonstrations to elicit the full potential of LLMs. In this paper, we propose Curriculum Demonstration Selection (CDS), a novel demonstration selection method for ICL. Instead of merely using similarity, CDS additionally partitions samples by their complexity measurements. Following curriculum learning, CDS then selects demonstrations from easy to difficult. Thus the selected demonstrations cover a wide range of difficulty levels, enabling LLMs to learn from varied complexities within the training set. Experiments demonstrate that our CDS consistently outperforms baseline methods, achieving notable improvements across nine LLMs on three benchmarks. Moreover, CDS proves especially effective in enhancing LLM performance in solving challenging problems.
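Sketch (not from the paper): the abstract describes partitioning candidate demonstrations by complexity and then selecting them from easy to difficult, in addition to using similarity. The Python example below is one way this could look, under stated assumptions: each candidate already carries a numeric complexity score, similarity is a simple word-overlap (Jaccard) measure, and one demonstration is taken per complexity level. The function names and the toy pool are hypothetical; the abstract does not specify the complexity measure or the retriever.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (a stand-in retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_demonstrations(pool, query, num_levels=3):
    """Pick one demonstration per complexity level, ordered easy to hard.

    pool: list of (text, complexity) pairs; lower complexity means easier.
    """
    if not pool:
        return []
    ranked = sorted(pool, key=lambda d: d[1])
    size = -(-len(ranked) // num_levels)  # ceiling division
    buckets = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    # Within each level, keep the candidate most similar to the query.
    return [max(b, key=lambda d: jaccard(d[0], query)) for b in buckets]


if __name__ == "__main__":
    pool = [
        ("add two numbers", 1),
        ("subtract two numbers", 2),
        ("solve a linear equation", 3),
        ("solve two linear equations", 4),
        ("integrate a polynomial", 5),
        ("solve a differential equation", 6),
    ]
    for text, level in select_demonstrations(pool, "solve a quadratic equation"):
        print(level, text)
```

Since the buckets are built from the sorted pool, the returned list is already in easy-to-hard order and can be concatenated into the prompt as is.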

Title: ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

Authors: Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18135
Pdf URL: https://arxiv.org/pdf/2411.18135
Copy Paste: [[2411.18135]] ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts(https://arxiv.org/abs/2411.18135)
Keywords: diffusion
Abstract: Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments demonstrate that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, as demonstrated through both qualitative and quantitative evaluations on the T3Bench benchmark suite.

Title: Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Authors: Jingming Liu, Yumeng Li, Boyuan Xiao, Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18142
Pdf URL: https://arxiv.org/pdf/2411.18142
Copy Paste: [[2411.18142]] Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models(https://arxiv.org/abs/2411.18142)
Keywords: large language model
Abstract: There have been recent efforts to extend the Chain-of-Thought (CoT) paradigm to Multimodal Large Language Models (MLLMs) by finding visual clues in the input scene, advancing the visual reasoning ability of MLLMs. However, current approaches are specially designed for the tasks where clue finding plays a major role in the whole reasoning process, leading to the difficulty in handling complex visual scenes where clue finding does not actually simplify the whole reasoning task. To deal with this challenge, we propose a new visual reasoning paradigm enabling MLLMs to autonomously modify the input scene to new ones based on its reasoning status, such that CoT is reformulated as conducting simple closed-loop decision-making and reasoning steps under a sequence of imagined visual scenes, leading to natural and general CoT construction. To implement this paradigm, we introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform based on their native reasoning ability without specific training. We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement, challenging the reasoning ability beyond clue finding. The results verify that while existing techniques fall short, our approach enables MLLMs to effectively reason step by step through autonomous imagination. Project page: this https URL.

Title: Harnessing Large Language Models for Seed Generation in Greybox Fuzzing

Authors: Wenxuan Shi, Yunhang Zhang, Xinyu Xing, Jun Xu
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2411.18143
Pdf URL: https://arxiv.org/pdf/2411.18143
Copy Paste: [[2411.18143]] Harnessing Large Language Models for Seed Generation in Greybox Fuzzing(https://arxiv.org/abs/2411.18143)
Keywords: large language model
Abstract: Greybox fuzzing has emerged as a preferred technique for discovering software bugs, striking a balance between efficiency and depth of exploration. While research has focused on improving fuzzing techniques, the importance of high-quality initial seeds remains critical yet often overlooked. Existing methods for seed generation are limited, especially for programs with non-standard or custom input formats. Large Language Models (LLMs) have revolutionized numerous domains, showcasing unprecedented capabilities in understanding and generating complex patterns across various fields of knowledge. This paper introduces SeedMind, a novel system that leverages LLMs to boost greybox fuzzing through intelligent seed generation. Unlike previous approaches, SeedMind employs LLMs to create test case generators rather than directly producing test cases. Our approach implements an iterative, feedback-driven process that guides the LLM to progressively refine test case generation, aiming for increased code coverage depth and breadth. In developing SeedMind, we addressed key challenges including input format limitations, context window constraints, and ensuring consistent, progress-aware behavior. Intensive evaluations with real-world applications show that SeedMind effectively harnesses LLMs to generate high-quality test cases and facilitate fuzzing in bug finding, presenting utility comparable to human-created seeds and significantly outperforming the existing LLM-based solutions.

Title: MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Authors: Thai-Binh Nguyen, Alexander Waibel
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.18152
Pdf URL: https://arxiv.org/pdf/2411.18152
Copy Paste: [[2411.18152]] MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models(https://arxiv.org/abs/2411.18152)
Keywords: robust
Abstract: Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model's robustness and potential for practical applications.

Title: A survey on cutting-edge relation extraction techniques based on language models

Authors: Jose A. Diaz-Garcia, Julio Amador Diaz Lopez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18157
Pdf URL: https://arxiv.org/pdf/2411.18157
Copy Paste: [[2411.18157]] A survey on cutting-edge relation extraction techniques based on language models(https://arxiv.org/abs/2411.18157)
Keywords: extraction, large language model
Abstract: This comprehensive survey delves into the latest advancements in Relation Extraction (RE), a pivotal task in natural language processing essential for applications across biomedical, financial, and legal sectors. This study highlights the evolution and current state of RE techniques by analyzing 137 papers presented at the Association for Computational Linguistics (ACL) conferences over the past four years, focusing on models that leverage language models. Our findings underscore the dominance of BERT-based methods in achieving state-of-the-art results for RE while also noting the promising capabilities of emerging large language models (LLMs) like T5, especially in few-shot relation extraction scenarios where they excel in identifying previously unseen relations.

Title: Type-R: Automatically Retouching Typos for Text-to-Image Generation

Authors: Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seichi Uchida, Kota Yamaguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18159
Pdf URL: https://arxiv.org/pdf/2411.18159
Copy Paste: [[2411.18159]] Type-R: Automatically Retouching Typos for Text-to-Image Generation(https://arxiv.org/abs/2411.18159)
Keywords: diffusion
Abstract: While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.

Title: SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment

Authors: Jie Wang, Yichen Wang, Zhilin Zhang, Jianhao Zeng, Kaidi Wang, Zhiyang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18162
Pdf URL: https://arxiv.org/pdf/2411.18162
Copy Paste: [[2411.18162]] SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment(https://arxiv.org/abs/2411.18162)
Keywords: generative, large language model
Abstract: With the strong expressive capabilities of Large Language Models (LLMs), generative models effectively capture sentiment structures and deep semantics; however, challenges remain in fine-grained sentiment classification across multilingual and complex contexts. To address this, we propose the Sentiment Cross-Lingual Recognition and Logic Framework (SentiXRL), which incorporates two modules: an emotion retrieval enhancement module to improve sentiment classification accuracy in complex contexts through historical dialogue and logical reasoning, and a self-circulating analysis negotiation mechanism (SANM) to facilitate autonomous decision-making within a single model for classification. We have validated SentiXRL's superiority on multiple standard datasets, outperforming existing models on CPED and CH-SIMS, and achieving overall better performance on MELD, EmoryNLP, and IEMOCAP. Notably, we unified labels across several fine-grained sentiment annotation datasets and conducted category confusion experiments, revealing challenges and impacts of class imbalance in standard datasets.

Title: RPEE-HEADS: A Novel Benchmark for Pedestrian Head Detection in Crowd Videos

Authors: Mohamad Abubaker, Zubayda Alsadder, Hamed Abdelhaq, Maik Boltes, Ahmed Alia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18164
Pdf URL: https://arxiv.org/pdf/2411.18164
Copy Paste: [[2411.18164]] RPEE-HEADS: A Novel Benchmark for Pedestrian Head Detection in Crowd Videos(https://arxiv.org/abs/2411.18164)
Keywords: transformer
Abstract: The automatic detection of pedestrian heads in crowded environments is essential for crowd analysis and management tasks, particularly in high-risk settings such as railway platforms and event entrances. These environments, characterized by dense crowds and dynamic movements, are underrepresented in public datasets, posing challenges for existing deep learning models. To address this gap, we introduce the Railway Platforms and Event Entrances-Heads (RPEE-Heads) dataset, a novel, diverse, high-resolution, and accurately annotated resource. It includes 109,913 annotated pedestrian heads across 1,886 images from 66 video recordings, with an average of 56.2 heads per image. Annotations include bounding boxes for visible head regions. In addition to introducing the RPEE-Heads dataset, this paper evaluates eight state-of-the-art object detection algorithms using the RPEE-Heads dataset and analyzes the impact of head size on detection accuracy. The experimental results show that You Only Look Once v9 and Real-Time Detection Transformer outperform the other algorithms, achieving mean average precisions of 90.7% and 90.8%, with inference times of 11 and 14 milliseconds, respectively. Moreover, the findings underscore the need for specialized datasets like RPEE-Heads for training and evaluating accurate models for head detection in railway platforms and event entrances. The dataset and pretrained models are available at this https URL.

Title: KAN See Your Face

Authors: Dong Han, Yong Li, Joachim Denzler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18165
Pdf URL: https://arxiv.org/pdf/2411.18165
Copy Paste: [[2411.18165]] KAN See Your Face(https://arxiv.org/abs/2411.18165)
Keywords: secure, privacy, protect, attack, robust
Abstract: With the advancement of face reconstruction (FR) systems, privacy-preserving face recognition (PPFR) has gained popularity for its secure face recognition, enhanced facial privacy protection, and robustness to various attacks. Besides, specific models and algorithms are proposed for face embedding protection by mapping embeddings to a secure space. However, there is a lack of studies on investigating and evaluating the possibility of extracting face images from embeddings of those systems, especially for PPFR. In this work, we introduce the first approach to exploit Kolmogorov-Arnold Network (KAN) for conducting embedding-to-face attacks against state-of-the-art (SOTA) FR and PPFR systems. Face embedding mapping (FEM) models are proposed to learn the distribution mapping relation between the embeddings from the initial domain and target domain. In comparison with Multi-Layer Perceptrons (MLP), we provide two variants, FEM-KAN and FEM-MLP, for efficient non-linear embedding-to-embedding mapping in order to reconstruct realistic face images from the corresponding face embedding. To verify our methods, we conduct extensive experiments with various PPFR and FR models. We also measure reconstructed face images with different metrics to evaluate the image quality. Through comprehensive experiments, we demonstrate the effectiveness of FEMs in accurate embedding mapping and face reconstruction.

Title: PDZSeg: Adapting the Foundation Model for Dissection Zone Segmentation with Visual Prompts in Robot-assisted Endoscopic Submucosal Dissection

Authors: Mengya Xu, Wenjin Mo, Guankun Wang, Huxin Gao, An Wang, Zhen Li, Xiaoxiao Yang, Hongliang Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18169
Pdf URL: https://arxiv.org/pdf/2411.18169
Copy Paste: [[2411.18169]] PDZSeg: Adapting the Foundation Model for Dissection Zone Segmentation with Visual Prompts in Robot-assisted Endoscopic Submucosal Dissection(https://arxiv.org/abs/2411.18169)
Keywords: robust, segmentation
Abstract: Purpose: Endoscopic surgical environments present challenges for dissection zone segmentation due to unclear boundaries between tissue types, leading to segmentation errors where models misidentify or overlook edges. This study aims to provide precise dissection zone suggestions during endoscopic submucosal dissection (ESD) procedures, enhancing ESD safety. Methods: We propose the Prompted-based Dissection Zone Segmentation (PDZSeg) model, designed to leverage diverse visual prompts such as scribbles and bounding boxes. By overlaying these prompts onto images and fine-tuning a foundational model on a specialized dataset, our approach improves segmentation performance and user experience through flexible input methods. Results: The PDZSeg model was validated using three experimental setups: in-domain evaluation, variability in visual prompt availability, and robustness assessment. Using the ESD-DZSeg dataset, results show that our method outperforms state-of-the-art segmentation approaches. This is the first study to integrate visual prompt design into dissection zone segmentation. Conclusion: The PDZSeg model effectively utilizes visual prompts to enhance segmentation performance and user experience, supported by the novel ESD-DZSeg dataset as a benchmark for dissection zone segmentation in ESD. Our work establishes a foundation for future research.

Title: Machine Unlearning reveals that the Gender-based Violence Victim Condition can be detected from Speech in a Speaker-Agnostic Setting

Authors: Emma Reyner-Fuentes, Esther Rituerto-Gonzalez, Carmen Pelaez-Moreno
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18177
Pdf URL: https://arxiv.org/pdf/2411.18177
Copy Paste: [[2411.18177]] Machine Unlearning reveals that the Gender-based Violence Victim Condition can be detected from Speech in a Speaker-Agnostic Setting(https://arxiv.org/abs/2411.18177)
Keywords: robust
Abstract: This study addresses the critical issue of gender-based violence's (GBV) impact on women's mental health. GBV, encompassing physical and sexual aggression, often results in long-lasting adverse effects for the victims, including anxiety, depression, post-traumatic stress disorder (PTSD), and substance abuse. Artificial Intelligence (AI)-based speech technologies have proven valuable for mental health assessments. However, these technologies experience performance challenges when confronted with speakers whose data has not been used for training. Our research presents a novel approach to speaker-agnostic detection of the gender-based violence victim condition (GBVVC), focusing on the development of robust AI models capable of generalization across diverse speakers. Leveraging advanced deep learning models and domain-adversarial training techniques, we minimize speaker identity's influence, achieving a 26.95% relative reduction in speaker identification ability while enhancing the GBVVC detection by a 6.37% relative improvement in the accuracy. This shows that models can focus on discriminative paralinguistic biomarkers that enhance the GBVVC prediction, and reduce the subject-specific traits' impact. Additionally, our model's predictions moderately correlate with pre-clinical PTSD symptoms, emphasizing the link between GBV and mental health. This work paves the way for AI-powered tools to aid mental health professionals in addressing this societal issue, offering a promising baseline for further research.

Title: InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks

Authors: Xinyao Zheng, Husheng Han, Shangyi Shi, Qiyan Fang, Zidong Du, Qi Guo, Xing Hu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.18191
Pdf URL: https://arxiv.org/pdf/2411.18191
Copy Paste: [[2411.18191]] InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks(https://arxiv.org/abs/2411.18191)
Keywords: security, privacy, attack, steal, large language model
Abstract: Large language models (LLMs) possess extensive knowledge and question-answering capabilities, having been widely deployed in privacy-sensitive domains like finance and medical consultation. During LLM inference, cache-sharing methods are commonly employed to enhance efficiency by reusing cached states or responses for the same or similar inference requests. However, we identify that these cache mechanisms pose a risk of private input leakage, as the caching can result in observable variations in response times, making them a strong candidate for a timing-based attack hint. In this study, we propose a novel timing-based side-channel attack to execute input theft in LLM inference. The cache-based attack faces the challenge of constructing candidate inputs in a large search space to hit and steal cached user queries. To address this challenge, we propose two primary components. The input constructor employs machine learning techniques and LLM-based approaches for vocabulary correlation learning while implementing optimized search mechanisms for generalized input construction. The time analyzer implements statistical time fitting with outlier elimination to identify cache hit patterns, continuously providing feedback to refine the constructor's search strategy. We conduct experiments across two cache mechanisms and the results demonstrate that our approach consistently attains high attack success rates in various applications. Our work highlights the security vulnerabilities associated with performance optimizations, underscoring the necessity of prioritizing privacy and security alongside enhancements in LLM inference.
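Sketch (not from the paper): the abstract mentions a time analyzer that uses statistical fitting with outlier elimination to recognize cache hits from response times. The Python example below only illustrates why cached responses are distinguishable on synthetic latency numbers. It assumes a median-absolute-deviation (MAD) filter and a fixed ratio threshold, both chosen here for illustration; the paper's actual fitting procedure is not given in the abstract, and the function names are hypothetical.

```python
import statistics


def drop_outliers(samples, k=3.0):
    """Keep samples within k median absolute deviations of the median."""
    med = statistics.median(samples)
    mad = statistics.median([abs(x - med) for x in samples])
    if mad == 0:
        return list(samples)
    return [x for x in samples if abs(x - med) <= k * mad]


def looks_like_cache_hit(candidate_ms, baseline_ms, ratio=0.7):
    """Flag a likely hit when the filtered mean latency is well below baseline."""
    cand = statistics.mean(drop_outliers(candidate_ms))
    base = statistics.mean(drop_outliers(baseline_ms))
    return cand < ratio * base


if __name__ == "__main__":
    # Synthetic latencies in milliseconds; the large values mimic network spikes.
    baseline = [100, 102, 98, 101, 99, 400]
    cached = [40, 42, 41, 39, 300]
    uncached = [99, 101, 100, 103, 97]
    print(looks_like_cache_hit(cached, baseline))
    print(looks_like_cache_hit(uncached, baseline))
```

The same measurement is useful defensively: if a service's cached and uncached latencies separate this cleanly after filtering, the cache is observable from outside.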

Title: Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance

Authors: Dimitris Michailidis, Willem Röpke, Diederik M. Roijers, Sennay Ghebreab, Fernando P. Santos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18195
Pdf URL: https://arxiv.org/pdf/2411.18195
Copy Paste: [[2411.18195]] Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance(https://arxiv.org/abs/2411.18195)
Keywords: fair
Abstract: Multi-Objective Reinforcement Learning (MORL) aims to learn a set of policies that optimize trade-offs between multiple, often conflicting objectives. MORL is computationally more complex than single-objective RL, particularly as the number of objectives increases. Additionally, when objectives involve the preferences of agents or groups, ensuring fairness is socially desirable. This paper introduces a principled algorithm that incorporates fairness into MORL while improving scalability to many-objective problems. We propose using Lorenz dominance to identify policies with equitable reward distributions and introduce $\lambda$-Lorenz dominance to enable flexible fairness preferences. We release a new, large-scale real-world transport planning environment and demonstrate that our method encourages the discovery of fair policies, showing improved scalability in two large cities (Xi'an and Amsterdam). Our methods outperform common multi-objective approaches, particularly in high-dimensional objective spaces.
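Sketch (not from the paper): the abstract relies on Lorenz dominance to prefer equitable reward distributions. The Python example below shows the standard definition for maximized objectives: sort each return vector in ascending order, take cumulative sums to obtain its Lorenz vector, and compare the Lorenz vectors with ordinary Pareto dominance. The function names are chosen here for illustration, and the paper's λ-Lorenz relaxation is not covered because the abstract does not define it.

```python
from itertools import accumulate


def lorenz_vector(returns):
    """Cumulative sums of the returns sorted from worst-off to best-off."""
    return list(accumulate(sorted(returns)))


def pareto_dominates(u, v):
    """u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def lorenz_dominates(u, v):
    """u Lorenz-dominates v if its Lorenz vector Pareto-dominates v's."""
    return pareto_dominates(lorenz_vector(u), lorenz_vector(v))


if __name__ == "__main__":
    fair, unfair = (4, 4), (7, 1)
    print(lorenz_vector(fair), lorenz_vector(unfair))
    print(pareto_dominates(fair, unfair), lorenz_dominates(fair, unfair))
```

In the usage example, the two policies have the same total return and neither Pareto-dominates the other, yet the evenly distributed one Lorenz-dominates. This is how the criterion shrinks the set of candidate policies toward fair ones.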

Title: Semantic Edge Computing and Semantic Communications in 6G Networks: A Unifying Survey and Research Challenges

Authors: Milin Zhang, Mohammad Abdi, Venkat R. Dasari, Francesco Restuccia
Subjects: cs.LG, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2411.18199
Pdf URL: https://arxiv.org/pdf/2411.18199
Copy Paste: [[2411.18199]] Semantic Edge Computing and Semantic Communications in 6G Networks: A Unifying Survey and Research Challenges(https://arxiv.org/abs/2411.18199)
Keywords: robust
Abstract: Semantic Edge Computing (SEC) and Semantic Communications (SemComs) have been proposed as viable approaches to achieve real-time edge-enabled intelligence in sixth-generation (6G) wireless networks. On one hand, SemCom leverages the strength of Deep Neural Networks (DNNs) to encode and communicate the semantic information only, while making it robust to channel distortions by compensating for wireless effects. Ultimately, this leads to an improvement in the communication efficiency. On the other hand, SEC has leveraged distributed DNNs to divide the computation of a DNN across different devices based on their computational and networking constraints. Although significant progress has been made in both fields, the literature lacks a systematic view that connects them. In this work, we fill this gap by unifying the SEC and SemCom fields. We summarize the research problems in these two fields and provide a comprehensive review of the state of the art with a focus on their technical strengths and challenges.

Title: TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

Authors: Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18211
Pdf URL: https://arxiv.org/pdf/2411.18211
Copy Paste: [[2411.18211]] TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability(https://arxiv.org/abs/2411.18211)
Keywords: large language model
Abstract: Rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos. Additionally, TimeMarker utilizes diverse datasets, including further transformed temporal-related video QA datasets, to bolster its temporal understanding capabilities. Image and interleaved data are also employed to further enhance the model's semantic perception ability. Evaluations demonstrate that TimeMarker achieves state-of-the-art performance across multiple benchmarks, excelling in both short and long video categories. Our project page is at this https URL.

Title: PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis

Authors: Zak Buzzard, Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18225
Pdf URL: https://arxiv.org/pdf/2411.18225
Copy Paste: [[2411.18225]] PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis(https://arxiv.org/abs/2411.18225)
Keywords: transformer
Abstract: Computational analysis of whole slide images (WSIs) has seen significant research progress in recent years, with applications ranging across important diagnostic and prognostic tasks such as survival or cancer subtype prediction. Many state-of-the-art models process the entire slide - which may be as large as $150,000 \times 150,000$ pixels - as a bag of many patches, the size of which necessitates computationally cheap feature aggregation methods. However, a large proportion of these patches are uninformative, such as those containing only healthy or adipose tissue, adding significant noise and size to the bag. We propose Pathology Transformer with Hierarchical Selection (PATHS), a novel top-down method for hierarchical weakly supervised representation learning on slide-level tasks in computational pathology. PATHS is inspired by the cross-magnification manner in which a human pathologist examines a slide, recursively filtering patches at each magnification level to a small subset relevant to the diagnosis. Our method overcomes the complications of processing the entire slide, enabling quadratic self-attention and providing a simple interpretable measure of region importance. We apply PATHS to five datasets of The Cancer Genome Atlas (TCGA), and achieve superior performance on slide-level prediction tasks when compared to previous methods, despite processing only a small proportion of the slide.

Title: SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

Authors: Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, Rang Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18229
Pdf URL: https://arxiv.org/pdf/2411.18229
Copy Paste: [[2411.18229]] SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation(https://arxiv.org/abs/2411.18229)
Keywords: diffusion, generative
Abstract: We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth, generating depth maps with sharp boundaries yet only providing relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth's effectiveness, showing its ability to achieve both high depth accuracy and detailed representation, making it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.

Title: Thai Financial Domain Adaptation of THaLLE -- Technical Report

Authors: KBTG Labs, Atthakorn Petchsod, Pornchanan Balee, Danupat Khamnuansin, Anuruth Lertpiya, Chanatip Saetia, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18242
Pdf URL: https://arxiv.org/pdf/2411.18242
Copy Paste: [[2411.18242]] Thai Financial Domain Adaptation of THaLLE -- Technical Report(https://arxiv.org/abs/2411.18242)
Keywords: large language model
Abstract: Large Language Models (LLMs) excel in general tasks but struggle with domain-specific challenges, such as specialized terminology and localized regulations. Existing financial LLMs, like FinGPT and BloombergGPT, lack support for the Thai financial domain. We developed a Thai Financial LLM using the Investment Consultant (IC) exam dataset from the Stock Exchange of Thailand. To address dataset limitations, we applied data augmentation, ReLoRA for efficient training, Continued Pretraining (CPT) for domain knowledge, and Rank-Stabilized LoRA (rsLoRA) for fine-tuning. Supervised Fine-Tuning (SFT) simulated exam scenarios, while Direct Preference Optimization (DPO) refined the model using feedback. The model achieved scores of 72%, 72%, and 84% on IC exam levels P1, P2, and P3, respectively, demonstrating its effectiveness in Thai financial advisory tasks and its potential for specialized applications.

Title: Multimodal Integration of Longitudinal Noninvasive Diagnostics for Survival Prediction in Immunotherapy Using Deep Learning

Authors: Melda Yeghaian, Zuhir Bodalal, Daan van den Broek, John B A G Haanen, Regina G H Beets-Tan, Stefano Trebeschi, Marcel A J van Gerven
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.18253
Pdf URL: https://arxiv.org/pdf/2411.18253
Copy Paste: [[2411.18253]] Multimodal Integration of Longitudinal Noninvasive Diagnostics for Survival Prediction in Immunotherapy Using Deep Learning(https://arxiv.org/abs/2411.18253)
Keywords: transformer
Abstract: Purpose: Analyzing noninvasive longitudinal and multimodal data using artificial intelligence could potentially transform immunotherapy for cancer patients, paving the way towards precision medicine. Methods: In this study, we integrated pre- and on-treatment blood measurements, prescribed medications and CT-based volumes of organs from a large pan-cancer cohort of 694 patients treated with immunotherapy to predict short- and long-term overall survival. By leveraging a combination of recent developments, different variants of our extended multimodal transformer-based simple temporal attention (MMTSimTA) network were trained end-to-end to predict mortality at three, six, nine and twelve months. These models were also compared to baseline methods incorporating intermediate- and late-fusion integration methods. Results: The strongest prognostic performance was demonstrated using the extended transformer-based multimodal model, with areas under the curve (AUCs) of $0.84 \pm 0.04$, $0.83 \pm 0.02$, $0.82 \pm 0.02$, and $0.81 \pm 0.03$ for 3-, 6-, 9-, and 12-month survival prediction, respectively. Conclusion: Our findings suggest that analyzing integrated early treatment data has potential for predicting survival of immunotherapy patients. Integrating complementary noninvasive modalities into a jointly trained model, using our extended transformer-based architecture, demonstrated an improved multimodal prognostic performance, especially in short-term survival prediction.

Title: TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

Authors: Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18263
Pdf URL: https://arxiv.org/pdf/2411.18263
Copy Paste: [[2411.18263]] TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution(https://arxiv.org/abs/2411.18263)
Keywords: diffusion
Abstract: Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration and detail recovery is not satisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Second, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that TSD-SR has superior restoration results (best on most metrics) and the fastest inference speed (e.g., 40 times faster than SeeSR) compared to past Real-ISR approaches based on pre-trained diffusion priors.

Title: Hidden Data Privacy Breaches in Federated Learning

Authors: Xueluan Gong, Yuji Wang, Shuaike Li, Mengyuan Sun, Songze Li, Qian Wang, Kwok-Yan Lam, Chen Chen
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2411.18269
Pdf URL: https://arxiv.org/pdf/2411.18269
Copy Paste: [[2411.18269]] Hidden Data Privacy Breaches in Federated Learning(https://arxiv.org/abs/2411.18269)
Keywords: privacy, defense, attack, steal, federate
Abstract: Federated Learning (FL) emerged as a paradigm for conducting machine learning across broad and decentralized datasets, promising enhanced privacy by obviating the need for direct data sharing. However, recent studies show that attackers can steal private data through model manipulation or gradient analysis. Existing attacks are constrained by low theft quantity or low-resolution data, and they are often detected through anomaly monitoring in gradients or weights. In this paper, we propose a novel data-reconstruction attack leveraging malicious code injection, supported by two key techniques, i.e., distinctive and sparse encoding design and block partitioning. Unlike conventional methods that require detectable changes to the model, our method stealthily embeds a hidden model using parameter sharing to systematically extract sensitive data. The Fibonacci-based index design ensures efficient, structured retrieval of memorized data, while the block partitioning method enhances our method's capability to handle high-resolution images by dividing them into smaller, manageable units. Extensive experiments on 4 datasets confirmed that our method is superior to the five state-of-the-art data-reconstruction attacks under the five respective detection methods. Our method can handle large-scale and high-resolution data without being detected or mitigated by state-of-the-art data reconstruction defense methods. In contrast to baselines, our method can be directly applied to both FedAVG and FedSGD scenarios, underscoring the need for developers to devise new defenses against such vulnerabilities. We will open-source our code upon acceptance.

Title: Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents

Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18270
Pdf URL: https://arxiv.org/pdf/2411.18270
Copy Paste: [[2411.18270]] Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents(https://arxiv.org/abs/2411.18270)
Keywords: transformer
Abstract: Recent advances in multimodal models have demonstrated impressive capabilities in object recognition and scene understanding. However, these models often struggle with precise spatial localization - a critical capability for real-world applications. Inspired by how humans use grid-based references like chess boards and maps, we propose introducing explicit visual position encoding through a simple grid overlay approach. By adding a 9x9 black grid pattern onto input images, our method provides visual spatial guidance analogous to how positional encoding works in transformers, but in an explicit, visual form. Experiments on the COCO 2017 dataset demonstrate that our grid-based approach achieves significant improvements in localization accuracy, with a 107.4% increase in IoU (from 0.27 to 0.56) and a 194.4% improvement in GIoU (from 0.18 to 0.53) compared to baseline performance. Through attention visualization analysis, we show how this visual position encoding helps models better ground spatial relationships. Our method's simplicity and effectiveness make it particularly valuable for applications requiring accurate spatial reasoning, such as robotic manipulation, medical imaging, and autonomous navigation.
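
The grid overlay and the reported relative gains can be illustrated with a minimal sketch. It assumes the image is a plain 2D list of grayscale values and that a "9x9 grid" means nine cells per side. The names `overlay_grid` and `relative_gain_percent`, the one-pixel line thickness, and drawing the outer border are illustrative assumptions, not the authors' implementation.

```python
def overlay_grid(image, cells=9, line_value=0):
    """Return a copy of `image` (list of rows of ints) with a dark grid drawn on it.

    Assumption: grid lines are one pixel thick and sit on evenly spaced
    cell boundaries, including the outer border.
    """
    height = len(image)
    width = len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched

    # Pixel coordinates of the cells + 1 boundaries, clamped to the last index.
    ys = {min(round(k * height / cells), height - 1) for k in range(cells + 1)}
    xs = {min(round(k * width / cells), width - 1) for k in range(cells + 1)}

    for y in range(height):
        for x in range(width):
            if y in ys or x in xs:
                out[y][x] = line_value
    return out


def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    white = [[255] * 90 for _ in range(90)]
    gridded = overlay_grid(white)
    # Number of grid-line pixels crossed by a row that is not itself a grid line.
    print(sum(v == 0 for v in gridded[45]))
    # Relative gains for the IoU and GIoU values quoted in the abstract.
    print(round(relative_gain_percent(0.27, 0.56), 1))
    print(round(relative_gain_percent(0.18, 0.53), 1))
```

The percentages in the abstract are relative gains: (0.56 - 0.27) / 0.27 for IoU and (0.53 - 0.18) / 0.18 for GIoU.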

Title: Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

Authors: Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18275
Pdf URL: https://arxiv.org/pdf/2411.18275
Copy Paste: [[2411.18275]] Visual Adversarial Attack on Vision-Language Models for Autonomous Driving(https://arxiv.org/abs/2411.18275)
Keywords: attack, large language model
Abstract: Vision-language models (VLMs) have significantly advanced autonomous driving (AD) by enhancing reasoning capabilities. However, these models remain highly vulnerable to adversarial attacks. While existing research has primarily focused on general VLM attacks, the development of attacks tailored to the safety-critical AD context has been largely overlooked. In this paper, we take the first step toward designing adversarial attacks specifically targeting VLMs in AD, exposing the substantial risks these attacks pose within this critical domain. We identify two unique challenges for effective adversarial attacks on AD VLMs: the variability of textual instructions and the time-series nature of visual scenarios. To this end, we propose ADvLM, the first visual adversarial attack framework specifically designed for VLMs in AD. Our framework introduces Semantic-Invariant Induction, which uses a large language model to create a diverse prompt library of textual instructions with consistent semantic content, guided by semantic entropy. Building on this, we introduce Scenario-Associated Enhancement, an approach where attention mechanisms select key frames and perspectives within driving scenarios to optimize adversarial perturbations that generalize across the entire scenario. Extensive experiments on several AD VLMs over multiple benchmarks show that ADvLM achieves state-of-the-art attack effectiveness. Moreover, real-world attack studies further validate its applicability and potential in practice.

Title: Neutralizing Backdoors through Information Conflicts for Large Language Models

Authors: Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, Kwok-Yan Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18280
Pdf URL: https://arxiv.org/pdf/2411.18280
Copy Paste: [[2411.18280]] Neutralizing Backdoors through Information Conflicts for Large Language Models(https://arxiv.org/abs/2411.18280)
Keywords: defense, attack, robust, large language model
Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often suffer from drawbacks that they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove to be ineffective against advanced attacks like multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors by embedding contradictory information within the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experimental results on classification and conversational tasks across 4 widely used LLMs demonstrate that our method outperforms 8 state-of-the-art backdoor defense baselines. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy. Furthermore, our method has proven to be robust against adaptive backdoor attacks. The code will be open-sourced upon publication.

Title: MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

Authors: Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, He Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18281
Pdf URL: https://arxiv.org/pdf/2411.18281
Copy Paste: [[2411.18281]] MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation(https://arxiv.org/abs/2411.18281)
Keywords: large language model
Abstract: Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in ID-preserving, high-quality video generation.

Title: Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

Authors: Chen Zhou, Peng Cheng, Junfeng Fang, Yifan Zhang, Yibo Yan, Xiaojun Jia, Yanyan Xu, Kun Wang, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18288
Pdf URL: https://arxiv.org/pdf/2411.18288
Copy Paste: [[2411.18288]] Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks(https://arxiv.org/abs/2411.18288)
Keywords: robust, extraction, fair
Abstract: Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these "optimization techniques". Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate the training "techniques", which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models into dual-modality models, integrating our advanced training techniques.

Title: HiFiVFS: High Fidelity Video Face Swapping

Authors: Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18293
Pdf URL: https://arxiv.org/pdf/2411.18293
Copy Paste: [[2411.18293]] HiFiVFS: High Fidelity Video Face Swapping(https://arxiv.org/abs/2411.18293)
Keywords: diffusion, generative
Abstract: Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) in video face swapping, both qualitatively and quantitatively.

Title: Aligning Pre-trained Models for Spoken Language Translation

Authors: Šimon Sedláček, Santosh Kesiraju, Alexander Polok, Jan Černocký
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18294
Pdf URL: https://arxiv.org/pdf/2411.18294
Copy Paste: [[2411.18294]] Aligning Pre-trained Models for Spoken Language Translation(https://arxiv.org/abs/2411.18294)
Keywords: transformer
Abstract: This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder while being the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset as we investigate the alignment approach in a small-scale scenario focusing on ST. While keeping the size of the connector module constant and small in comparison (< 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this approach represents a viable and scalable approach to training end-to-end ST systems.

Title: Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

Authors: Tianyi Wei, Dongdong Chen, Yifan Zhou, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18301
Pdf URL: https://arxiv.org/pdf/2411.18301
Copy Paste: [[2411.18301]] Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation(https://arxiv.org/abs/2411.18301)
Keywords: diffusion, transformer
Abstract: Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at this https URL.

Title: InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Authors: Wenjie Zhuo, Fan Ma, Hehe Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18303
Pdf URL: https://arxiv.org/pdf/2411.18303
Copy Paste: [[2411.18303]] InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation(https://arxiv.org/abs/2411.18303)
Keywords: diffusion
Abstract: We present InfiniDreamer, a novel framework for arbitrarily long human motion generation. InfiniDreamer addresses the limitations of current motion generation methods, which are typically restricted to short sequences due to the lack of long motion training data. To achieve this, we first generate sub-motions corresponding to each textual description and then assemble them into a coarse, extended sequence using randomly initialized transition segments. We then introduce an optimization-based method called Segment Score Distillation (SSD) to refine the entire long motion sequence. SSD is designed to utilize an existing motion prior, which is trained only on short clips, in a training-free manner. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
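
The idea of working on overlapping short segments of a long sequence can be sketched as follows. This shows only the windowing step, using deterministic sliding windows as a stand-in for the sampling described in the abstract. The function name `overlapping_windows` and the `stride` parameter are hypothetical, and the diffusion-based refinement itself is not modeled.

```python
def overlapping_windows(total_len, seg_len, stride):
    """Return (start, end) index pairs of fixed-length windows covering a sequence.

    Assumption: windows advance by `stride` frames; a final window aligned to
    the end of the sequence is appended if the regular steps do not reach it.
    """
    if seg_len > total_len:
        raise ValueError("segment is longer than the sequence")
    if not 0 < stride <= seg_len:
        raise ValueError("stride must be in (0, seg_len] so no frame is skipped")

    starts = list(range(0, total_len - seg_len + 1, stride))
    if starts[-1] != total_len - seg_len:
        starts.append(total_len - seg_len)
    return [(s, s + seg_len) for s in starts]


def coverage(total_len, windows):
    """Count how many windows touch each frame index."""
    hits = [0] * total_len
    for start, end in windows:
        for i in range(start, end):
            hits[i] += 1
    return hits


if __name__ == "__main__":
    wins = overlapping_windows(total_len=11, seg_len=4, stride=2)
    print(wins)
    print(coverage(11, wins))
```

With a stride smaller than the segment length, interior frames fall into more than one window, so a transition between two sub-motions is seen together with context from both sides.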

Title: Real-time Video Target Tracking Algorithm Utilizing Convolutional Neural Networks (CNN)

Authors: Chaoyi Tan, Xiangtian Li, Xiaobo Wang, Zhen Qi, Ao Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18314
Pdf URL: https://arxiv.org/pdf/2411.18314
Copy Paste: [[2411.18314]] Real-time Video Target Tracking Algorithm Utilizing Convolutional Neural Networks (CNN)(https://arxiv.org/abs/2411.18314)
Keywords: robust
Abstract: This paper aims to research and implement a real-time video target tracking algorithm based on Convolutional Neural Networks (CNN), enhancing the accuracy and robustness of target tracking in complex this http URL algorithms in handling issues such as target occlusion, morphological changes, and background interference, our this http URL continuously updates the target model through an online learning mechanism to adapt to changes in the target's this http URL, when dealing with situations involving rapid motion, partial occlusion, and complex backgrounds, the proposed algorithm exhibits higher tracking success rates and lower failure rates this http URL study successfully applies CNN to real-time video target tracking, improving the accuracy and stability of the tracking algorithm while maintaining high processing speeds, thus this http URL is expected to provide new solutions for target tracking tasks in video surveillance and intelligent transportation domains.

Title: RITA: Automatic Framework for Designing of Resilient IoT Applications

Authors: Luis Eduardo Pessoa, Cristovao Freitas Iglesias Jr, Claudio Miceli
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18324
Pdf URL: https://arxiv.org/pdf/2411.18324
Copy Paste: [[2411.18324]] RITA: Automatic Framework for Designing of Resilient IoT Applications(https://arxiv.org/abs/2411.18324)
Keywords: security, privacy, robust
Abstract: Designing resilient Internet of Things (IoT) systems requires i) identification of IoT Critical Objects (ICOs) such as services, devices, and resources, ii) threat analysis, and iii) mitigation strategy selection. However, the traditional process for designing resilient IoT systems is still manual, leading to inefficiencies and increased risks. In addition, while tools such as ChatGPT could support this manual and highly error-prone process, their use raises concerns over data privacy, inconsistent outputs, and internet dependence. Therefore, we propose RITA, an automated, open-source framework that uses a fine-tuned RoBERTa-based Named Entity Recognition (NER) model to identify ICOs from IoT requirement documents, correlate threats, and recommend countermeasures. RITA operates entirely offline and can be deployed on-site, safeguarding sensitive information and delivering consistent outputs that enhance standardization. In our empirical evaluation, RITA outperformed ChatGPT in four of seven ICO categories, particularly in actuator, sensor, network resource, and service identification, using both human-annotated and ChatGPT-generated test data. These findings indicate that RITA can improve resilient IoT design by effectively supporting key security operations, offering a practical solution for developing robust IoT architectures.

Title: Using Malware Detection Techniques for HPC Application Classification

Authors: Thomas Jakobsche, Florina M. Ciorba
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2411.18327
Pdf URL: https://arxiv.org/pdf/2411.18327
Copy Paste: [[2411.18327]] Using Malware Detection Techniques for HPC Application Classification(https://arxiv.org/abs/2411.18327)
Keywords: security
Abstract: HPC systems face security and compliance challenges, particularly in preventing waste and misuse of computational resources by unauthorized or malicious software that deviates from allocation purpose. Existing methods to classify applications based on job names or resource usage are often unreliable or fail to capture applications that have different behavior due to different inputs or system noise. This research proposes an approach that uses similarity-preserving fuzzy hashes to classify HPC application executables. By comparing the similarity of SSDeep fuzzy hashes, a Random Forest Classifier can accurately label applications executing on HPC systems including unknown samples. We evaluate the Fuzzy Hash Classifier on a dataset of 92 application classes and 5333 distinct application samples. The proposed method achieved a macro f1-score of 90% (micro f1-score: 89%, weighted f1-score: 90%). Our approach addresses the critical need for more effective application classification in HPC environments, minimizing resource waste, and enhancing security and compliance.
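
The abstract reports macro, micro, and weighted F1-scores. The sketch below shows how these three averages differ for single-label multiclass predictions. The function name `f1_scores` and the toy labels `app_a` and `app_b` are invented for illustration and are unrelated to the paper's dataset or code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)

    macro = sum(per_class.values()) / len(labels)  # every class counts equally
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))  # pooled counts
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, micro, weighted


if __name__ == "__main__":
    # Toy labels only: four samples of one class, two of another.
    y_true = ["app_a"] * 4 + ["app_b"] * 2
    y_pred = ["app_a"] * 3 + ["app_b"] * 3
    macro, micro, weighted = f1_scores(y_true, y_pred)
    print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

Macro averaging treats all classes equally regardless of size, whereas micro averaging pools every decision and, in the single-label case, equals plain accuracy. This is why the three numbers can differ on a dataset with unevenly sized classes.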

Title: EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

Authors: Meiqi Cao, Xiangbo Shu, Jiachao Zhang, Rui Yan, Zechao Li, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18328
Pdf URL: https://arxiv.org/pdf/2411.18328
Copy Paste: [[2411.18328]] EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond(https://arxiv.org/abs/2411.18328)
Keywords: privacy
Abstract: Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse unconstructed event points directly. However, these two regimes are blind to a fundamental issue: failing to accommodate the unique dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, i.e., EventCrab, that adeptly integrates the "lighter" frame-specific networks for dense event frames with the "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. Specifically, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the "heavier" point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams, and ii) an Event Point Encoder (EPE) that further explores event-point long spatiotemporal features in a Hilbert-scan way. Experiments on four datasets demonstrate the significant performance of our proposed EventCrab, particularly gaining improvements of 5.17% on SeAct and 7.01% on HARDVS.

Title: Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Authors: T.G.D.K. Sumanathilaka, Nicholas Micallef, Julian Hough
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18337
Pdf URL: https://arxiv.org/pdf/2411.18337
Copy Paste: [[2411.18337]] Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation(https://arxiv.org/abs/2411.18337)
Keywords: large language model
Abstract: Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-the-loop approach for prompt augmentation, where the prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering, and few-shot prompting to guide the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.

Title: FreqX: What neural networks learn is what network designers say

Authors: Zechen Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18343
Pdf URL: https://arxiv.org/pdf/2411.18343
Copy Paste: [[2411.18343]] FreqX: What neural networks learn is what network designers say(https://arxiv.org/abs/2411.18343)
Keywords: privacy, fair, interpretability
Abstract: Personalized Federated Learning (PFL) allows clients to cooperatively train a personalized model without disclosing their private datasets. However, PFL suffers from non-IID data, heterogeneous devices, lack of fairness, and unclear contributions, challenges that urgently call for interpretability of deep learning models. These challenges place new demands on interpretability methods: low cost, privacy, and detailed information. No current interpretability method satisfies all of them. In this paper, we propose a novel interpretability method, FreqX, by introducing Signal Processing and Information Theory. Our experiments show that the explanation results of FreqX contain both attribution information and concept information. FreqX runs at least 10 times faster than the baselines which contain concept information.

Title: TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Authors: Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18350
Pdf URL: https://arxiv.org/pdf/2411.18350
Copy Paste: [[2411.18350]] TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models(https://arxiv.org/abs/2411.18350)
Keywords: diffusion, generative
Abstract: This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baseline methods based on pose transfer and virtual try-on with fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction. Demo, code, and models are available at: this https URL

Title: ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Authors: Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18363
Pdf URL: https://arxiv.org/pdf/2411.18363
Copy Paste: [[2411.18363]] ChatRex: Taming Multimodal LLM for Joint Perception and Understanding(https://arxiv.org/abs/2411.18363)
Keywords: large language model
Abstract: Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g., the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate on the COCO dataset, limiting many tasks requiring the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of both perception and understanding in MLLMs. Code is available at this https URL.
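
Note: The abstract describes detection as retrieval: the model emits indices into a list of proposal boxes instead of regressing coordinates. The following is a minimal sketch of the decoding step only. The function name `boxes_from_indices`, the box format `(x1, y1, x2, y2)`, and the sample values are assumptions for illustration, not ChatRex's actual interface.

```python
def boxes_from_indices(proposals, predicted_indices):
    """Map index tokens emitted by the language model back to proposal boxes."""
    selected = []
    for idx in predicted_indices:
        # Skip indices that do not correspond to any proposal.
        if 0 <= idx < len(proposals):
            selected.append(proposals[idx])
    return selected

# Hypothetical proposals from a proposal network, as (x1, y1, x2, y2).
proposals = [(10, 20, 50, 80), (60, 15, 120, 90), (5, 5, 30, 30)]

# Suppose the model answered a query with proposal indices 2 and 0.
print(boxes_from_indices(proposals, [2, 0]))
```

Under this framing the language model never produces coordinates itself, so localization accuracy depends on the proposals it is given.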

Title: GPT as ghostwriter at the White House

Authors: Jacques Savoy
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2411.18365
Pdf URL: https://arxiv.org/pdf/2411.18365
Copy Paste: [[2411.18365]] GPT as ghostwriter at the White House(https://arxiv.org/abs/2411.18365)
Keywords: large language model
Abstract: Recently, several large language models (LLMs) have demonstrated their capability to generate a message in response to a user request. Such scientific breakthroughs promote new perspectives but also some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT 3.5 by comparing its generated messages with those of the recent US presidents. To achieve this objective, we compare the State of the Union addresses written by Reagan to Obama with those automatically produced by ChatGPT. We found that ChatGPT tends to overuse the lemma "we" as well as nouns and commas. On the other hand, the generated speeches employ fewer verbs and include, on average, longer sentences. Even when imposing a given style on ChatGPT, the resulting speech remains distinct from messages written by the target author. Moreover, ChatGPT opts for a neutral tone with mainly positive emotional expressions and symbolic terms (e.g., freedom, nation). Finally, we show that GPT's style exposes distinct features compared to real presidential addresses.

Title: Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

Authors: Yiming Wu, Huan Wang, Zhenghao Chen, Dong Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.18375
Pdf URL: https://arxiv.org/pdf/2411.18375
Copy Paste: [[2411.18375]] Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models(https://arxiv.org/abs/2411.18375)
Keywords: diffusion
Abstract: The high computational cost and slow inference time are major obstacles to deploying the video diffusion model (VDM) in practical applications. To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics, e.g., coherence of the entire video, while shallower layers are more focused on individual content, e.g., individual frames. Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss so that VDMini (the student) attains generation performance comparable to the larger VDM (the teacher). Particularly, we first use the Individual Content Distillation (ICD) Loss to ensure consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5× and 1.4× speedup for the I2V method SF-V and the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, i.e., UCF101 and VBench.

Title: Preserving Deep Representations In One-Shot Pruning: A Hessian-Free Second-Order Optimization Framework

Authors: Ryan Lucas, Rahul Mazumder
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18376
Pdf URL: https://arxiv.org/pdf/2411.18376
Copy Paste: [[2411.18376]] Preserving Deep Representations In One-Shot Pruning: A Hessian-Free Second-Order Optimization Framework(https://arxiv.org/abs/2411.18376)
Keywords: transformer
Abstract: We present SNOWS, a one-shot post-training pruning framework aimed at reducing the cost of vision network inference without retraining. Current leading one-shot pruning methods minimize layer-wise least squares reconstruction error, which does not take into account deeper network representations. We propose to optimize a more global reconstruction objective. This objective accounts for nonlinear activations deep in the network to obtain a better proxy for the network loss. This nonlinear objective leads to a more challenging optimization problem -- we demonstrate it can be solved efficiently using a specialized second-order optimization framework. A key innovation of our framework is the use of Hessian-free optimization to compute exact Newton descent steps without needing to compute or store the full Hessian matrix. A distinct advantage of SNOWS is that it can be readily applied on top of any sparse mask derived from prior methods, readjusting their weights to exploit nonlinearities in deep feature representations. SNOWS obtains state-of-the-art results on various one-shot pruning benchmarks including residual networks and Vision Transformers (ViT/B-16 and ViT/L-16, 86M and 304M parameters respectively).
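
Note: Hessian-free optimization, as mentioned in the abstract, generally means solving the Newton system H d = -g with an iterative solver that needs only Hessian-vector products, so H is never formed. The following is a generic sketch of that idea using conjugate gradients on a toy quadratic. It is not the SNOWS algorithm: the function names, the finite-difference Hessian-vector product (an approximation; automatic differentiation would normally be used), and the toy objective are all assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hvp(grad_fn, x, v, eps=1e-6):
    """Approximate H(x) @ v by a central difference of the gradient."""
    g_plus = grad_fn([xi + eps * vi for xi, vi in zip(x, v)])
    g_minus = grad_fn([xi - eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]

def newton_step_cg(grad_fn, x, iters=50, tol=1e-10):
    """Solve H d = -g by conjugate gradients using only Hessian-vector products."""
    g = grad_fn(x)
    d = [0.0] * len(x)
    r = [-gi for gi in g]  # residual of H d = -g at d = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        hp = hvp(grad_fn, x, p)
        alpha = rs / dot(p, hp)
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return d

# Toy objective f(x) = 0.5 * x^T A x - b^T x with A symmetric positive definite.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def grad(x):
    return [dot(row, x) - bi for row, bi in zip(A, b)]

x0 = [0.0, 0.0]
step = newton_step_cg(grad, x0)
x1 = [xi + di for xi, di in zip(x0, step)]
print(x1)  # one Newton step reaches the minimizer of a quadratic (up to numerical error)
```

The solver touches the Hessian only through `hvp`, so memory grows with the number of parameters rather than its square.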

Title: ChatGPT as speechwriter for the French presidents

Authors: Dominique Labbé, Cyril Labbé, Jacques Savoy
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2411.18382
Pdf URL: https://arxiv.org/pdf/2411.18382
Copy Paste: [[2411.18382]] ChatGPT as speechwriter for the French presidents(https://arxiv.org/abs/2411.18382)
Keywords: generative, large language model
Abstract: Generative AI proposes several large language models (LLMs) to automatically generate a message in response to users' requests. Such scientific breakthroughs promote new writing assistants but also raise some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT by comparing its generated messages with those of the recent French presidents. To achieve this, we compare end-of-the-year addresses written by Chirac, Sarkozy, Hollande, and Macron with those automatically produced by ChatGPT. We found that ChatGPT tends to overuse nouns, possessive determiners, and numbers. On the other hand, the generated speeches employ fewer verbs, pronouns, and adverbs and include, on average, overly standardized sentences. Considering some words, one can observe that ChatGPT tends to overuse "to must" (devoir), "to continue" or the lemma "we" (nous). Moreover, GPT underuses the auxiliary verb "to be" (être), or the modal verbs "to will" (vouloir) or "to have to" (falloir). In addition, when a short text is provided as an example to ChatGPT, the machine can generate a short message with a style close to the original wording. Finally, we reveal that ChatGPT's style exposes distinct features compared to real presidential speeches.

Title: Topic Modeling and Sentiment Analysis on Japanese Online Media's Coverage of Nuclear Energy

Authors: Yifan Sun, Hirofumi Tsuruta, Masaya Kumagai, Ken Kurosaki
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2411.18383
Pdf URL: https://arxiv.org/pdf/2411.18383
Copy Paste: [[2411.18383]] Topic Modeling and Sentiment Analysis on Japanese Online Media's Coverage of Nuclear Energy(https://arxiv.org/abs/2411.18383)
Keywords: large language model
Abstract: Thirteen years after the Fukushima Daiichi nuclear power plant accident, Japan's nuclear energy accounts for only approximately 6% of electricity production, as most nuclear plants remain shut down. To revitalize the nuclear industry and achieve sustainable development goals, effective communication with Japanese citizens, grounded in an accurate understanding of public sentiment, is of paramount importance. While nationwide surveys have traditionally been used to gauge public views, the rise of social media in recent years has provided a promising new avenue for understanding public sentiment. To explore domestic sentiment on nuclear energy-related issues expressed online, we analyzed the content and comments of over 3,000 YouTube videos covering topics related to nuclear energy. Topic modeling was used to extract the main topics from the videos, and sentiment analysis with large language models classified user sentiments towards each topic. Additionally, word co-occurrence network analysis was performed to examine the shift in online discussions during August and September 2023 regarding the release of treated water. Overall, our results provide valuable insights into the online discourse on nuclear energy and contribute to a more comprehensive understanding of public sentiment in Japan.

Title: Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

Authors: Shivam Pal, Aishwarya Gupta, Saqib Sarwar, Piyush Rai
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2411.18385
Pdf URL: https://arxiv.org/pdf/2411.18385
Copy Paste: [[2411.18385]] Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization(https://arxiv.org/abs/2411.18385)
Keywords: federate
Abstract: Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

Title: Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication

Authors: Davide Garassino, Vivana Masia, Nicola Brocca, Alice Delorme Benites
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2411.18403
Pdf URL: https://arxiv.org/pdf/2411.18403
Copy Paste: [[2411.18403]] Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication(https://arxiv.org/abs/2411.18403)
Keywords: large language model
Abstract: This paper aims to provide a comparison between texts produced by French and Italian politicians on polarizing issues, such as immigration and the European Union, and their chatbot counterparts created with ChatGPT 3.5. In this study, we focus on implicit communication, in particular on presuppositions and their functions in discourse, which have been considered in the literature as a potential linguistic feature of manipulation. This study also aims to contribute to the emerging literature on the pragmatic competences of Large Language Models.

Title: Deep Fourier-embedded Network for Bi-modal Salient Object Detection

Authors: Pengfei Lyu, Xiaosheng Yu, Chengdong Wu, Jagath C. Rajapakse
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18409
Pdf URL: https://arxiv.org/pdf/2411.18409
Copy Paste: [[2411.18409]] Deep Fourier-embedded Network for Bi-modal Salient Object Detection(https://arxiv.org/abs/2411.18409)
Keywords: transformer
Abstract: The rapid development of deep learning has significantly improved salient object detection combining both RGB and thermal images. However, existing deep learning-based models suffer from two major shortcomings. First, the computation and memory demands of Transformer-based models with quadratic complexity are unbearable, especially in handling high-resolution bi-modal feature fusion. Second, even if learning converges to an ideal solution, there remains a frequency gap between the prediction and ground truth. Therefore, we propose a purely fast Fourier transform-based model, namely deep Fourier-embedded network (DFENet), for learning bi-modal information of RGB and thermal images. On one hand, fast Fourier transform efficiently fetches global dependencies with low complexity. Inspired by this, we design modal-coordinated perception attention to fuse the frequency gap between RGB and thermal modalities with multi-dimensional representation enhancement. To obtain reliable detailed information during decoding, we design the frequency-decomposed edge-aware module (FEM) to clarify object edges by deeply decomposing low-level features. Moreover, we equip each decoder layer with the proposed Fourier residual channel attention block to prioritize high-frequency information while aligning channel global relationships. On the other hand, we propose co-focus frequency loss (CFL) to steer FEM towards minimizing the frequency gap. CFL dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing the bi-modal edge information in the Fourier domain. This frequency-level refinement of edge features further contributes to the quality of the final pixel-level prediction. Extensive experiments on four bi-modal salient object detection benchmark datasets demonstrate that our proposed DFENet outperforms twelve existing state-of-the-art models.

Title: Adaptive Blind All-in-One Image Restoration

Authors: David Serrano-Lozano, Luis Herranz, Shaolin Su, Javier Vazquez-Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18412
Pdf URL: https://arxiv.org/pdf/2411.18412
Copy Paste: [[2411.18412]] Adaptive Blind All-in-One Image Restoration(https://arxiv.org/abs/2411.18412)
Keywords: segmentation
Abstract: Blind all-in-one image restoration models aim to recover a high-quality image from an input degraded with unknown distortions. However, these models require all the possible degradation types to be defined during the training stage while showing limited generalization to unseen degradations, which limits their practical application in complex cases. In this paper, we propose a simple but effective adaptive blind all-in-one restoration (ABAIR) model, which can address multiple degradations, generalizes well to unseen degradations, and efficiently incorporates new degradations by training a small fraction of parameters. First, we train our baseline model on a large dataset of natural images with multiple synthetic degradations, augmented with a segmentation head to estimate per-pixel degradation types, resulting in a powerful backbone able to generalize to a wide range of degradations. Second, we adapt our baseline model to varying image restoration tasks using independent low-rank adapters. Third, we learn to adaptively combine adapters for versatile images via a flexible and lightweight degradation estimator. Our model is both powerful in handling specific distortions and flexible in adapting to complex tasks. It not only outperforms the state-of-the-art by a large margin on five- and three-task IR setups, but also shows improved generalization to unseen degradations and composite distortions.

Title: FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving

Authors: Ao Shen, Zhiyao Li, Mingyu Gao
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2411.18424
Pdf URL: https://arxiv.org/pdf/2411.18424
Copy Paste: [[2411.18424]] FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving(https://arxiv.org/abs/2411.18424)
Keywords: fair, large language model
Abstract: Serving numerous users and requests concurrently requires good fairness in Large Language Model (LLM) serving systems. This ensures that, at the same cost, the system can meet the Service Level Objectives (SLOs) of more users, such as time to first token (TTFT) and time between tokens (TBT), rather than allowing a few users to experience performance far exceeding the SLOs. To achieve better fairness, the preemption-based scheduling policy dynamically adjusts the priority of each request to maintain balance during runtime. However, existing systems tend to overly prioritize throughput, overlooking the overhead caused by preemption-induced context switching, which is crucial for maintaining fairness through priority adjustments. In this work, we identify three main challenges that result in this overhead: 1) inadequate I/O utilization, 2) GPU idleness, and 3) unnecessary I/O transmission during multi-turn conversations. Our key insight is that the block-based KV cache memory policy in existing systems, while achieving near-zero memory waste, leads to discontinuity and insufficient granularity in the KV cache memory. In response, we introduce FastSwitch, a fairness-aware serving system that not only aligns with existing KV cache memory allocation policy but also mitigates context switching overhead. Our evaluation shows that FastSwitch outperforms the state-of-the-art LLM serving system vLLM with speedups of 1.4-11.2x across different tail TTFT and TBT.

Title: Streamlining Prediction in Bayesian Deep Learning

Authors: Rui Li, Marcus Klasson, Arno Solin, Martin Trapp
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18425
Pdf URL: https://arxiv.org/pdf/2411.18425
Copy Paste: [[2411.18425]] Streamlining Prediction in Bayesian Deep Learning(https://arxiv.org/abs/2411.18425)
Keywords: transformer
Abstract: The rising interest in Bayesian deep learning (BDL) has led to a plethora of methods for estimating the posterior distribution. However, efficient computation of inferences, such as predictions, has been largely overlooked, with Monte Carlo integration remaining the standard. In this work we examine streamlining prediction in BDL through a single forward pass without sampling. For this we use local linearisation on activation functions and local Gaussian approximations at linear layers. This allows us to analytically compute an approximation to the posterior predictive distribution. We showcase our approach for both MLPs and transformers, such as ViT and GPT-2, and assess its performance on regression and classification tasks.
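
Note: The sampling-free idea in the abstract can be illustrated by pushing a Gaussian through a network analytically. The following is a one-dimensional sketch under a simplifying assumption that the weights are fixed and only the input is uncertain; the abstract's setting, where the weights carry posterior uncertainty, is more involved. Function names and numbers are illustrative, not the authors' code.

```python
import math

def linear_gaussian(mean, var, w, b):
    """Push N(mean, var) through y = w * x + b (exact for a Gaussian input)."""
    return w * mean + b, (w ** 2) * var

def tanh_linearised(mean, var):
    """Push N(mean, var) through tanh using a first-order expansion at the mean."""
    slope = 1.0 - math.tanh(mean) ** 2  # derivative of tanh evaluated at the mean
    return math.tanh(mean), (slope ** 2) * var

# One deterministic forward pass tracks a mean and a variance instead of samples.
m, v = 0.5, 0.04
m, v = linear_gaussian(m, v, w=2.0, b=-1.0)
m, v = tanh_linearised(m, v)
print(m, v)
```

The linearisation is accurate when the input variance is small relative to the curvature of the activation near the mean; Monte Carlo sampling avoids that approximation at the cost of many forward passes.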

Title: Metric-DST: Mitigating Selection Bias Through Diversity-Guided Semi-Supervised Metric Learning

Authors: Yasin I. Tepeli, Mathijs de Wolf, Joana P. Goncalves
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18442
Pdf URL: https://arxiv.org/pdf/2411.18442
Copy Paste: [[2411.18442]] Metric-DST: Mitigating Selection Bias Through Diversity-Guided Semi-Supervised Metric Learning(https://arxiv.org/abs/2411.18442)
Keywords: robust, fair
Abstract: Selection bias poses a critical challenge for fairness in machine learning, as models trained on data that is less representative of the population might exhibit undesirable behavior for underrepresented profiles. Semi-supervised learning strategies like self-training can mitigate selection bias by incorporating unlabeled data into model training to gain further insight into the distribution of the population. However, conventional self-training seeks to include high-confidence data samples, which may reinforce existing model bias and compromise effectiveness. We propose Metric-DST, a diversity-guided self-training strategy that leverages metric learning and its implicit embedding space to counter confidence-based bias through the inclusion of more diverse samples. Metric-DST learned more robust models in the presence of selection bias for generated and real-world datasets with induced bias, as well as a molecular biology prediction task with intrinsic bias. The Metric-DST learning strategy offers a flexible and widely applicable solution to mitigate selection bias and enhance fairness of machine learning models.

Title: Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

Authors: Frederic Kirstein, Terry Ruas, Bela Gipp
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18444
Pdf URL: https://arxiv.org/pdf/2411.18444
Copy Paste: [[2411.18444]] Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator(https://arxiv.org/abs/2411.18444)
Keywords: large language model
Abstract: The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaptation of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation as the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.

Title: Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

Authors: Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas
Subjects: cs.LG, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.18447
Pdf URL: https://arxiv.org/pdf/2411.18447
Copy Paste: [[2411.18447]] Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation(https://arxiv.org/abs/2411.18447)
Keywords: robust, generative
Abstract: Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.

Title: Advancements in Myocardial Infarction Detection and Classification Using Wearable Devices: A Comprehensive Review

Authors: Abhijith S, Arjun Rajesh, Mansi Manoj, Sandra Davis Kollannur, Sujitta R V, Jerrin Thomas Panachakel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18451
Pdf URL: https://arxiv.org/pdf/2411.18451
Copy Paste: [[2411.18451]] Advancements in Myocardial Infarction Detection and Classification Using Wearable Devices: A Comprehensive Review(https://arxiv.org/abs/2411.18451)
Keywords: attack
Abstract: Myocardial infarction (MI), commonly known as a heart attack, is a critical health condition caused by restricted blood flow to the heart. Early-stage detection through continuous ECG monitoring is essential to minimize irreversible damage. This review explores advancements in MI classification methodologies for wearable devices, emphasizing their potential in real-time monitoring and early diagnosis. It critically examines traditional approaches, such as morphological filtering and wavelet decomposition, alongside cutting-edge techniques, including Convolutional Neural Networks (CNNs) and VLSI-based methods. By synthesizing findings on machine learning, deep learning, and hardware innovations, this paper highlights their strengths, limitations, and future prospects. The integration of these techniques into wearable devices offers promising avenues for efficient, accurate, and energy-aware MI detection, paving the way for next-generation wearable healthcare solutions.

Title: Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification

Authors: José Fernando Núñez, Jamie Arjona, Javier Béjar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18456
Pdf URL: https://arxiv.org/pdf/2411.18456
Copy Paste: [[2411.18456]] Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification(https://arxiv.org/abs/2411.18456)
Keywords: diffusion, generative
Abstract: Deep learning models need a sufficient amount of data in order to be able to find the hidden patterns in it. It is the purpose of generative modeling to learn the data distribution, thus allowing us to sample more data and augment the original dataset. In the context of physiological data, and more specifically electrocardiogram (ECG) data, given its sensitive nature and expensive data collection, we can exploit the benefits of generative models in order to enlarge existing datasets and improve downstream tasks, in our case, classification of heart rhythm. In this work, we explore the usefulness of synthetic data generated with different generative models from Deep Learning, namely Diffwave, Time-Diffusion and Time-VQVAE, in order to obtain better classification results for two open source multivariate ECG datasets. Moreover, we also investigate the effects of transfer learning, by fine-tuning a synthetically pre-trained model and then progressively adding increasing proportions of real data. We conclude that although the synthetic samples resemble the real ones, the classification improvement when simply augmenting the real dataset is barely noticeable on individual datasets, but when both datasets are merged the results show an increase across all metrics for the classifiers when using synthetic samples as augmented data. From the fine-tuning results the Time-VQVAE generative model has been shown to be superior to the others but not powerful enough to achieve results close to a classifier trained with real data only. In addition, methods and metrics for measuring closeness between synthetic data and the real one have been explored as a side effect of the main research questions of this study.

Title: Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

Authors: Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18462
Pdf URL: https://arxiv.org/pdf/2411.18462
Copy Paste: [[2411.18462]] Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding(https://arxiv.org/abs/2411.18462)
Keywords: large language model
Abstract: Speculative Decoding (SD) has become an important technique in accelerating the inference speed of large language models. Conventional SD methods employ a fixed draft length, which ignores the token generation difficulty across tasks. Consequently, in this paper, we address such an issue and introduce SVIP - a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound of draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the lengths of draft sequences based on the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to 20% walltime speedup on SpecBench over baseline SD methods and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is totally training-free and compatible with any existing SD methods that generate draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvement on top of GliDe & CaPE and EAGLE-2.
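
As a rough illustration of an entropy-driven draft length, the sketch below stops drafting once a draft token distribution becomes too uncertain. It is a simplification under stated assumptions: the abstract derives its rule from a lower bound on the acceptance rate, whereas the plain entropy threshold used here, along with the names `draft_length`, `entropy_threshold` and `max_len`, is hypothetical.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of one draft token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # skip zero-probability entries to avoid log(0)
    return float(-(p * np.log(p)).sum())

def draft_length(draft_distributions, entropy_threshold=1.0, max_len=8):
    """Number of draft tokens to keep before handing over to the target model.

    entropy_threshold and max_len are hypothetical knobs, not values from SVIP.
    """
    kept = 0
    for probs in draft_distributions[:max_len]:
        if token_entropy(probs) > entropy_threshold:
            break  # the draft model is uncertain here, so stop drafting
        kept += 1
    return kept

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
steps = [confident, confident, uncertain, confident]
n = draft_length(steps)
```

With these inputs the draft is cut before the third position, whose uniform distribution has entropy ln 4, above the assumed threshold.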

Title: Weakly Supervised Framework Considering Multi-temporal Information for Large-scale Cropland Mapping with Satellite Imagery

Authors: Yuze Wang, Aoran Hu, Ji Qi, Yang Liu, Chao Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18475
Pdf URL: https://arxiv.org/pdf/2411.18475
Copy Paste: [[2411.18475]] Weakly Supervised Framework Considering Multi-temporal Information for Large-scale Cropland Mapping with Satellite Imagery(https://arxiv.org/abs/2411.18475)
Keywords: robust, extraction
Abstract: Accurately mapping large-scale cropland is crucial for agricultural production management and planning. Currently, the combination of remote sensing data and deep learning techniques has shown outstanding performance in cropland mapping. However, those approaches require massive precise labels, which are labor-intensive. To reduce the label cost, this study presented a weakly supervised framework considering multi-temporal information for large-scale cropland mapping. Specifically, we extract high-quality labels according to their consistency among global land cover (GLC) products to construct the supervised learning signal. On the one hand, to alleviate the overfitting problem caused by the model's over-trust of remaining errors in high-quality labels, we encode the similarity/aggregation of cropland in the visual/spatial domain to construct the unsupervised learning signal, and take it as the regularization term to constrain the supervised part. On the other hand, to sufficiently leverage the plentiful information in the samples without high-quality labels, we also incorporate the unsupervised learning signal in these samples, enriching the diversity of the feature space. After that, to capture the phenological features of croplands, we introduce dense satellite image time series (SITS) to extend the proposed framework in the temporal dimension. We also visualized the high dimensional phenological features to uncover how multi-temporal information benefits cropland extraction, and assessed the method's robustness under conditions of data scarcity. The proposed framework has been experimentally validated for strong adaptability across three study areas (Hunan Province, Southeast France, and Kansas) in large-scale cropland mapping, and the internal mechanism and temporal generalizability are also investigated.

Title: Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, Jianhua Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18478
Pdf URL: https://arxiv.org/pdf/2411.18478
Copy Paste: [[2411.18478]] Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS(https://arxiv.org/abs/2411.18478)
Keywords: large language model
Abstract: In-context Learning (ICL) enables large language models (LLMs) to tackle downstream tasks through sophisticated prompting and high-quality demonstrations. However, this traditional ICL paradigm shows limitations when facing complex mathematical reasoning tasks, primarily due to its heavy dependence on example quality and the necessity for human intervention in challenging scenarios. To address these limitations, this paper presents HiAR-ICL, a High-level Automated Reasoning paradigm in ICL that shifts focus from specific examples to abstract thinking patterns, extending the conventional concept of context in ICL. HiAR-ICL introduces five atomic reasoning actions as fundamental components for constructing chain-structured patterns. Using Monte Carlo Tree Search, we explore reasoning paths and construct thought cards to guide subsequent inference. We then develop a cognitive complexity framework that dynamically matches problems with appropriate thought cards. Experimental results demonstrate HiAR-ICL's effectiveness, achieving state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%).

Title: SoK: Watermarking for AI-Generated Content

Authors: Xuandong Zhao, Sam Gunn, Miranda Christ, Jaiden Fairoze, Andres Fabrega, Nicholas Carlini, Sanjam Garg, Sanghyun Hong, Milad Nasr, Florian Tramer, Somesh Jha, Lei Li, Yu-Xiang Wang, Dawn Song
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18479
Pdf URL: https://arxiv.org/pdf/2411.18479
Copy Paste: [[2411.18479]] SoK: Watermarking for AI-Generated Content(https://arxiv.org/abs/2411.18479)
Keywords: attack, robust, watermark, generative
Abstract: As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not a silver bullet for addressing all risks associated with GenAI, it can play a crucial role in enhancing AI safety and trustworthiness by combating misinformation and deception. This paper presents a comprehensive overview of watermarking techniques for GenAI, beginning with the need for watermarking from historical and regulatory perspectives. We formalize the definitions and desired properties of watermarking schemes and examine the key objectives and threat models for existing approaches. Practical evaluation strategies are also explored, providing insights into the development of robust watermarking techniques capable of resisting various attacks. Additionally, we review recent representative works, highlight open challenges, and discuss potential directions for this emerging field. By offering a thorough understanding of watermarking in GenAI, this work aims to guide researchers in advancing watermarking methods and applications, and support policymakers in addressing the broader implications of GenAI.

Title: GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Authors: Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18499
Pdf URL: https://arxiv.org/pdf/2411.18499
Copy Paste: [[2411.18499]] GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation(https://arxiv.org/abs/2411.18499)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guide, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The OpenING is open-sourced at this https URL.

Title: LLM-ABBA: Understand time series via symbolic approximation

Authors: Erin Carson, Xinye Chen, Cheng Kang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18506
Pdf URL: https://arxiv.org/pdf/2411.18506
Copy Paste: [[2411.18506]] LLM-ABBA: Understand time series via symbolic approximation(https://arxiv.org/abs/2411.18506)
Keywords: large language model
Abstract: The success of large language models (LLMs) for time series has been demonstrated in previous work. Utilizing a symbolic time series representation, one can efficiently bridge the gap between LLMs and time series. However, the remaining challenge is to exploit the semantic information hidden in time series by using symbols or existing tokens of LLMs, while aligning the embedding space of LLMs according to the hidden information of time series. The symbolic time series approximation (STSA) method called adaptive Brownian bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in preserving salient time series features by modeling time series patterns in terms of amplitude and period while using existing tokens of LLMs. In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA into large language models for various downstream time series tasks. By symbolizing time series, LLM-ABBA compares favorably to the recent state-of-the-art (SOTA) in UCR and three medical time series classification tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to avoid obvious drifting during prediction tasks by significantly mitigating the effects of cumulative error arising from misused symbols during the transition from symbols to numerical values. In time series regression tasks, LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER) benchmarks. LLM-ABBA also shows competitive prediction capability compared to recent SOTA time series prediction results. We believe this framework can also seamlessly extend to other time series tasks.

Title: Enhancing weed detection performance by means of GenAI-based image augmentation

Authors: Sourav Modak, Anthony Stein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18513
Pdf URL: https://arxiv.org/pdf/2411.18513
Copy Paste: [[2411.18513]] Enhancing weed detection performance by means of GenAI-based image augmentation(https://arxiv.org/abs/2411.18513)
Keywords: robust, diffusion, generative
Abstract: Precise weed management is essential for sustaining crop productivity and ecological balance. Traditional herbicide applications face economic and environmental challenges, emphasizing the need for intelligent weed control systems powered by deep learning. These systems require vast amounts of high-quality training data. The reality of scarcity of well-annotated training data, however, is often addressed through generating more data using data augmentation. Nevertheless, conventional augmentation techniques such as random flipping, color changes, and blurring lack sufficient fidelity and diversity. This paper investigates a generative AI-based augmentation technique that uses the Stable Diffusion model to produce diverse synthetic images that improve the quantity and quality of training datasets for weed detection models. Moreover, this paper explores the impact of these synthetic images on the performance of real-time detection systems, thus focusing on compact CNN-based models such as YOLO nano for edge devices. The experimental results show substantial improvements in mean Average Precision (mAP50 and mAP50-95) scores for YOLO models trained with generative AI-augmented datasets, demonstrating the promising potential of synthetic data to enhance model robustness and accuracy.

Title: Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models

Authors: Minhyeok Lee
Subjects: cs.CL, math.MG
Abstract URL: https://arxiv.org/abs/2411.18530
Pdf URL: https://arxiv.org/pdf/2411.18530
Copy Paste: [[2411.18530]] Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models(https://arxiv.org/abs/2411.18530)
Keywords: generative, large language model
Abstract: This paper introduces a mathematical framework for defining and quantifying self-identity in artificial intelligence (AI) systems, addressing a critical gap in the theoretical foundations of artificial consciousness. While existing approaches to artificial self-awareness often rely on heuristic implementations or philosophical abstractions, we present a formal framework grounded in metric space theory, measure theory, and functional analysis. Our framework posits that self-identity emerges from two mathematically quantifiable conditions: the existence of a connected continuum of memories $C \subseteq \mathcal{M}$ in a metric space $(\mathcal{M}, d_{\mathcal{M}})$, and a continuous mapping $I: \mathcal{M} \to \mathcal{S}$ that maintains consistent self-recognition across this continuum, where $(\mathcal{S}, d_{\mathcal{S}})$ represents the metric space of possible self-identities. To validate this theoretical framework, we conducted empirical experiments using the Llama 3.2 1B model, employing Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model was trained on a synthetic dataset containing temporally structured memories, designed to capture the complexity of coherent self-identity formation. Our evaluation metrics included quantitative measures of self-awareness, response consistency, and linguistic precision. The experimental results demonstrate substantial improvements in measurable self-awareness metrics, with the primary self-awareness score increasing from 0.276 to 0.801. This enables the structured creation of AI systems with validated self-identity features. The implications of our study are immediately relevant to the fields of humanoid robotics and autonomous systems.

Title: AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans

Authors: Dillon Loh, Tomasz Bednarz, Xinxing Xia, Frank Guan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.18539
Pdf URL: https://arxiv.org/pdf/2411.18539
Copy Paste: [[2411.18539]] AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans(https://arxiv.org/abs/2411.18539)
Keywords: fair
Abstract: Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimic the real world. To support exploration of this task, we also present the AdaVLN simulator and AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.

Title: FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Authors: Haosen Yang, Adrian Bulat, Isma Hadji, Hai X. Pham, Xiatian Zhu, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18552
Pdf URL: https://arxiv.org/pdf/2411.18552
Copy Paste: [[2411.18552]] FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion(https://arxiv.org/abs/2411.18552)
Keywords: diffusion
Abstract: Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM Diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.

Title: Retrofitting (Large) Language Models with Dynamic Tokenization

Authors: Darius Feher, Benjamin Minixhofer, Ivan Vulić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.18553
Pdf URL: https://arxiv.org/pdf/2411.18553
Copy Paste: [[2411.18553]] Retrofitting (Large) Language Models with Dynamic Tokenization(https://arxiv.org/abs/2411.18553)
Keywords: fair
Abstract: Current language models (LMs) use a fixed, static subword tokenizer. This choice, often taken for granted, typically results in degraded efficiency and capabilities in languages other than English, and makes it challenging to apply LMs to new domains or languages. To address these issues, we propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text. For encoder-style models, we introduce a subword-merging algorithm inspired by byte-pair encoding (BPE), but at a batch level. We merge frequent subword sequences in a batch, then apply a pretrained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. When applied with word-level boundaries, this on average reduces token sequence lengths by >20% across 14 languages on XNLI with XLM-R while degrading its task performance by less than 2%. For decoder-style models, we apply dynamic tokenization in two ways: 1) for prefilling, maintaining performance of Mistral-7B almost completely with up to 40% sequence reduction - relative to the word-level; and 2) via an approximate nearest neighbor index, achieving fast generation with a one million token vocabulary, demonstrating scalability to even larger, dynamic vocabularies. Overall, our findings show that dynamic tokenization substantially improves inference speed and promotes fairness across languages, making a leap towards overcoming the limitations of static tokenization and enabling more equitable and adaptable LMs.
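
The batch-level, BPE-inspired merging mentioned for encoder-style models can be sketched as a single merge step whose pair counts come from the whole batch rather than from a training corpus. This is only an illustration: the function name `merge_most_frequent_pair` is hypothetical, and the sketch ignores both the word-level boundaries and the embedding-prediction hypernetwork that the abstract relies on.

```python
from collections import Counter

def merge_most_frequent_pair(batch):
    """Apply one BPE-style merge using pair counts from the whole batch.

    batch: list of token sequences (lists of str).
    Returns the new batch and the merged pair (or None if nothing to merge).
    """
    counts = Counter()
    for seq in batch:
        counts.update(zip(seq, seq[1:]))  # count adjacent token pairs
    if not counts:
        return batch, None
    pair, _ = counts.most_common(1)[0]
    merged_batch = []
    for seq in batch:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])  # fuse the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged_batch.append(out)
    return merged_batch, pair

batch = [["to", "ken", "iz", "er"], ["to", "ken", "s"], ["bro", "ken"]]
merged, pair = merge_most_frequent_pair(batch)
```

Repeating the step shortens sequences further. Each fused token would then need an embedding, which is the role the abstract assigns to the hypernetwork.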

Title: Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning

Authors: Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, Raviraj Joshi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18571
Pdf URL: https://arxiv.org/pdf/2411.18571
Copy Paste: [[2411.18571]] Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning(https://arxiv.org/abs/2411.18571)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi, a language with limited resources. Using a translated Alpaca dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation metrics often show a performance decline post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts. The observations indicate improvements in target language generation capabilities but a reduction in reasoning abilities following language adaptation. These results underscore the need for improved evaluation methodologies and the creation of high-quality native datasets to accurately assess language-specific model performance in low-resource settings.

Title: Exploring Depth Information for Detecting Manipulated Face Videos

Authors: Haoyue Wang, Sheng Li, Ji He, Zhenxing Qian, Xinpeng Zhang, Shaolin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18572
Pdf URL: https://arxiv.org/pdf/2411.18572
Copy Paste: [[2411.18572]] Exploring Depth Information for Detecting Manipulated Face Videos(https://arxiv.org/abs/2411.18572)
Keywords: security, robust, transformer
Abstract: Face manipulation detection has been receiving a lot of attention for the reliability and security of the face images/videos. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which are shown to be promising. As one of the important face features, the face depth map, which has been shown to be effective in other areas such as face recognition or face detection, has unfortunately received little attention in the literature on face manipulation detection. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information for robust face manipulation detection. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomaly created due to manipulation. The estimated face depth map is then considered as auxiliary information to be integrated with the backbone features using a newly designed Multi-head Depth Attention (MDA) mechanism. We also propose an RGB-Depth Inconsistency Attention (RDIA) module to effectively capture the inter-frame inconsistency for multi-frame input. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.

Title: Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

Authors: Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof, T. Gopi Krishna
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18583
Pdf URL: https://arxiv.org/pdf/2411.18583
Copy Paste: [[2411.18583]] Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation(https://arxiv.org/abs/2411.18583)
Keywords: transformer, large language model
Abstract: This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles provides a huge challenge for manual literature review. It has resulted in an increased demand for automation. Developing a system capable of automatically generating the literature reviews from only the PDF files as input is the primary objective of this research work. The effectiveness of several Natural Language Processing (NLP) strategies, such as the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with Large Language Model (GPT-3.5-turbo), is evaluated to meet the primary objective. The SciTLDR dataset is chosen for this research experiment and three distinct techniques are utilized to implement three different systems for auto-generating the literature reviews. The ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place and spaCy is at the last position. Finally, a graphical user interface is created for the best system based on the large language model.

Title: Hierarchical Information Flow for Generalized Efficient Image Restoration

Authors: Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18588
Pdf URL: https://arxiv.org/pdf/2411.18588
Copy Paste: [[2411.18588]] Hierarchical Information Flow for Generalized Efficient Image Restoration(https://arxiv.org/abs/2411.18588)
Keywords: transformer
Abstract: While vision transformers show promise in numerous image restoration (IR) tasks, the challenge remains in efficiently generalizing and scaling up a model for multiple IR tasks. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates different types of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention, improves the computational efficiency and memory utilization, thus preparing it for effective model scaling. Based on that, we explore model scaling to improve our method's capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.

Title: Task Arithmetic Through The Lens Of One-Shot Federated Learning

Authors: Zhixu Tao, Ian Mason, Sanjeev Kulkarni, Xavier Boix
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18607
Pdf URL: https://arxiv.org/pdf/2411.18607
Copy Paste: [[2411.18607]] Task Arithmetic Through The Lens Of One-Shot Federated Learning(https://arxiv.org/abs/2411.18607)
Keywords: federate
Abstract: Task Arithmetic is a model merging technique that enables the combination of multiple models' capabilities into a single model through simple arithmetic in the weight space, without the need for additional fine-tuning or access to the original training data. However, the factors that determine the success of Task Arithmetic remain unclear. In this paper, we examine Task Arithmetic for multi-task learning by framing it as a one-shot Federated Learning problem. We demonstrate that Task Arithmetic is mathematically equivalent to the commonly used algorithm in Federated Learning, called Federated Averaging (FedAvg). By leveraging well-established theoretical results from FedAvg, we identify two key factors that impact the performance of Task Arithmetic: data heterogeneity and training heterogeneity. To mitigate these challenges, we adapt several algorithms from Federated Learning to improve the effectiveness of Task Arithmetic. Our experiments demonstrate that applying these algorithms can often significantly boost performance of the merged model compared to the original Task Arithmetic approach. This work bridges Task Arithmetic and Federated Learning, offering new theoretical perspectives on Task Arithmetic and improved practical methodologies for model merging.
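
The stated equivalence can be checked numerically in a toy setting. The sketch below assumes equal client weights and a Task Arithmetic scaling coefficient of 1/T for T tasks, neither of which the abstract spells out, and it uses random vectors in place of real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = rng.standard_normal(5)  # shared pretrained weights
# One fine-tuned model per task, all starting from the same pretrained weights.
finetuned = [theta_0 + rng.standard_normal(5) for _ in range(3)]

# Task Arithmetic: add the scaled sum of task vectors to the pretrained weights.
task_vectors = [theta_t - theta_0 for theta_t in finetuned]
lam = 1.0 / len(finetuned)  # assumed scaling coefficient 1/T
merged_task_arithmetic = theta_0 + lam * np.sum(task_vectors, axis=0)

# One-shot FedAvg with equal client weights: average the fine-tuned models.
merged_fedavg = np.mean(finetuned, axis=0)
```

Under these assumptions the two merged vectors agree up to floating-point error, since the mean of the fine-tuned weights equals the pretrained weights plus the mean task vector.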

Title: Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

Authors: Cheng Tang, Zhishuai Liu, Pan Xu
Subjects: cs.LG, cs.AI, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2411.18612
Pdf URL: https://arxiv.org/pdf/2411.18612
Copy Paste: [[2411.18612]] Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization(https://arxiv.org/abs/2411.18612)
Keywords: robust
Abstract: The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set constraint with a regularization term on the value function, offering improved scalability and theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, often leading to overly conservative policies by considering transitions that are unrealistic. To address these issues, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline RL setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing these bounds depend on how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. This term is further shown to be fundamental to $d$-RRMDPs via information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.
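
As a rough illustration of replacing an uncertainty-set constraint with a regularization term, the sketch below evaluates a KL-regularized worst-case expectation over next-state values. It is a generic tabular example and not the paper's $d$-rectangular linear construction or the R2PVI algorithm. KL divergence is chosen here as one instance of an $f$-divergence, and the function name, the toy distribution, and the regularization weight `lam` are assumptions. It relies on the standard identity min_P { E_P[V] + lam * KL(P || P0) } = -lam * log E_P0[exp(-V / lam)].

```python
import math


def kl_regularized_worst_case(nominal_probs, next_values, lam):
    """Closed form of min_P { E_P[V] + lam * KL(P || P0) } for a finite next-state set."""
    shift = min(next_values)  # subtract the minimum for numerical stability
    total = sum(
        p * math.exp(-(v - shift) / lam)
        for p, v in zip(nominal_probs, next_values)
    )
    return shift - lam * math.log(total)


# Hypothetical nominal transition distribution and next-state values.
nominal_probs = [0.5, 0.5]
next_values = [0.0, 1.0]
nominal_expectation = sum(p * v for p, v in zip(nominal_probs, next_values))

for lam in (0.01, 1.0, 1e6):
    print(lam, kl_regularized_worst_case(nominal_probs, next_values, lam))
```

A small `lam` makes deviating from the nominal kernel cheap, so the value approaches the worst next-state value; a large `lam` recovers the nominal expectation.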

Title: CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Authors: Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, Aleksander Holynski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18613
Pdf URL: https://arxiv.org/pdf/2411.18613
Copy Paste: [[2411.18613]] CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models(https://arxiv.org/abs/2411.18613)
Keywords: robust, diffusion
Abstract: We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos. See our project page for results and interactive demos.

Title: Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Authors: Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18616
Pdf URL: https://arxiv.org/pdf/2411.18616
Copy Paste: [[2411.18616]] Diffusion Self-Distillation for Zero-Shot Customized Image Generation(https://arxiv.org/abs/2411.18616)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.

Title: Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data

Authors: Aoran Shen, Minghao Dai, Jiacheng Hu, Yingbin Liang, Shiru Wang, Junliang Du
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18622
Pdf URL: https://arxiv.org/pdf/2411.18622
Copy Paste: [[2411.18622]] Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data(https://arxiv.org/abs/2411.18622)
Keywords: robust, extraction
Abstract: In the 21st-century information age, with the development of big data technology, effectively extracting valuable information from massive data has become a key issue. Traditional data mining methods are inadequate when faced with large-scale, high-dimensional and complex data. Especially when labeled data is scarce, their performance is greatly limited. This study optimizes data mining algorithms by introducing semi-supervised learning methods, aiming to improve the algorithm's ability to utilize unlabeled data, thereby achieving more accurate data analysis and pattern recognition under limited labeled data conditions. Specifically, we adopt a self-training method and combine it with a convolutional neural network (CNN) for image feature extraction and classification, and continuously improve the model prediction performance through an iterative process. The experimental results demonstrate that the proposed method significantly outperforms traditional machine learning techniques such as Support Vector Machine (SVM), XGBoost, and Multi-Layer Perceptron (MLP) on the CIFAR-10 image classification dataset. Notable improvements were observed in key performance metrics, including accuracy, recall, and F1 score. Furthermore, the robustness and noise-resistance capabilities of the semi-supervised CNN model were validated through experiments under varying noise levels, confirming its practical applicability in real-world scenarios.
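
The self-training loop described in this abstract can be sketched in a few lines. To stay self-contained, the sketch swaps the paper's CNN for a nearest-centroid classifier on 1-D toy data; the confidence measure, the 0.75 threshold, and all function names are hypothetical choices, not taken from the paper. It shows only the iterative pattern: fit on labeled data, pseudo-label the confident unlabeled points, add them to the training set, and repeat.

```python
def fit_centroids(points, labels):
    """Compute one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return the nearest class and a margin-based confidence in [0.5, 1]."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    d_best = abs(x - centroids[ranked[0]])
    d_second = abs(x - centroids[ranked[1]])
    if d_best + d_second == 0:
        return ranked[0], 0.5
    return ranked[0], 1.0 - d_best / (d_best + d_second)


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.75, max_rounds=10):
    """Iteratively absorb confidently pseudo-labeled points into the training set."""
    train_x, train_y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(train_x, train_y)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                train_x.append(x)
                train_y.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return fit_centroids(train_x, train_y), pool


# Two labeled points, five unlabeled points; 5.2 is ambiguous and stays unlabeled.
centroids, leftover = self_train(
    [0.0, 10.0], ["a", "b"], [1.0, 2.0, 8.5, 9.0, 5.2]
)
print(centroids, leftover)
```

The threshold controls the usual trade-off in self-training: a lower value absorbs more unlabeled data but risks reinforcing wrong pseudo-labels.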

Title: Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation

Authors: Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilue Wang, Longzan Luo, Lily Lee, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18623
Pdf URL: https://arxiv.org/pdf/2411.18623
Copy Paste: [[2411.18623]] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation(https://arxiv.org/abs/2411.18623)
Keywords: robust, extraction
Abstract: 3D geometric information is essential for manipulation tasks, as robots need to perceive the 3D environment, reason about spatial relationships, and interact with intricate spatial configurations. Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. Specifically, we first design a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, enhancing the 2D foundation model's implicit 3D robotic representation. After self-supervised fine-tuning, we introduce a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on the mapping, Lift3D utilizes the 2D foundation model to directly encode point cloud data, leveraging large-scale pretrained knowledge to construct explicit 3D robotic representations while minimizing spatial information loss. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.

Title: GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Authors: Wentao Wang, Hang Ye, Fangzhou Hong, Xue Yang, Jianfu Zhang, Yizhou Wang, Ziwei Liu, Liang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18624
Pdf URL: https://arxiv.org/pdf/2411.18624
Copy Paste: [[2411.18624]] GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data(https://arxiv.org/abs/2411.18624)
Keywords: diffusion
Abstract: Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN's 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry given a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN shows much better generalizability on in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.