2024-12-13

Title: A feature refinement module for light-weight semantic segmentation network

Authors: Zhiyan Wang, Xin Guo, Song Wang, Peixiao Zheng, Lin Qi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08670
Pdf URL: https://arxiv.org/pdf/2412.08670
Copy Paste: [[2412.08670]] A feature refinement module for light-weight semantic segmentation network(https://arxiv.org/abs/2412.08670)
Keywords: transformer, segmentation
Abstract: Low computational complexity and high segmentation accuracy are both essential to the real-world semantic segmentation tasks. However, to speed up the model inference, most existing approaches tend to design light-weight networks with a very limited number of parameters, leading to a considerable degradation in accuracy due to the decrease of the representation ability of the networks. To solve the problem, this paper proposes a novel semantic segmentation method to improve the capacity of obtaining semantic information for the light-weight network. Specifically, a feature refinement module (FRM) is proposed to extract semantics from multi-stage feature maps generated by the backbone and capture non-local contextual information by utilizing a transformer block. On Cityscapes and Bdd100K datasets, the experimental results demonstrate that the proposed method achieves a promising trade-off between accuracy and computational cost, especially for Cityscapes test set where 80.4% mIoU is achieved and only 214.82 GFLOPs are required.

Title: A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

Authors: Zhiyan Wang, Deyin Liu, Lin Yuanbo Wu, Song Wang, Xin Guo, Lin Qi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08671
Pdf URL: https://arxiv.org/pdf/2412.08671
Copy Paste: [[2412.08671]] A Deep Semantic Segmentation Network with Semantic and Contextual Refinements(https://arxiv.org/abs/2412.08671)
Keywords: segmentation
Abstract: Semantic segmentation is a fundamental task in multimedia processing, which can be used for analyzing, understanding, editing contents of images and videos, among others. To accelerate the analysis of multimedia data, existing segmentation researches tend to extract semantic information by progressively reducing the spatial resolutions of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets-Cityscapes, Bdd100K, and ADE20K-demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.

Title: Distinguishing Scams and Fraud with Ensemble Learning

Authors: Isha Chadalavada, Tianhui Huang, Jessica Staddon
Subjects: cs.CR, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08680
Pdf URL: https://arxiv.org/pdf/2412.08680
Copy Paste: [[2412.08680]] Distinguishing Scams and Fraud with Ensemble Learning(https://arxiv.org/abs/2412.08680)
Keywords: protect, defense
Abstract: Users increasingly query LLM-enabled web chatbots for help with scam defense. The Consumer Financial Protection Bureau's complaints database is a rich data source for evaluating LLM performance on user scam queries, but currently the corpus does not distinguish between scam and non-scam fraud. We developed an LLM ensemble approach to distinguishing scam and fraud CFPB complaints and describe initial findings regarding the strengths and weaknesses of LLMs in the scam defense context.

Title: LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Authors: Alexander Pan, Lijie Chen, Jacob Steinhardt
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08686
Pdf URL: https://arxiv.org/pdf/2412.08686
Copy Paste: [[2412.08686]] LatentQA: Teaching LLMs to Decode Activations Into Natural Language(https://arxiv.org/abs/2412.08686)
Keywords: interpretability
Abstract: Interpretability methods seek to understand language model representations, yet the outputs of most such methods -- circuits, vectors, scalars -- are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.

Title: From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields

Authors: Miltiadis Kofinas, Samuele Papa, Efstratios Gavves
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.08731
Pdf URL: https://arxiv.org/pdf/2412.08731
Copy Paste: [[2412.08731]] From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields(https://arxiv.org/abs/2412.08731)
Keywords: segmentation
Abstract: Neural fields (NeFs) have recently emerged as a state-of-the-art method for encoding spatio-temporal signals of various modalities. Despite the success of NeFs in reconstructing individual signals, their use as representations in downstream tasks, such as classification or segmentation, is hindered by the complexity of the parameter space and its underlying symmetries, in addition to the lack of powerful and scalable conditioning mechanisms. In this work, we draw inspiration from the principles of connectionism to design a new architecture based on MLPs, which we term NeoMLP. We start from an MLP, viewed as a graph, and transform it from a multi-partite graph to a complete graph of input, hidden, and output nodes, equipped with high-dimensional features. We perform message passing on this graph and employ weight-sharing via self-attention among all the nodes. NeoMLP has a built-in mechanism for conditioning through the hidden and output nodes, which function as a set of latent codes, and as such, NeoMLP can be used straightforwardly as a conditional neural field. We demonstrate the effectiveness of our method by fitting high-resolution signals, including multi-modal audio-visual data. Furthermore, we fit datasets of neural representations, by learning instance-specific sets of latent codes using a single backbone architecture, and then use them for downstream tasks, outperforming recent state-of-the-art methods. The source code is open-sourced at this https URL.

Title: Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Authors: Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08737
Pdf URL: https://arxiv.org/pdf/2412.08737
Copy Paste: [[2412.08737]] Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions(https://arxiv.org/abs/2412.08737)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although purely trained on synthetic multimodal data, Euclid shows strong generalization ability to novel geometry shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.

Title: In-Context Learning with Topological Information for Knowledge Graph Completion

Authors: Udari Madhushani Sehwag, Kassiani Papasotiriou, Jared Vann, Sumitra Ganesh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08742
Pdf URL: https://arxiv.org/pdf/2412.08742
Copy Paste: [[2412.08742]] In-Context Learning with Topological Information for Knowledge Graph Completion(https://arxiv.org/abs/2412.08742)
Keywords: generative, large language model
Abstract: Knowledge graphs (KGs) are crucial for representing and reasoning over structured information, supporting a wide range of applications such as information retrieval, question answering, and decision-making. However, their effectiveness is often hindered by incompleteness, limiting their potential for real-world impact. While knowledge graph completion (KGC) has been extensively studied in the literature, recent advances in generative AI models, particularly large language models (LLMs), have introduced new opportunities for innovation. In-context learning has recently emerged as a promising approach for leveraging pretrained knowledge of LLMs across a range of natural language processing tasks and has been widely adopted in both academia and industry. However, how to utilize in-context learning for effective KGC remains relatively underexplored. We develop a novel method that incorporates topological information through in-context learning to enhance KGC performance. By integrating ontological knowledge and graph structure into the context of LLMs, our approach achieves strong performance in the transductive setting i.e., nodes in the test graph dataset are present in the training graph dataset. Furthermore, we apply our approach to KGC in the more challenging inductive setting, i.e., nodes in the training graph dataset and test graph dataset are disjoint, leveraging the ontology to infer useful information about missing nodes which serve as contextual cues for the LLM during inference. Our method demonstrates superior performance compared to baselines on the ILPC-small and ILPC-large datasets.

Title: Proactive Adversarial Defense: Harnessing Prompt Tuning in Vision-Language Models to Detect Unseen Backdoored Images

Authors: Kyle Stein, Andrew Arash Mahyari, Guillermo Francia, Eman El-Sheikh
Subjects: cs.CV, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08755
Pdf URL: https://arxiv.org/pdf/2412.08755
Copy Paste: [[2412.08755]] Proactive Adversarial Defense: Harnessing Prompt Tuning in Vision-Language Models to Detect Unseen Backdoored Images(https://arxiv.org/abs/2412.08755)
Keywords: defense, attack
Abstract: Backdoor attacks pose a critical threat by embedding hidden triggers into inputs, causing models to misclassify them into target labels. While extensive research has focused on mitigating these attacks in object recognition models through weight fine-tuning, much less attention has been given to detecting backdoored samples directly. Given the vast datasets used in training, manual inspection for backdoor triggers is impractical, and even state-of-the-art defense mechanisms fail to fully neutralize their impact. To address this gap, we introduce a groundbreaking method to detect unseen backdoored images during both training and inference. Leveraging the transformative success of prompt tuning in Vision Language Models (VLMs), our approach trains learnable text prompts to differentiate clean images from those with hidden backdoor triggers. Experiments demonstrate the exceptional efficacy of this method, achieving an impressive average accuracy of 86% across two renowned datasets for detecting unseen backdoor triggers, establishing a new standard in backdoor defense.

Title: Integrating Optimization Theory with Deep Learning for Wireless Network Design

Authors: Sinem Coleri, Aysun Gurur Onalan, Marco di Renzo
Subjects: cs.LG, cs.AI, cs.NI, eess.SY
Abstract URL: https://arxiv.org/abs/2412.08761
Pdf URL: https://arxiv.org/pdf/2412.08761
Copy Paste: [[2412.08761]] Integrating Optimization Theory with Deep Learning for Wireless Network Design(https://arxiv.org/abs/2412.08761)
Keywords: interpretability
Abstract: Traditional wireless network design relies on optimization algorithms derived from domain-specific mathematical models, which are often inefficient and unsuitable for dynamic, real-time applications due to high complexity. Deep learning has emerged as a promising alternative to overcome complexity and adaptability concerns, but it faces challenges such as accuracy issues, delays, and limited interpretability due to its inherent black-box nature. This paper introduces a novel approach that integrates optimization theory with deep learning methodologies to address these issues. The methodology starts by constructing the block diagram of the optimization theory-based solution, identifying key building blocks corresponding to optimality conditions and iterative solutions. Selected building blocks are then replaced with deep neural networks, enhancing the adaptability and interpretability of the system. Extensive simulations show that this hybrid approach not only reduces runtime compared to optimization theory based approaches but also significantly improves accuracy and convergence rates, outperforming pure deep learning models.

Title: Beyond Knowledge Silos: Task Fingerprinting for Democratization of Medical Imaging AI

Authors: Patrick Godau, Akriti Srivastava, Tim Adler, Lena Maier-Hein
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08763
Pdf URL: https://arxiv.org/pdf/2412.08763
Copy Paste: [[2412.08763]] Beyond Knowledge Silos: Task Fingerprinting for Democratization of Medical Imaging AI(https://arxiv.org/abs/2412.08763)
Keywords: secure, privacy
Abstract: The field of medical imaging AI is currently undergoing rapid transformations, with methodical research increasingly translated into clinical practice. Despite these successes, research suffers from knowledge silos, hindering collaboration and progress: Existing knowledge is scattered across publications and many details remain unpublished, while privacy regulations restrict data sharing. In the spirit of democratizing of AI, we propose a framework for secure knowledge transfer in the field of medical image analysis. The key to our approach is dataset "fingerprints", structured representations of feature distributions, that enable quantification of task similarity. We tested our approach across 71 distinct tasks and 12 medical imaging modalities by transferring neural architectures, pretraining, augmentation policies, and multi-task learning. According to comprehensive analyses, our method outperforms traditional methods for identifying relevant knowledge and facilitates collaborative model training. Our framework fosters the democratization of AI in medical imaging and could become a valuable tool for promoting faster scientific advancement.

Title: Security Properties for Open-Source Hardware Designs

Authors: Jayden Rogers, Niyaz Shakeel, Divya Mankani, Samantha Espinosa, Cade Chabra, Kaki Ryan, Cynthia Sturton
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2412.08769
Pdf URL: https://arxiv.org/pdf/2412.08769
Copy Paste: [[2412.08769]] Security Properties for Open-Source Hardware Designs(https://arxiv.org/abs/2412.08769)
Keywords: security
Abstract: The hardware security community relies on databases of known vulnerabilities and open-source designs to develop formal verification methods for identifying hardware security flaws. While there are plenty of open-source designs and verification tools, there is a gap in open-source properties addressing these flaws, making it difficult to reproduce prior work and slowing research. This paper aims to bridge that gap. We provide SystemVerilog Assertions for four common designs: OR1200, Hack@DAC 2018's buggy PULPissimo SoC, Hack@DAC 2019's CVA6, and Hack@DAC 2021's buggy OpenPiton SoCs. The properties are organized by design and tagged with details about the security flaws and the implicated CWE. To encourage more property reporting, we describe the methodology we use when crafting properties.

Title: LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information

Authors: Ke Wang, Hong Xuan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08771
Pdf URL: https://arxiv.org/pdf/2412.08771
Copy Paste: [[2412.08771]] LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information(https://arxiv.org/abs/2412.08771)
Keywords: large language model
Abstract: Multi-modal large language models (MLLMs) utilizing instruction-following data, such as LLaVA, have achieved great progress in the industry. A major limitation in these models is that visual tokens consume a substantial portion of the maximum token limit in large language models (LLMs), leading to increased computational demands and decreased performance when prompts include multiple images or videos. Industry solutions often mitigate this issue by increasing computational power, but this approach is less feasible in academic environments with limited resources. In this study, we propose Dynamic Feature Map Reduction (DFMR) based on LLaVA-1.5 to address the challenge of visual token overload. DFMR dynamically compresses the visual tokens, freeing up token capacity. Our experimental results demonstrate that integrating DFMR into LLaVA-1.5 significantly improves the performance of LLaVA in varied visual token lengths, offering a promising solution for extending LLaVA to handle multi-image and video scenarios in resource-constrained academic environments and it can also be applied in industry settings for data augmentation to help mitigate the scarcity of open-domain image-text pair datasets in the continued pretraining stage.

Title: ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder

Authors: Jungho Kim, Changwon Kang, Dongyoung Lee, Sehwan Choi, Jun Won Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08774
Pdf URL: https://arxiv.org/pdf/2412.08774
Copy Paste: [[2412.08774]] ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder(https://arxiv.org/abs/2412.08774)
Keywords: robust, transformer
Abstract: In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels through a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales through a dual branch structure. This design enhances both performance and computational efficiency by providing a large receptive field for the BEV representation while maintaining a smaller receptive field for the voxel representation. The PQD introduces Prototype Queries to accelerate the decoding process. Scene-Adaptive Prototypes are derived from the 3D voxel features of input sample, while Scene-Agnostic Prototypes are computed by applying Scene-Adaptive Prototypes to an Exponential Moving Average during the training phase. By using these prototype-based queries for decoding, we can directly predict 3D occupancy in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose the Robust Prototype Learning, which injects noise into prototype generation process and trains the model to denoise during the training phase. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. For single-frame method, it reaches 39.56% mIoU with an inference speed of 12.83 FPS on an NVIDIA RTX 3090. Our code can be found at this https URL.

Title: Bayesian optimized deep ensemble for uncertainty quantification of deep neural networks: a system safety case study on sodium fast reactor thermal stratification modeling

Authors: Zaid Abulawi, Rui Hu, Prasanna Balaprakash, Yang Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.08776
Pdf URL: https://arxiv.org/pdf/2412.08776
Copy Paste: [[2412.08776]] Bayesian optimized deep ensemble for uncertainty quantification of deep neural networks: a system safety case study on sodium fast reactor thermal stratification modeling(https://arxiv.org/abs/2412.08776)
Keywords: robust
Abstract: Accurate predictions and uncertainty quantification (UQ) are essential for decision-making in risk-sensitive fields such as system safety modeling. Deep ensembles (DEs) are efficient and scalable methods for UQ in Deep Neural Networks (DNNs); however, their performance is limited when constructed by simply retraining the same DNN multiple times with randomly sampled initializations. To overcome this limitation, we propose a novel method that combines Bayesian optimization (BO) with DE, referred to as BODE, to enhance both predictive accuracy and UQ. We apply BODE to a case study involving a Densely connected Convolutional Neural Network (DCNN) trained on computational fluid dynamics (CFD) data to predict eddy viscosity in sodium fast reactor thermal stratification modeling. Compared to a manually tuned baseline ensemble, BODE estimates total uncertainty approximately four times lower in a noise-free environment, primarily due to the baseline's overestimation of aleatoric uncertainty. Specifically, BODE estimates aleatoric uncertainty close to zero, while aleatoric uncertainty dominates the total uncertainty in the baseline ensemble. We also observe a reduction of more than 30% in epistemic uncertainty. When Gaussian noise with standard deviations of 5% and 10% is introduced into the data, BODE accurately fits the data and estimates uncertainty that aligns with the data noise. These results demonstrate that BODE effectively reduces uncertainty and enhances predictions in data-driven models, making it a flexible approach for various applications requiring accurate predictions and robust UQ.

Title: Reward-based Blockchain Infrastructure for 3D IC Supply Chain Provenance

Authors: Sulyab Thottungal Valapu, Aritri Saha, Bhaskar Krishnamachari, Vivek Menon, Ujjwal Guin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.08777
Pdf URL: https://arxiv.org/pdf/2412.08777
Copy Paste: [[2412.08777]] Reward-based Blockchain Infrastructure for 3D IC Supply Chain Provenance(https://arxiv.org/abs/2412.08777)
Keywords: robust
Abstract: In response to the growing demand for enhanced performance and power efficiency, the semiconductor industry has witnessed a paradigm shift toward heterogeneous integration, giving rise to 2.5D/3D chips. These chips incorporate diverse chiplets, manufactured globally and integrated into a single chip. Securing these complex 2.5D/3D integrated circuits (ICs) presents a formidable challenge due to inherent trust issues within the semiconductor supply chain. Chiplets produced in untrusted locations may be susceptible to tampering, introducing malicious circuits that could compromise sensitive information. This paper introduces an innovative approach that leverages blockchain technology to establish traceability for ICs and chiplets throughout the supply chain. Given that chiplet manufacturers are dispersed globally and may operate within different blockchain consortiums, ensuring the integrity of data within each blockchain ledger becomes imperative. To address this, we propose a novel dual-layer approach for establishing distributed trust across diverse blockchain ledgers. The lower layer comprises of a blockchain-based framework for IC supply chain provenance that enables transactions between blockchain instances run by different consortiums, making it possible to trace the complete provenance DAG of each IC. The upper layer implements a multi-chain reputation scheme that assigns reputation scores to entities while specifically accounting for high-risk transactions that cross blockchain trust zones. This approach enhances the credibility of the blockchain data, mitigating potential risks associated with the use of multiple consortiums and ensuring a robust foundation for securing 2.5D/3D ICs in the evolving landscape of heterogeneous integration.

Title: Generative Modeling with Explicit Memory

Authors: Yi Tang, Peng Sun, Zhenglin Cheng, Tao Lin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08781
Pdf URL: https://arxiv.org/pdf/2412.08781
Copy Paste: [[2412.08781]] Generative Modeling with Explicit Memory(https://arxiv.org/abs/2412.08781)
Keywords: diffusion, generative
Abstract: Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn become the primary bottleneck in both training and inference of diffusion models. To this end, we introduce \textbf{G}enerative \textbf{M}odeling with \textbf{E}xplicit \textbf{M}emory (GMem), leveraging an external memory bank in both training and sampling phases of diffusion models. This approach preserves semantic information from data distributions, reducing reliance on neural network capacity for learning and generalizing across diverse datasets. The results are significant: our GMem enhances both training, sampling efficiency, and generation quality. For instance, on ImageNet at $256 \times 256$ resolution, GMem accelerates SiT training by over $46.7\times$, achieving the performance of a SiT model trained for $7M$ steps in fewer than $150K$ steps. Compared to the most efficient existing method, REPA, GMem still offers a $16\times$ speedup, attaining an FID score of 5.75 within $250K$ steps, whereas REPA requires over $4M$ steps. Additionally, our method achieves state-of-the-art generation quality, with an FID score of {3.56} without classifier-free guidance on ImageNet $256\times256$. Our code is available at \url{this https URL}.

Title: Coverage-based Fairness in Multi-document Summarization

Authors: Haoyuan Li, Yusen Zhang, Rui Zhang, Snigdha Chaturvedi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08795
Pdf URL: https://arxiv.org/pdf/2412.08795
Copy Paste: [[2412.08795]] Coverage-based Fairness in Multi-document Summarization(https://arxiv.org/abs/2412.08795)
Keywords: fair
Abstract: Fairness in multi-document summarization (MDS) measures whether a system can generate a summary fairly representing information from documents with different social attribute values. Fairness in MDS is crucial since a fair summary can offer readers a comprehensive view. Previous works focus on quantifying summary-level fairness using Proportional Representation, a fairness measure based on Statistical Parity. However, Proportional Representation does not consider redundancy in input documents and overlooks corpus-level unfairness. In this work, we propose a new summary-level fairness measure, Equal Coverage, which is based on coverage of documents with different social attribute values and considers the redundancy within documents. To detect the corpus-level unfairness, we propose a new corpus-level measure, Coverage Parity. Our human evaluations show that our measures align more with our definition of fairness. Using our measures, we evaluate the fairness of thirteen different LLMs. We find that Claude3-sonnet is the fairest among all evaluated LLMs. We also find that almost all LLMs overrepresent different social attribute values.

Title: HARP: A challenging human-annotated math reasoning benchmark

Authors: Albert S. Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, Aaditya K. Singh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08819
Pdf URL: https://arxiv.org/pdf/2412.08819
Copy Paste: [[2412.08819]] HARP: A challenging human-annotated math reasoning benchmark(https://arxiv.org/abs/2412.08819)
Keywords: large language model
Abstract: Math reasoning is becoming an ever increasing area of focus as we scale large language models. However, even the previously-toughest evals like MATH are now close to saturated by frontier models (90.0% for o1-mini and 86.5% for Gemini 1.5 Pro). We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy). These problems range six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written, ground-truth solutions per problem, offering new avenues of research that we explore briefly. We report evaluations for many frontier models and share some interesting analyses, such as demonstrating that frontier models across families intrinsically scale their inference-time compute for more difficult problems. Finally, we open source all code used for dataset construction (including scraping) and all code for evaluation (including answer checking) to enable future research at: this https URL.

Title: Large Concept Models: Language Modeling in a Sentence Representation Space

Authors: The LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08821
Pdf URL: https://arxiv.org/pdf/2412.08821
Copy Paste: [[2412.08821]] Large Concept Models: Language Modeling in a Sentence Representation Space(https://arxiv.org/abs/2412.08821)
Keywords: diffusion, generative
Abstract: LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.

Title: Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology

Authors: Minsang Kim, Seungjun Baek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08846
Pdf URL: https://arxiv.org/pdf/2412.08846
Copy Paste: [[2412.08846]] Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology(https://arxiv.org/abs/2412.08846)
Keywords: large language model
Abstract: Large language models (LLMs) closely interact with humans, and thus need an intimate understanding of the cultural values of human society. In this paper, we explore how open-source LLMs make judgments on diverse categories of cultural values across countries, and its relation to training methodology such as model sizes, training corpus, alignment, etc. Our analysis shows that LLMs can judge socio-cultural norms similar to humans but less so on social systems and progress. In addition, LLMs tend to judge cultural values biased toward Western culture, which can be improved with training on the multilingual corpus. We also find that increasing model size helps a better understanding of social values, but smaller models can be enhanced by using synthetic data. Our analysis reveals valuable insights into the design methodology of LLMs in connection with their understanding of cultural values.

Title: ViUniT: Visual Unit Tests for More Robust Visual Programming

Authors: Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08859
Pdf URL: https://arxiv.org/pdf/2412.08859
Copy Paste: [[2412.08859]] ViUniT: Visual Unit Tests for More Robust Visual Programming(https://arxiv.org/abs/2412.08859)
Keywords: robust
Abstract: Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.

Title: A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

Authors: Jiankang Wang, Jianjun Xu, Xiaorui Wang, Yuxin Wang, Mengting Xing, Shancheng Fang, Zhineng Chen, Hongtao Xie, Yongdong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08864
Pdf URL: https://arxiv.org/pdf/2412.08864
Copy Paste: [[2412.08864]] A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions(https://arxiv.org/abs/2412.08864)
Keywords: large language model
Abstract: Synthesizing high-quality reasoning data for continual training has been proven to be effective in enhancing the performance of Large Language Models (LLMs). However, previous synthetic approaches struggle to easily scale up data and incur high costs in the pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extracted knowledge points from seed data and constructed a knowledge point relationships graph to explore their interconnections. By exploring the implicit relationships among knowledge, our method achieves $\times$255 data expansion. Furthermore, GSDP led by open-source models, achieves synthesis quality comparable to GPT-4-0613 while maintaining $\times$100 lower costs. To tackle the most challenging mathematical reasoning task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B based on Mistral-7B achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models trained in this paper will be available.

Title: Inference-Time Diffusion Model Distillation

Authors: Geon Yeong Park, Sang Wan Lee, Jong Chul Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08871
Pdf URL: https://arxiv.org/pdf/2412.08871
Copy Paste: [[2412.08871]] Inference-Time Diffusion Model Distillation(https://arxiv.org/abs/2412.08871)
Keywords: robust, diffusion
Abstract: Diffusion distillation models effectively accelerate reverse sampling by compressing the process into fewer steps. However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. Inspired by recent advances in conditional sampling, our approach recasts student model sampling as a proximal optimization problem with a score distillation sampling loss (SDS). To this end, we integrate distillation optimization during reverse sampling, which can be viewed as teacher guidance that drives student sampling trajectory towards the clean manifold using pre-trained diffusion models. Thus, Distillation++ improves the denoising process in real-time without additional source data or fine-tuning. Distillation++ demonstrates substantial improvements over state-of-the-art distillation baselines, particularly in early sampling stages, positioning itself as a robust guided sampling process crafted for diffusion distillation models. Code: this https URL.

Title: Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model

Authors: Hans Moen, Vishnu Raj, Andrius Vabalas, Markus Perola, Samuel Kaski, Andrea Ganna, Pekka Marttinen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08873
Pdf URL: https://arxiv.org/pdf/2412.08873
Copy Paste: [[2412.08873]] Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model(https://arxiv.org/abs/2412.08873)
Keywords: transformer
Abstract: Health registers contain rich information about individuals' health histories. Here our interest lies in understanding how individuals' health trajectories evolve in a nationwide longitudinal dataset with coded features, such as clinical codes, procedures, and drug purchases. We introduce a straightforward approach for training a Transformer-based deep learning model in a way that lets us analyze how individuals' trajectories change over time. This is achieved by modifying the training objective and by applying a causal attention mask. We focus here on a general task of predicting the onset of a range of common diseases in a given future forecast interval. However, instead of providing a single prediction about diagnoses that could occur in this forecast interval, our approach enable the model to provide continuous predictions at every time point up until, and conditioned on, the time of the forecast period. We find that this model performs comparably to other models, including a bi-directional transformer model, in terms of basic prediction performance while at the same time offering promising trajectory modeling properties. We explore a couple of ways to use this model for analyzing health trajectories and aiding in early detection of events that forecast possible later disease onsets. We hypothesize that this method may be helpful in continuous monitoring of peoples' health trajectories and enabling interventions in ongoing health trajectories, as well as being useful in retrospective analyses.

Title: SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization

Authors: Kwangryeol Park, Seulki Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08894
Pdf URL: https://arxiv.org/pdf/2412.08894
Copy Paste: [[2412.08894]] SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization(https://arxiv.org/abs/2412.08894)
Keywords: transformer
Abstract: We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of the widely used adaptive learning rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of an arbitrary rank (shape) of the first and second momentum tensors during optimization, based on the proposed square-matricization and one-time single matrix factorization. From this, it becomes effectively applicable to any rank (shape) of momentum tensors, i.e., bias, matrix, and any rank-d tensors, prevalent in various deep model architectures, such as CNNs (high rank) and Transformers (low rank), in contrast to existing memory-efficient optimizers that applies only to a particular (rank-2) momentum tensor, e.g., linear layers. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC, providing a theoretical basis for its competitive optimization capability. In our experiment, SMMF takes up to 96% less memory compared to state-of-the-art memory efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance on various CNN and Transformer tasks.

Title: AI-assisted Knowledge Discovery in Biomedical Literature to Support Decision-making in Precision Oncology

Authors: Ting He, Kory Kreimeyer, Mimi Najjar, Jonathan Spiker, Maria Fatteh, Valsamo Anagnostou, Taxiarchis Botsis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08900
Pdf URL: https://arxiv.org/pdf/2412.08900
Copy Paste: [[2412.08900]] AI-assisted Knowledge Discovery in Biomedical Literature to Support Decision-making in Precision Oncology(https://arxiv.org/abs/2412.08900)
Keywords: extraction, transformer, large language model
Abstract: The delivery of appropriate targeted therapies to cancer patients requires the complete analysis of the molecular profiling of tumors and the patient's clinical characteristics in the context of existing knowledge and recent findings described in biomedical literature and several other sources. We evaluated the potential contributions of specific natural language processing solutions to support knowledge discovery from biomedical literature. Two models from the Bidirectional Encoder Representations from Transformers (BERT) family, two Large Language Models, and PubTator 3.0 were tested for their ability to support the named entity recognition (NER) and the relation extraction (RE) tasks. PubTator 3.0 and the BioBERT model performed best in the NER task (best F1-score equal to 0.93 and 0.89, respectively), while BioBERT outperformed all other solutions in the RE task (best F1-score 0.79) and a specific use case it was applied to by recognizing nearly all entity mentions and most of the relations.

Title: Federated Foundation Models on Heterogeneous Time Series

Authors: Shengchao Chen, Guodong Long, Jing Jiang, Chengqi Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08906
Pdf URL: https://arxiv.org/pdf/2412.08906
Copy Paste: [[2412.08906]] Federated Foundation Models on Heterogeneous Time Series(https://arxiv.org/abs/2412.08906)
Keywords: robust, federate, transformer
Abstract: Training a general-purpose time series foundation models with robust generalization capabilities across diverse applications from scratch is still an open challenge. Efforts are primarily focused on fusing cross-domain time series datasets to extract shared subsequences as tokens for training models on Transformer architecture. However, due to significant statistical heterogeneity across domains, this cross-domain fusing approach doesn't work effectively as the same as fusing texts and images. To tackle this challenge, this paper proposes a novel federated learning approach to address the heterogeneity in time series foundation models training, namely FFTS. Specifically, each data-holding organization is treated as an independent client in a collaborative learning framework with federated settings, and then many client-specific local models will be trained to preserve the unique characteristics per dataset. Moreover, a new regularization mechanism will be applied to both client-side and server-side, thus to align the shared knowledge across heterogeneous datasets from different domains. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed federated learning approach. The newly learned time series foundation models achieve superior generalization capabilities on cross-domain time series analysis tasks, including forecasting, imputation, and anomaly detection.

Title: Goal-Conditioned Supervised Learning for Multi-Objective Recommendation

Authors: Shijun Li, Hilaf Hasson, Jing Hu, Joydeep Ghosh
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.08911
Pdf URL: https://arxiv.org/pdf/2412.08911
Copy Paste: [[2412.08911]] Goal-Conditioned Supervised Learning for Multi-Objective Recommendation(https://arxiv.org/abs/2412.08911)
Keywords: robust
Abstract: Multi-objective learning endeavors to concurrently optimize multiple objectives using a single model, aiming to achieve high and balanced performance across these diverse objectives. However, it often involves a more complex optimization problem, particularly when navigating potential conflicts between objectives, leading to solutions with higher memory requirements and computational complexity. This paper introduces a Multi-Objective Goal-Conditioned Supervised Learning (MOGCSL) framework for automatically learning to achieve multiple objectives from offline sequential data. MOGCSL extends the conventional Goal-Conditioned Supervised Learning (GCSL) method to multi-objective scenarios by redefining goals from one-dimensional scalars to multi-dimensional vectors. The need for complex architectures and optimization constraints can be naturally eliminated. MOGCSL benefits from filtering out uninformative or noisy instances that do not achieve desirable long-term rewards. It also incorporates a novel goal-choosing algorithm to model and select "high" achievable goals for inference. While MOGCSL is quite general, we focus on its application to the next action prediction problem in commercial-grade recommender systems. In this context, any viable solution needs to be reasonably scalable and also be robust to large amounts of noisy data that is characteristic of this application space. We show that MOGCSL performs admirably on both counts. Specifically, extensive experiments conducted on real-world recommendation datasets validate its efficacy and efficiency. Also, analysis and experiments are included to explain its strength in discounting the noisier portions of training data in recommender systems.

Title: Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression

Authors: Ali Mollaahmadi Dehaghi, Reza Razavi, Mohammad Moshirpour
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.08912
Pdf URL: https://arxiv.org/pdf/2412.08912
Copy Paste: [[2412.08912]] Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression(https://arxiv.org/abs/2412.08912)
Keywords: diffusion, transformer
Abstract: In this paper, we introduce DiQP; a novel Transformer-Diffusion model for restoring 8K video quality degraded by codec compression. To the best of our knowledge, our model is the first to consider restoring the artifacts introduced by various codecs (AV1, HEVC) by Denoising Diffusion without considering additional noise. This approach allows us to model the complex, non-Gaussian nature of compression artifacts, effectively learning to reverse the degradation. Our architecture combines the power of Transformers to capture long-range dependencies with an enhanced windowed mechanism that preserves spatiotemporal context within groups of pixels across frames. To further enhance restoration, the model incorporates auxiliary "Look Ahead" and "Look Around" modules, providing both future and surrounding frame information to aid in reconstructing fine details and enhancing overall visual quality. Extensive experiments on different datasets demonstrate that our model outperforms state-of-the-art methods, particularly for high-resolution videos such as 4K and 8K, showcasing its effectiveness in restoring perceptually pleasing videos from highly compressed sources.

Title: Sensing for Space Safety and Sustainability: A Deep Learning Approach with Vision Transformers

Authors: Wenxuan Zhang, Peng Hu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08913
Pdf URL: https://arxiv.org/pdf/2412.08913
Copy Paste: [[2412.08913]] Sensing for Space Safety and Sustainability: A Deep Learning Approach with Vision Transformers(https://arxiv.org/abs/2412.08913)
Keywords: transformer
Abstract: The rapid increase of space assets represented by small satellites in low Earth orbit can enable ubiquitous digital services for everyone. However, due to the dynamic space environment, numerous space objects, complex atmospheric conditions, and unexpected events can easily introduce adverse conditions affecting space safety, operations, and sustainability of the outer space environment. This challenge calls for responsive, effective satellite object detection (SOD) solutions that allow a small satellite to assess and respond to collision risks, with the consideration of constrained resources on a small satellite platform. This paper discusses the SOD tasks and onboard deep learning (DL) approach to the tasks. Two new DL models are proposed, called GELAN-ViT and GELAN-RepViT, which incorporate vision transformer (ViT) into the Generalized Efficient Layer Aggregation Network (GELAN) architecture and address limitations by separating the convolutional neural network and ViT paths. These models outperform the state-of-the-art YOLOv9-t in terms of mean average precision (mAP) and computational costs. On the SOD dataset, our proposed models can achieve around 95% mAP50 with giga-floating point operations (GFLOPs) reduced by over 5.0. On the VOC 2012 dataset, they can achieve $\geq$ 60.7% mAP50 with GFLOPs reduced by over 5.2.

Title: QFAM: Mitigating QUIC Handshake Flooding Attacks Through Crypto Challenges

Authors: Abdollah Jabbari, Y A Joarder, Benjamin Teyssier, Carol Fung
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.08936
Pdf URL: https://arxiv.org/pdf/2412.08936
Copy Paste: [[2412.08936]] QFAM: Mitigating QUIC Handshake Flooding Attacks Through Crypto Challenges(https://arxiv.org/abs/2412.08936)
Keywords: security, defense, attack
Abstract: QUIC protocol is primarily designed to optimize web performance and security. However, previous research has pointed out that it is vulnerable to handshake flooding attacks. Attackers can send excessive volume of handshaking requests to exhaust the CPU resource of the server, through utilizing the large CPU amplification factor occurred during the handshake process under attack. In this paper, we introduce a novel defense mechanism by introducing the concept of crypto challenges into the handshake protocol. This enhancement involves a proposal of modifying the RETRY token to integrate a cryptographic challenge into it. The client must solve crypto challenges during the handshake process in order to receive a high priority on the server side. By properly choosing the difficulty level of the challenges, the CPU amplification can be reduced, thus the DDoS vulnerability is naturalized. We evaluated the effectiveness of our proposed solution by integrating the crypto challenges into the clients and server of \textit{aioquic}. Our experimental results demonstrate that our solution can effectively balance the resource usage between the attacker and the server during of handshake flooding attacks while maintaining a low overhead for legitimate clients.

Title: Optimized Gradient Clipping for Noisy Label Learning

Authors: Xichen Ye, Yifan Wu, Weizhong Zhang, Xiaoqiang Li, Yifan Chen, Cheng Jin
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.08941
Pdf URL: https://arxiv.org/pdf/2412.08941
Copy Paste: [[2412.08941]] Optimized Gradient Clipping for Noisy Label Learning(https://arxiv.org/abs/2412.08941)
Keywords: robust
Abstract: Previous research has shown that constraining the gradient of loss function with respect to model-predicted probabilities can enhance the model robustness against noisy labels. These methods typically specify a fixed optimal threshold for gradient clipping through validation data to obtain the desired robustness against noise. However, this common practice overlooks the dynamic distribution of gradients from both clean and noisy-labeled samples at different stages of training, significantly limiting the model capability to adapt to the variable nature of gradients throughout the training process. To address this issue, we propose a simple yet effective approach called Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. This approach allows us to modify the clipping threshold at each training step, effectively controlling the influence of noise gradients. Additionally, we provide statistical analysis to certify the noise-tolerance ability of OGC. Our extensive experiments across various types of label noise, including symmetric, asymmetric, instance-dependent, and real-world noise, demonstrate the effectiveness of our approach. The code and a technical appendix for better digital viewing are included as supplementary materials and scheduled to be open-sourced upon publication.

Title: MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for Multi-Task Learning

Authors: Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua Zhou
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08946
Pdf URL: https://arxiv.org/pdf/2412.08946
Copy Paste: [[2412.08946]] MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for Multi-Task Learning(https://arxiv.org/abs/2412.08946)
Keywords: robust
Abstract: Recently, LoRA has emerged as a crucial technique for fine-tuning large pre-trained models, yet its performance in multi-task learning scenarios often falls short. In contrast, the MoE architecture presents a natural solution to this issue. However, it introduces challenges such as mutual interference of data across multiple domains and knowledge forgetting of various tasks. Additionally, MoE significantly increases the number of parameters, posing a computational cost challenge. Therefore, in this paper, we propose MoSLD, a mixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these challenges by sharing the upper projection matrix in LoRA among different experts, encouraging the model to learn general knowledge across tasks, while still allowing the lower projection matrix to focus on the unique features of each task. The application of dropout alleviates the imbalanced update of parameter matrix and mitigates parameter overfitting in LoRA. Extensive experiments demonstrate that our model exhibits excellent performance in both single-task and multi-task scenarios, with robust out-of-domain generalization capabilities.

Title: Selective Visual Prompting in Vision Mamba

Authors: Yifeng Yao, Zichen Liu, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08947
Pdf URL: https://arxiv.org/pdf/2412.08947
Copy Paste: [[2412.08947]] Selective Visual Prompting in Vision Mamba(https://arxiv.org/abs/2412.08947)
Keywords: extraction, transformer
Abstract: Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods. Our code is available at this https URL.

Title: Mojito: Motion Trajectory and Intensity Control for Video Generation

Authors: Xuehai He, Shuohang Wang, Jianwei Yang, Xiaoxia Wu, Yiping Wang, Kuan Wang, Zheng Zhan, Olatunji Ruwase, Yelong Shen, Xin Eric Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08948
Pdf URL: https://arxiv.org/pdf/2412.08948
Copy Paste: [[2412.08948]] Mojito: Motion Trajectory and Intensity Control for Video Generation(https://arxiv.org/abs/2412.08948)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. This paper introduces Mojito, a diffusion model that incorporates both \textbf{Mo}tion tra\textbf{j}ectory and \textbf{i}ntensi\textbf{t}y contr\textbf{o}l for text to video generation. Specifically, Mojito features a Directional Motion Control module that leverages cross-attention to efficiently direct the generated object's motion without additional training, alongside a Motion Intensity Modulator that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities, providing realistic dynamics that align well with natural motion in real-world scenarios.

Title: Align, Generate, Learn: A Novel Closed-Loop Framework for Cross-Lingual In-Context Learning

Authors: Mateo Alejandro Rojas, Rafael Carranza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08955
Pdf URL: https://arxiv.org/pdf/2412.08955
Copy Paste: [[2412.08955]] Align, Generate, Learn: A Novel Closed-Loop Framework for Cross-Lingual In-Context Learning(https://arxiv.org/abs/2412.08955)
Keywords: robust, generative, large language model
Abstract: Cross-lingual in-context learning (XICL) has emerged as a transformative paradigm for leveraging large language models (LLMs) to tackle multilingual tasks, especially for low-resource languages. However, existing approaches often rely on external retrievers or task-specific fine-tuning, limiting their scalability and generalizability. In this paper, we propose a novel self-supervised framework that harnesses the generative capabilities of LLMs to internally select and utilize task-relevant examples. Our method introduces two key objectives: a retrieval-generation alignment loss to optimize the quality of selected examples and a semantic coherence loss to ensure cross-lingual consistency. Through extensive experiments on multilingual benchmarks, our approach achieves state-of-the-art performance, significantly outperforming existing baselines. Further analysis highlights its robustness across diverse language families and its ability to generalize to unseen tasks. Human evaluations confirm the superior fluency, relevance, and semantic correctness of outputs generated by our method. This work provides a scalable, effective, and generalizable solution for cross-lingual in-context learning.

Title: BA-ORABE: Blockchain-Based Auditable Registered Attribute-Based Encryption With Reliable Outsourced Decryption

Authors: Dongliang Cai, Borui Chen, Liang Zhang, Haibin Kan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.08957
Pdf URL: https://arxiv.org/pdf/2412.08957
Copy Paste: [[2412.08957]] BA-ORABE: Blockchain-Based Auditable Registered Attribute-Based Encryption With Reliable Outsourced Decryption(https://arxiv.org/abs/2412.08957)
Keywords: security, protect, fair
Abstract: Attribute-based encryption (ABE) is a generalization of public-key encryption that enables fine-grained access control in cloud services. Recently, Hohenberger et al. (Eurocrypt 2023) introduced the notion of registered ABE, which is an ABE scheme without a trusted central authority. Instead, users generate their own public/secret keys and then register their keys and attributes with a key curator. The key curator is a transparent and untrusted entity and its behavior needs to be audited for malicious registration. In addition, pairing-based registered ABE still suffers the heavy decryption overhead like ABE. A general approach to address this issue is to outsource decryption to a decryption cloud service (DCS). In this work, we propose BA-ORABE, the first fully auditable registered ABE with reliable outsource decryption scheme based on blockchain. First, we utilize a verifiable tag mechanism to achieve verifiability of ciphertext transformation, and the exemptibility which enables the honest DCS to escape from wrong claims is guaranteed by zero knowledge fraud proof under optimistic assumption. Additionally, our system achieves fairness and decentralized outsourcing to protect the interests of all parties and the registration and outsourcing process are transparent and fully auditable through blockchain. Finally, we give formal security analysis and implement and evaluate our scheme on Ethereum to demonstrate its feasibility and efficiency.

Title: AFFAKT: A Hierarchical Optimal Transport based Method for Affective Facial Knowledge Transfer in Video Deception Detection

Authors: Zihan Ji, Xuetao Tian, Ye Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08965
Pdf URL: https://arxiv.org/pdf/2412.08965
Copy Paste: [[2412.08965]] AFFAKT: A Hierarchical Optimal Transport based Method for Affective Facial Knowledge Transfer in Video Deception Detection(https://arxiv.org/abs/2412.08965)
Keywords: interpretability
Abstract: The scarcity of high-quality large-scale labeled datasets poses a huge challenge for employing deep learning models in video deception detection. To address this issue, inspired by the psychological theory on the relation between deception and expressions, we propose a novel method called AFFAKT in this paper, which enhances the classification performance by transferring useful and correlated knowledge from a large facial expression dataset. Two key challenges in knowledge transfer arise: 1) \textit{how much} knowledge of facial expression data should be transferred and 2) \textit{how to} effectively leverage transferred knowledge for the deception classification model during inference. Specifically, the optimal relation mapping between facial expression classes and deception samples is firstly quantified using proposed H-OTKT module and then transfers knowledge from the facial expression dataset to deception samples. Moreover, a correlation prototype within another proposed module SRKB is well designed to retain the invariant correlations between facial expression classes and deception classes through momentum updating. During inference, the transferred knowledge is fine-tuned with the correlation prototype using a sample-specific re-weighting strategy. Experimental results on two deception detection datasets demonstrate the superior performance of our proposed method. The interpretability study reveals high associations between deception and negative affections, which coincides with the theory in psychology.

Title: Deep Learning Model Security: Threats and Defenses

Authors: Tianyang Wang, Ziqian Bi, Yichao Zhang, Ming Liu, Weiche Hsieh, Pohsun Feng, Lawrence K.Q. Yan, Yizhu Wen, Benji Peng, Junyu Liu, Keyu Chen, Sen Zhang, Ming Li, Chuanqi Jiang, Xinyuan Song, Junjie Yang, Bowen Jing, Jintao Ren, Junhao Song, Hong-Ming Tseng, Silin Chen, Yunze Wang, Chia Xin Liang, Jiawei Xu, Xuanhe Pan, Jinlang Wang, Qian Niu
Subjects: cs.CR, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.08969
Pdf URL: https://arxiv.org/pdf/2412.08969
Copy Paste: [[2412.08969]] Deep Learning Model Security: Threats and Defenses(https://arxiv.org/abs/2412.08969)
Keywords: security, privacy, defense, attack, robust, federate
Abstract: Deep learning has transformed AI applications but faces critical security challenges, including adversarial attacks, data poisoning, model theft, and privacy leakage. This survey examines these vulnerabilities, detailing their mechanisms and impact on model integrity and confidentiality. Practical implementations, including adversarial examples, label flipping, and backdoor attacks, are explored alongside defenses such as adversarial training, differential privacy, and federated learning, highlighting their strengths and limitations. Advanced methods like contrastive and self-supervised learning are presented for enhancing robustness. The survey concludes with future directions, emphasizing automated defenses, zero-trust architectures, and the security challenges of large AI models. A balanced approach to performance and security is essential for developing reliable deep learning systems.

Title: Reasoning-Aware Query-Focused Summarization over Multi-Table Data

Authors: Xiaochuan Lin, Xiangyong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08970
Pdf URL: https://arxiv.org/pdf/2412.08970
Copy Paste: [[2412.08970]] Reasoning-Aware Query-Focused Summarization over Multi-Table Data(https://arxiv.org/abs/2412.08970)
Keywords: robust, generative, large language model
Abstract: Query-focused summarization over multi-table data is a challenging yet critical task for extracting precise and relevant information from structured data. Existing methods often rely on complex preprocessing steps and struggle to generalize across domains or handle the logical reasoning required for multi-table queries. In this paper, we propose QueryTableSummarizer++, an end-to-end generative framework leveraging large language models (LLMs) enhanced with table-aware pre-training, query-aligned fine-tuning, and reinforcement learning with feedback. Our method eliminates the need for intermediate serialization steps and directly generates query-relevant summaries. Experiments on a benchmark dataset demonstrate that QueryTableSummarizer++ significantly outperforms state-of-the-art baselines in terms of BLEU, ROUGE, and F1-score. Additional analyses highlight its scalability, generalization across domains, and robust handling of complex queries. Human evaluation further validates the superior quality and practical applicability of the generated summaries, establishing QueryTableSummarizer++ as a highly effective solution for multi-table summarization tasks.

Title: RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Authors: Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08972
Pdf URL: https://arxiv.org/pdf/2412.08972
Copy Paste: [[2412.08972]] RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios(https://arxiv.org/abs/2412.08972)
Keywords: large language model
Abstract: This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly in the benchmark. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.

Title: Elevating Flow-Guided Video Inpainting with Reference Generation

Authors: Suhwan Cho, Seoung Wug Oh, Sangyoun Lee, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08975
Pdf URL: https://arxiv.org/pdf/2412.08975
Copy Paste: [[2412.08975]] Elevating Flow-Guided Video Inpainting with Reference Generation(https://arxiv.org/abs/2412.08975)
Keywords: robust, generative
Abstract: Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions. Furthermore, it can process high-resolution videos exceeding 2K resolution with ease, underscoring its superiority for real-world applications.

Title: Assessing the Robustness of Retrieval-Augmented Generation Systems in K-12 Educational Question Answering with Knowledge Discrepancies

Authors: Tianshi Zheng, Weihan Li, Jiaxin Bai, Weiqi Wang, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08985
Pdf URL: https://arxiv.org/pdf/2412.08985
Copy Paste: [[2412.08985]] Assessing the Robustness of Retrieval-Augmented Generation Systems in K-12 Educational Question Answering with Knowledge Discrepancies(https://arxiv.org/abs/2412.08985)
Keywords: robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) systems have demonstrated remarkable potential as question answering systems in the K-12 Education domain, where knowledge is typically queried within the restricted scope of authoritative textbooks. However, the discrepancy between textbooks and the parametric knowledge in Large Language Models (LLMs) could undermine the effectiveness of RAG systems. To systematically investigate the robustness of RAG systems under such knowledge discrepancies, we present EduKDQA, a question answering dataset that simulates knowledge discrepancies in real applications by applying hypothetical knowledge updates in answers and source documents. EduKDQA includes 3,005 questions covering five subjects, under a comprehensive question typology from the perspective of context utilization and knowledge integration. We conducted extensive experiments on retrieval and question answering performance. We find that most RAG systems suffer from a substantial performance drop in question answering with knowledge discrepancies, while questions that require integration of contextual knowledge and parametric knowledge pose a challenge to LLMs.

Title: CBCMS: A Compliance Management System for Cross-Border Data Transfer

Authors: Zhixian Zhuang, Xiaodong Lee, Jiuqi Wei, Yufan Fu, Aiyao Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.08993
Pdf URL: https://arxiv.org/pdf/2412.08993
Copy Paste: [[2412.08993]] CBCMS: A Compliance Management System for Cross-Border Data Transfer(https://arxiv.org/abs/2412.08993)
Keywords: protect
Abstract: Cross-border data transfer is vital for the digital economy by enabling data flow across different countries or regions. However, ensuring compliance with diverse data protection regulations during the transfer introduces significant complexities. Existing solutions either focus on a single legal framework or neglect real-time and concurrent processing demands, resulting in incomplete and inconsistent compliance management. To address this issue, we propose Cross-Border Compliance Management System (CBCMS), which not only enables the unified management of data processing policies across multiple jurisdictions to ensure compliance with various legal frameworks involved in cross-border data transfer, but also supports real-time and high-concurrency processing capabilities. We design Policy Definition Language (PDL) that supports the unified management of data processing policies, bridging the gap between natural language policies and machine-processable expressions, thereby allowing various legal frameworks to be seamlessly integrated into CBCMS. We present Compliance Policy Generation Model (CPGM), the core component of CBCMS, which generates compliant data processing policies with high accuracy, achieving up to 25.16% improvement in F1 score (reaching 97.32%) compared to rule-based baseline. CPGM achieves inference time in the order of milliseconds (6 to 13 ms), and keeps low latency even under high-load scenarios, demonstrating high real-time and concurrent performance. To our knowledge, CBCMS is the first system to support unified compliance management across jurisdictions while ensuring real-time and concurrent processing capabilities.

Title: MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments

Authors: Yuqi Tong, Yue Qiu, Ruiyang Li, Shi Qiu, Pheng-Ann Heng
Subjects: cs.CV, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2412.09008
Pdf URL: https://arxiv.org/pdf/2412.09008
Copy Paste: [[2412.09008]] MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments(https://arxiv.org/abs/2412.09008)
Keywords: generative
Abstract: We present MS2Mesh-XR, a novel multi-modal sketch-to-mesh generation pipeline that enables users to create realistic 3D objects in extended reality (XR) environments using hand-drawn sketches assisted by voice inputs. In specific, users can intuitively sketch objects using natural hand movements in mid-air within a virtual environment. By integrating voice inputs, we devise ControlNet to infer realistic images based on the drawn sketches and interpreted text prompts. Users can then review and select their preferred image, which is subsequently reconstructed into a detailed 3D mesh using the Convolutional Reconstruction Model. In particular, our proposed pipeline can generate a high-quality 3D mesh in less than 20 seconds, allowing for immersive visualization and manipulation in run-time XR scenes. We demonstrate the practicability of our pipeline through two use cases in XR settings. By leveraging natural user inputs and cutting-edge generative AI capabilities, our approach can significantly facilitate XR-based creative production and enhance user experiences. Our code and demo will be available at: this https URL

Title: A physics-informed transformer neural operator for learning generalized solutions of initial boundary value problems

Authors: Sumanth Kumar Boya, Deepak Subramani
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2412.09009
Pdf URL: https://arxiv.org/pdf/2412.09009
Copy Paste: [[2412.09009]] A physics-informed transformer neural operator for learning generalized solutions of initial boundary value problems(https://arxiv.org/abs/2412.09009)
Keywords: transformer
Abstract: Initial boundary value problems arise commonly in applications with engineering and natural systems governed by nonlinear partial differential equations (PDEs). Operator learning is an emerging field for solving these equations by using a neural network to learn a map between infinite dimensional input and output function spaces. These neural operators are trained using a combination of data (observations or simulations) and PDE-residuals (physics-loss). A major drawback of existing neural approaches is the requirement to retrain with new initial/boundary conditions, and the necessity for a large amount of simulation data for training. We develop a physics-informed transformer neural operator (named PINTO) that efficiently generalizes to unseen initial and boundary conditions, trained in a simulation-free setting using only physics loss. The main innovation lies in our new iterative kernel integral operator units, implemented using cross-attention, to transform the PDE solution's domain points into an initial/boundary condition-aware representation vector, enabling efficient learning of the solution function for new scenarios. The PINTO architecture is applied to simulate the solutions of important equations used in engineering applications: advection, Burgers, and steady and unsteady Navier-Stokes equations (three flow scenarios). For these five test cases, we show that the relative errors during testing under challenging conditions of unseen initial/boundary conditions are only one-fifth to one-third of other leading physics informed operator learning methods. Moreover, our PINTO model is able to accurately solve the advection and Burgers equations at time steps that are not included in the training collocation points. The code is available at $\texttt{this https URL}$

Title: What Makes Cryptic Crosswords Challenging for LLMs?

Authors: Abdelrahman Sadallah, Daria Kotova, Ekaterina Kochmar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09012
Pdf URL: https://arxiv.org/pdf/2412.09012
Copy Paste: [[2412.09012]] What Makes Cryptic Crosswords Challenging for LLMs?(https://arxiv.org/abs/2412.09012)
Keywords: large language model
Abstract: Cryptic crosswords are puzzles that rely on general knowledge and the solver's ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish the benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and introduced datasets at this https URL.

Title: Arbitrary-steps Image Super-resolution via Diffusion Inversion

Authors: Zongsheng Yue, Kang Liao, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09013
Pdf URL: https://arxiv.org/pdf/2412.09013
Copy Paste: [[2412.09013]] Arbitrary-steps Image Super-resolution via Diffusion Inversion(https://arxiv.org/abs/2412.09013)
Keywords: diffusion
Abstract: This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model are publicly available at this https URL.

Title: STEAM: Squeeze and Transform Enhanced Attention Module

Authors: Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09023
Pdf URL: https://arxiv.org/pdf/2412.09023
Copy Paste: [[2412.09023]] STEAM: Squeeze and Transform Enhanced Attention Module(https://arxiv.org/abs/2412.09023)
Keywords: transformer, segmentation
Abstract: Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.

Title: Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

Authors: Hang Zhou, Jiale Cai, Yuteng Ye, Yonghui Feng, Chenxing Gao, Junqing Yu, Zikai Song, Wei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09026
Pdf URL: https://arxiv.org/pdf/2412.09026
Copy Paste: [[2412.09026]] Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model(https://arxiv.org/abs/2412.09026)
Keywords: diffusion
Abstract: A recent endeavor in one class of video anomaly detection is to leverage diffusion models and posit the task as a generation problem, where the diffusion model is trained to recover normal patterns exclusively, thus reporting abnormal patterns as outliers. Yet, existing attempts neglect the various formations of anomaly and predict normal samples at the feature level regardless that abnormal objects in surveillance videos are often relatively small. To address this, a novel patch-based diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest themselves as deviations in both appearance and motion. Therefore, we argue that a comprehensive solution must consider both of these aspects simultaneously to achieve accurate frame prediction. To address this, we introduce innovative motion and appearance conditions that are seamlessly integrated into our patch diffusion model. These conditions are designed to guide the model in generating coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results in four challenging video anomaly detection datasets empirically substantiate the efficacy of our proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.

Title: Learning and Current Prediction of PMSM Drive via Differential Neural Networks

Authors: Wenjie Mei, Xiaorui Wang, Yanrong Lu, Ke Yu, Shihua Li
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2412.09028
Pdf URL: https://arxiv.org/pdf/2412.09028
Copy Paste: [[2412.09028]] Learning and Current Prediction of PMSM Drive via Differential Neural Networks(https://arxiv.org/abs/2412.09028)
Keywords: robust
Abstract: Learning models for dynamical systems in continuous time is significant for understanding complex phenomena and making accurate predictions. This study presents a novel approach utilizing differential neural networks (DNNs) to model nonlinear systems, specifically permanent magnet synchronous motors (PMSMs), and to predict their current trajectories. The efficacy of our approach is validated through experiments conducted under various load disturbances and no-load conditions. The results demonstrate that our method effectively and accurately reconstructs the original systems, showcasing strong short-term and long-term prediction capabilities and robustness. This study provides valuable insights into learning the inherent dynamics of complex dynamical data and holds potential for further applications in fields such as weather forecasting, robotics, and collective behavior analysis.

Title: RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell Property Prediction

Authors: Zhihao Ding, Ting Zhang, Yiran Li, Jieming Shi, Chen Jason Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09030
Pdf URL: https://arxiv.org/pdf/2412.09030
Copy Paste: [[2412.09030]] RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell Property Prediction(https://arxiv.org/abs/2412.09030)
Keywords: transformer
Abstract: Organic Solar Cells (OSCs) are a promising technology for sustainable energy production. However, the identification of molecules with desired OSC properties typically involves laborious experimental research. To accelerate progress in the field, it is crucial to develop machine learning models capable of accurately predicting the properties of OSC molecules. While graph representation learning has demonstrated success in molecular property prediction, it remains underexplored for OSC-specific tasks. Existing methods fail to capture the unique structural features of OSC molecules, particularly the intricate ring systems that critically influence OSC properties, leading to suboptimal performance. To fill the gap, we present RingFormer, a novel graph transformer framework specially designed to capture both atom and ring level structural patterns in OSC molecules. RingFormer constructs a hierarchical graph that integrates atomic and ring structures and employs a combination of local message passing and global attention mechanisms to generate expressive graph representations for accurate OSC property prediction. We evaluate RingFormer's effectiveness on five curated OSC molecule datasets through extensive experiments. The results demonstrate that RingFormer consistently outperforms existing methods, achieving a 22.77% relative improvement over the nearest competitor on the CEPDB dataset.

Title: Dialogue Language Model with Large-Scale Persona Data Engineering

Authors: Mengze Hong, Chen Zhang, Chaotao Chen, Rongzhong Lian, Di Jiang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2412.09034
Pdf URL: https://arxiv.org/pdf/2412.09034
Copy Paste: [[2412.09034]] Dialogue Language Model with Large-Scale Persona Data Engineering(https://arxiv.org/abs/2412.09034)
Keywords: robust, extraction, generative
Abstract: Maintaining persona consistency is paramount in the application of open-domain dialogue systems, as exemplified by models like ChatGPT. Despite significant advancements, the limited scale and diversity of current persona dialogue datasets remain challenges to achieving robust persona-consistent dialogue models. In this study, drawing inspiration from the success of large-scale pre-training, we introduce PPDS, an open-domain persona dialogue system that employs extensive generative pre-training on a persona dialogue dataset to enhance persona consistency. Specifically, we present a persona extraction model designed to autonomously and precisely generate vast persona dialogue datasets. Additionally, we unveil a pioneering persona augmentation technique to address the invalid persona bias inherent in the constructed dataset. Both quantitative and human evaluations consistently highlight the superior response quality and persona consistency of our proposed model, underscoring its effectiveness.

Title: ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

Authors: Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09036
Pdf URL: https://arxiv.org/pdf/2412.09036
Copy Paste: [[2412.09036]] ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty(https://arxiv.org/abs/2412.09036)
Keywords: large language model
Abstract: Large Language models (LLMs) have become a research hotspot. To accelerate the inference of LLMs, storing computed caches in memory has become the standard technique. However, as the inference length increases, growing KV caches might lead to out-of-memory issues. Many existing methods address this issue through KV cache compression, primarily by preserving key tokens throughout all layers to reduce information loss. Most of them allocate a uniform budget size for each layer to retain. However, we observe that the minimum budget sizes needed to retain essential information vary across layers and models based on the perspectives of attention and hidden state output. Building on this observation, this paper proposes a simple yet effective KV cache compression method that leverages layer uncertainty to allocate budget size for each layer. Experimental results show that the proposed method can reduce memory usage of the KV caches to only $\sim$20\% when compared to Full KV inference while achieving nearly lossless performance.

Title: Beyond Confusion: A Fine-grained Dialectical Examination of Human Activity Recognition Benchmark Datasets

Authors: Daniel Geissler, Dominique Nshimyimana, Vitor Fortes Rey, Sungho Suh, Bo Zhou, Paul Lukowicz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09037
Pdf URL: https://arxiv.org/pdf/2412.09037
Copy Paste: [[2412.09037]] Beyond Confusion: A Fine-grained Dialectical Examination of Human Activity Recognition Benchmark Datasets(https://arxiv.org/abs/2412.09037)
Keywords: transformer
Abstract: The research of machine learning (ML) algorithms for human activity recognition (HAR) has made significant progress with publicly available datasets. However, most research prioritizes statistical metrics over examining negative sample details. While recent models like transformers have been applied to HAR datasets with limited success from the benchmark metrics, their counterparts have effectively solved problems on similar levels with near 100% accuracy. This raises questions about the limitations of current approaches. This paper aims to address these open questions by conducting a fine-grained inspection of six popular HAR benchmark datasets. We identified for some parts of the data, none of the six chosen state-of-the-art ML methods can correctly classify, denoted as the intersect of false classifications (IFC). Analysis of the IFC reveals several underlying problems, including ambiguous annotations, irregularities during recording execution, and misaligned transition periods. We contribute to the field by quantifying and characterizing annotated data ambiguities, providing a trinary categorization mask for dataset patching, and stressing potential improvements for future data collections.

Title: Motif Guided Graph Transformer with Combinatorial Skeleton Prototype Learning for Skeleton-Based Person Re-Identification

Authors: Haocong Rao, Chunyan Miao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09044
Pdf URL: https://arxiv.org/pdf/2412.09044
Copy Paste: [[2412.09044]] Motif Guided Graph Transformer with Combinatorial Skeleton Prototype Learning for Skeleton-Based Person Re-Identification(https://arxiv.org/abs/2412.09044)
Keywords: transformer
Abstract: Person re-identification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints' structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.

Title: Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation

Authors: Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Yang Hou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09045
Pdf URL: https://arxiv.org/pdf/2412.09045
Copy Paste: [[2412.09045]] Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation(https://arxiv.org/abs/2412.09045)
Keywords: robust, segmentation
Abstract: Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from speech-text parallel data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries. Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1,000 sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.

Title: Multi-Task Learning with LLMs for Implicit Sentiment Analysis: Data-level and Task-level Automatic Weight Learning

Authors: Wenna Lai, Haoran Xie, Guandong Xu, Qing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09046
Pdf URL: https://arxiv.org/pdf/2412.09046
Copy Paste: [[2412.09046]] Multi-Task Learning with LLMs for Implicit Sentiment Analysis: Data-level and Task-level Automatic Weight Learning(https://arxiv.org/abs/2412.09046)
Keywords: generative, large language model
Abstract: Implicit sentiment analysis (ISA) presents significant challenges due to the absence of salient cue words. Previous methods have struggled with insufficient data and limited reasoning capabilities to infer underlying opinions. Integrating multi-task learning (MTL) with large language models (LLMs) offers the potential to enable models of varying sizes to reliably perceive and recognize genuine opinions in ISA. However, existing MTL approaches are constrained by two sources of uncertainty: data-level uncertainty, arising from hallucination problems in LLM-generated contextual information, and task-level uncertainty, stemming from the varying capacities of models to process contextual information. To handle these uncertainties, we introduce MT-ISA, a novel MTL framework that enhances ISA by leveraging the generation and reasoning capabilities of LLMs through automatic MTL. Specifically, MT-ISA constructs auxiliary tasks using generative LLMs to supplement sentiment elements and incorporates automatic MTL to fully exploit auxiliary data. We introduce data-level and task-level automatic weight learning (AWL), which dynamically identifies relationships and prioritizes more reliable data and critical tasks, enabling models of varying sizes to adaptively learn fine-grained weights based on their reasoning capabilities. We investigate three strategies for data-level AWL, while also introducing homoscedastic uncertainty for task-level AWL. Extensive experiments reveal that models of varying sizes achieve an optimal balance between primary prediction and auxiliary tasks in MT-ISA. This underscores the effectiveness and adaptability of our approach.

Title: Dial-In LLM: Human-Aligned Dialogue Intent Clustering with LLM-in-the-loop

Authors: Mengze Hong, Yuanfeng Song, Di Jiang, Wailing Ng, Yanjie Sun, Chen Jason Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09049
Pdf URL: https://arxiv.org/pdf/2412.09049
Copy Paste: [[2412.09049]] Dial-In LLM: Human-Aligned Dialogue Intent Clustering with LLM-in-the-loop(https://arxiv.org/abs/2412.09049)
Keywords: robust, large language model
Abstract: The discovery of customer intention from dialogue plays an important role in automated support system. However, traditional text clustering methods are poorly aligned with human perceptions due to the shift from embedding distance to semantic distance, and existing quantitative metrics for text clustering may not accurately reflect the true quality of intent clusters. In this paper, we leverage the superior language understanding capabilities of Large Language Models (LLMs) for designing better-calibrated intent clustering algorithms. We first establish the foundation by verifying the robustness of fine-tuned LLM utility in semantic coherence evaluation and cluster naming, resulting in an accuracy of 97.50% and 94.40%, respectively, when compared to the human-labeled ground truth. Then, we propose an iterative clustering algorithm that facilitates cluster-level refinement and the continuous discovery of high-quality intent clusters. Furthermore, we present several LLM-in-the-loop semi-supervised clustering techniques tailored for intent discovery from customer service dialogue. Experiments on a large-scale industrial dataset comprising 1,507 intent clusters demonstrate the effectiveness of the proposed techniques. The methods outperformed existing counterparts, achieving 6.25% improvement in quantitative metrics and 12% enhancement in application-level performance when constructing an intent classifier.

Title: ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Authors: Mingda Jia, Liming Zhao, Ge Li, Yun Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09050
Pdf URL: https://arxiv.org/pdf/2412.09050
Copy Paste: [[2412.09050]] ContextHOI: Spatial Context Learning for Human-Object Interaction Detection(https://arxiv.org/abs/2412.09050)
Keywords: transformer
Abstract: Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advancements in HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-craft background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and v-coco benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.

Title: Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D Images

Authors: Wenrui Li, Zhe Yang, Wei Han, Hengyu Man, Xingtao Wang, Xiaopeng Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09055
Pdf URL: https://arxiv.org/pdf/2412.09055
Copy Paste: [[2412.09055]] Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D Images(https://arxiv.org/abs/2412.09055)
Keywords: extraction
Abstract: Reconstructing desired objects and scenes has long been a primary goal in 3D computer vision. Single-view point cloud reconstruction has become a popular technique due to its low cost and accurate results. However, single-view reconstruction methods often rely on expensive CAD models and complex geometric priors. Effectively utilizing prior knowledge about the data remains a challenge. In this paper, we introduce hyperbolic space to 3D point cloud reconstruction, enabling the model to represent and understand complex hierarchical structures in point clouds with low distortion. We build upon previous methods by proposing a hyperbolic Chamfer distance and a regularized triplet loss to enhance the relationship between partial and complete point clouds. Additionally, we design adaptive boundary conditions to improve the model's understanding and reconstruction of 3D structures. Our model outperforms most existing models, and ablation studies demonstrate the significance of our model and its components. Experimental results show that our method significantly improves feature extraction capabilities. Our model achieves outstanding performance in 3D reconstruction tasks.

Title: PhishIntel: Toward Practical Deployment of Reference-based Phishing Detection

Authors: Yuexin Li, Hiok Kuek Tan, Qiaoran Meng, Mei Lin Lock, Tri Cao, Shumin Deng, Nay Oo, Hoon Wei Lim, Bryan Hooi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.09057
Pdf URL: https://arxiv.org/pdf/2412.09057
Copy Paste: [[2412.09057]] PhishIntel: Toward Practical Deployment of Reference-based Phishing Detection(https://arxiv.org/abs/2412.09057)
Keywords: robust
Abstract: Phishing is a critical cyber threat, exploiting deceptive tactics to compromise victims and cause significant financial losses. While reference-based phishing detectors (RBPDs) achieve high precision by analyzing brand-domain consistency, their real-world deployment is hindered by challenges such as high latency and inefficiency in URL analysis. To address these limitations, we present PhishIntel, an end-to-end phishing detection system for real-world deployment. PhishIntel intelligently determines whether a URL can be processed immediately or not, segmenting the detection process into two distinct tasks: a fast task that checks against local blacklists and result cache, and a slow task that conducts online blacklist verification, URL crawling, and webpage analysis using an RBPD. This fast-slow task system architecture ensures low response latency while retaining the robust detection capabilities of RBPDs for zero-day phishing threats. Furthermore, we develop two downstream applications based on PhishIntel: a phishing intelligence platform and a phishing email detection plugin for Microsoft Outlook, demonstrating its practical efficacy and utility.

Title: Go With the Flow: Fast Diffusion for Gaussian Mixture Models

Authors: George Rapakoulias, Ali Reza Pedram, Panagiotis Tsiotras
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09059
Pdf URL: https://arxiv.org/pdf/2412.09059
Copy Paste: [[2412.09059]] Go With the Flow: Fast Diffusion for Gaussian Mixture Models(https://arxiv.org/abs/2412.09059)
Keywords: diffusion
Abstract: Schrödinger Bridges (SB) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. Furthermore, our method generalizes naturally to more general classes of dynamical systems such as controllable Linear Time-Varying systems that cannot currently be solved using traditional neural SB approaches. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, and various other examples. We also benchmark our approach on an Entropic Optimal Transport (EOT) problem and show that it outperforms state-of-the-art methods in cases where the boundary distributions are mixture models while requiring virtually no training.

Title: An Efficient Framework for Enhancing Discriminative Models via Diffusion Techniques

Authors: Chunxiao Li, Xiaoxiao Wang, Boming Miao, Chuanlong Xie, Zizhe Wang, Yao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09063
Pdf URL: https://arxiv.org/pdf/2412.09063
Copy Paste: [[2412.09063]] An Efficient Framework for Enhancing Discriminative Models via Diffusion Techniques(https://arxiv.org/abs/2412.09063)
Keywords: diffusion, transformer, generative
Abstract: Image classification serves as the cornerstone of computer vision, traditionally achieved through discriminative models based on deep neural networks. Recent advancements have introduced classification methods derived from generative models, which offer the advantage of zero-shot classification. However, these methods suffer from two main drawbacks: high computational overhead and inferior performance compared to discriminative models. Inspired by the coordinated cognitive processes of rapid-slow pathway interactions in the human brain during visual signal recognition, we propose the Diffusion-Based Discriminative Model Enhancement Framework (DBMEF). This framework seamlessly integrates discriminative and generative models in a training-free manner, leveraging discriminative models for initial predictions and endowing deep neural networks with rethinking capabilities via diffusion models. Consequently, DBMEF can effectively enhance the classification accuracy and generalization capability of discriminative models in a plug-and-play manner. We have conducted extensive experiments across 17 prevalent deep model architectures with different training methods, including both CNN-based models such as ResNet and Transformer-based models like ViT, to demonstrate the effectiveness of the proposed DBMEF. Specifically, the framework yields a 1.51\% performance improvement for ResNet-50 on the ImageNet dataset and 3.02\% on the ImageNet-A dataset. In conclusion, our research introduces a novel paradigm for image classification, demonstrating stable improvements across different datasets and neural networks.

Title: SVasP: Self-Versatility Adversarial Style Perturbation for Cross-Domain Few-Shot Learning

Authors: Wenqian Li, Pengfei Fang, Hui Xue
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09073
Pdf URL: https://arxiv.org/pdf/2412.09073
Copy Paste: [[2412.09073]] SVasP: Self-Versatility Adversarial Style Perturbation for Cross-Domain Few-Shot Learning(https://arxiv.org/abs/2412.09073)
Keywords: robust
Abstract: Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen source domains to unseen target domains, which is crucial for evaluating the generalization and robustness of models. Recent studies focus on utilizing visual styles to bridge the domain gap between different domains. However, the serious dilemma of gradient instability and local optimization problem occurs in those style-based CD-FSL methods. This paper addresses these issues and proposes a novel crop-global style perturbation method, called \underline{\textbf{S}}elf-\underline{\textbf{V}}ersatility \underline{\textbf{A}}dversarial \underline{\textbf{S}}tyle \underline{\textbf{P}}erturbation (\textbf{SVasP}), which enhances the gradient stability and escapes from poor sharp minima jointly. Specifically, SVasP simulates more diverse potential target domain adversarial styles via diversifying input patterns and aggregating localized crop style gradients, to serve as global style perturbation stabilizers within one image, a concept we refer to as self-versatility. Then a novel objective function is proposed to maximize visual discrepancy while maintaining semantic consistency between global, crop, and adversarial features. Having the stabilized global style perturbation in the training phase, one can obtain a flattened minima in the loss landscape, boosting the transferability of the model to the target domains. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art methods. Our codes are available at this https URL.

Title: Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

Authors: Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09078
Pdf URL: https://arxiv.org/pdf/2412.09078
Copy Paste: [[2412.09078]] Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning(https://arxiv.org/abs/2412.09078)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable abilities across various language tasks, but solving complex reasoning problems remains a challenge. While existing methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT utilizes sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction and learning from past mistakes, as well as consensus-guided decision making strategies to optimize correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.

Title: Neural Networks for Threshold Dynamics Reconstruction

Authors: Elisa Negrini, Almanzo Jiahe Gao, Abigail Bowering, Wei Zhu, Luca Capogna
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2412.09079
Pdf URL: https://arxiv.org/pdf/2412.09079
Copy Paste: [[2412.09079]] Neural Networks for Threshold Dynamics Reconstruction(https://arxiv.org/abs/2412.09079)
Keywords: robust
Abstract: We introduce two convolutional neural network (CNN) architectures, inspired by the Merriman-Bence-Osher (MBO) algorithm and by cellular automatons, to model and learn threshold dynamics for front evolution from video data. The first model, termed the (single-dynamics) MBO network, learns a specific kernel and threshold for each input video without adapting to new dynamics, while the second, a meta-learning MBO network, generalizes across diverse threshold dynamics by adapting its parameters per input. Both models are evaluated on synthetic and real-world videos (ice melting and fire front propagation), with performance metrics indicating effective reconstruction and extrapolation of evolving boundaries, even under noisy conditions. Empirical results highlight the robustness of both networks across varied synthetic and real-world dynamics.

Title: Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Authors: Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09082
Pdf URL: https://arxiv.org/pdf/2412.09082
Copy Paste: [[2412.09082]] Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method(https://arxiv.org/abs/2412.09082)
Keywords: robust
Abstract: Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. Furthermore, to support LH-VLN, we develop an automated data generation platform NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weight by Ground Truth (CGT) metrics, to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.

Title: Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion

Authors: Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09094
Pdf URL: https://arxiv.org/pdf/2412.09094
Copy Paste: [[2412.09094]] Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion(https://arxiv.org/abs/2412.09094)
Keywords: large language model
Abstract: Large Language Models (LLMs) present massive inherent knowledge and superior semantic comprehension capability, which have revolutionized various tasks in natural language processing. Despite their success, a critical gap remains in enabling LLMs to perform knowledge graph completion (KGC). Empirical evidence suggests that LLMs consistently perform worse than conventional KGC approaches, even through sophisticated prompt design or tailored instruction-tuning. Fundamentally, applying LLMs on KGC introduces several critical challenges, including a vast set of entity candidates, hallucination issue of LLMs, and under-exploitation of the graph structure. To address these challenges, we propose a novel instruction-tuning-based method, namely FtG. Specifically, we present a \textit{filter-then-generate} paradigm and formulate the KGC task into a multiple-choice question format. In this way, we can harness the capability of LLMs while mitigating the issue casused by hallucinations. Moreover, we devise a flexible ego-graph serialization prompt and employ a structure-text adapter to couple structure and text information in a contextualized manner. Experimental results demonstrate that FtG achieves substantial performance gain compared to existing state-of-the-art methods. The instruction dataset and code are available at \url{this https URL}.

Title: OriginPruner: Leveraging Method Origins for Guided Call Graph Pruning

Authors: Amir M. Mir, Mehdi Keshani, Sebastian Proksch
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2412.09110
Pdf URL: https://arxiv.org/pdf/2412.09110
Copy Paste: [[2412.09110]] OriginPruner: Leveraging Method Origins for Guided Call Graph Pruning(https://arxiv.org/abs/2412.09110)
Keywords: security
Abstract: Most static program analyses depend on Call Graphs (CGs), including reachability of security vulnerabilities. Static CGs ensure soundness through over-approximation, which results in inflated sizes and imprecision. Recent research has employed machine learning (ML) models to prune false edges and enhance CG precision. However, these models require real-world programs with high test coverage to generalize effectively and the inference is expensive. In this paper, we present OriginPruner, a novel call graph pruning technique that leverages the method origin, which is where a method signature is first introduced within a class hierarchy. By incorporating insights from a localness analysis that investigated the scope of method interactions into our approach, OriginPruner confidently identifies and prunes edges related to these origin methods. Our key findings reveal that (1) dominant origin methods, such as this http URL, significantly impact CG sizes; (2) derivatives of these origin methods are primarily local, enabling safe pruning without affecting downstream inter-procedural analyses; (3) OriginPruner achieves a significant reduction in CG size while maintaining the soundness of CGs for security applications like vulnerability propagation analysis; and (4) OriginPruner introduces minimal computational overhead. These findings underscore the potential of leveraging domain knowledge about the type system for more effective CG pruning, offering a promising direction for future work in static program analysis.

Title: The Utility and Complexity of In- and Out-of-Distribution Machine Unlearning

Authors: Youssef Allouah, Joshua Kazdan, Rachid Guerraoui, Sanmi Koyejo
Subjects: cs.LG, cs.CR, math.OC
Abstract URL: https://arxiv.org/abs/2412.09119
Pdf URL: https://arxiv.org/pdf/2412.09119
Copy Paste: [[2412.09119]] The Utility and Complexity of In- and Out-of-Distribution Machine Unlearning(https://arxiv.org/abs/2412.09119)
Keywords: privacy, robust
Abstract: Machine unlearning, the process of selectively removing data from trained models, is increasingly crucial for addressing privacy concerns and knowledge gaps post-deployment. Despite this importance, existing approaches are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental utility, time, and space complexity trade-offs of approximate unlearning, providing rigorous certification analogous to differential privacy. For in-distribution forget data -- data similar to the retain set -- we show that a surprisingly simple and general procedure, empirical risk minimization with output perturbation, achieves tight unlearning-utility-complexity trade-offs, addressing a previous theoretical gap on the separation from unlearning "for free" via differential privacy, which inherently facilitates the removal of such data. However, such techniques fail with out-of-distribution forget data -- data significantly different from the retain set -- where unlearning time complexity can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes unlearning time complexity without compromising utility.

Title: LVMark: Robust Watermark for latent video diffusion models

Authors: MinHyuk Jang, Youngdong Jang, JaeHyeok Lee, Kodai Kawamura, Feng Yang, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09122
Pdf URL: https://arxiv.org/pdf/2412.09122
Copy Paste: [[2412.09122]] LVMark: Robust Watermark for latent video diffusion models(https://arxiv.org/abs/2412.09122)
Keywords: protect, attack, robust, watermark, diffusion, generative
Abstract: Rapid advancements in generative models have made it possible to create hyper-realistic videos. As their applicability increases, their unauthorized use has raised significant concerns, leading to the growing demand for techniques to protect the ownership of the generative model itself. While existing watermarking methods effectively embed watermarks into image-generative models, they fail to account for temporal information, resulting in poor performance when applied to video-generative models. To address this issue, we introduce a novel watermarking method called LVMark, which embeds watermarks into video diffusion models. A key component of LVMark is a selective weight modulation strategy that efficiently embeds watermark messages into the video diffusion model while preserving the quality of the generated videos. To accurately decode messages in the presence of malicious attacks, we design a watermark decoder that leverages spatio-temporal information in the 3D wavelet domain through a cross-attention module. To the best of our knowledge, our approach is the first to highlight the potential of video-generative model watermarking as a valuable tool for enhancing the effectiveness of ownership protection in video-generative models.

Title: Evaluating the Potential of In-Memory Processing to Accelerate Homomorphic Encryption

Authors: Mpoki Mwaisela, Joel Hari, Peterson Yuhala, Jämes Ménétrey, Pascal Felber, Valerio Schiavoni
Subjects: cs.CR, cs.AR, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2412.09144
Pdf URL: https://arxiv.org/pdf/2412.09144
Copy Paste: [[2412.09144]] Evaluating the Potential of In-Memory Processing to Accelerate Homomorphic Encryption(https://arxiv.org/abs/2412.09144)
Keywords: security, privacy
Abstract: The widespread adoption of cloud-based solutions introduces privacy and security concerns. Techniques such as homomorphic encryption (HE) mitigate this problem by allowing computation over encrypted data without the need for decryption. However, the high computational and memory overhead associated with the underlying cryptographic operations has hindered the practicality of HE-based solutions. While a significant amount of research has focused on reducing computational overhead by utilizing hardware accelerators like GPUs and FPGAs, there has been relatively little emphasis on addressing HE memory overhead. Processing in-memory (PIM) presents a promising solution to this problem by bringing computation closer to data, thereby reducing the overhead resulting from processor-memory data movements. In this work, we evaluate the potential of a PIM architecture from UPMEM for accelerating HE operations. Firstly, we focus on PIM-based acceleration for polynomial operations, which underpin HE algorithms. Subsequently, we conduct a case study analysis by integrating PIM into two popular and open-source HE libraries, OpenFHE and HElib. Our study concludes with key findings and takeaways gained from the practical application of HE operations using PIM, providing valuable insights for those interested in adopting this technology.

Title: Evaluating Adversarial Attacks on Traffic Sign Classifiers beyond Standard Baselines

Authors: Svetlana Pavlitska, Leopold Müller, J. Marius Zöllner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09150
Pdf URL: https://arxiv.org/pdf/2412.09150
Copy Paste: [[2412.09150]] Evaluating Adversarial Attacks on Traffic Sign Classifiers beyond Standard Baselines(https://arxiv.org/abs/2412.09150)
Keywords: attack, fair
Abstract: Adversarial attacks on traffic sign classification models were among the first successfully tried in the real world. Since then, the research in this area has been mainly restricted to repeating baseline models, such as LISA-CNN or GTSRB-CNN, and similar experiment settings, including white and black patches on traffic signs. In this work, we decouple model architectures from the datasets and evaluate on further generic models to make a fair comparison. Furthermore, we compare two attack settings, inconspicuous and visible, which are usually regarded without direct comparison. Our results show that standard baselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than the generic ones. We, therefore, suggest evaluating new attacks on a broader spectrum of baselines in the future. Our code is available at \url{this https URL}.

Title: When Text Embedding Meets Large Language Model: A Comprehensive Survey

Authors: Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, Richong Zhang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.09165
Pdf URL: https://arxiv.org/pdf/2412.09165
Copy Paste: [[2412.09165]] When Text Embedding Meets Large Language Model: A Comprehensive Survey(https://arxiv.org/abs/2412.09165)
Keywords: robust, generative, large language model
Abstract: Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled using generative paradigms and leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications, such as semantic matching, clustering, and information retrieval, continue to rely on text embeddings for their efficiency and effectiveness. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders, utilizing their innate capabilities for embedding generation; and (3) Text embedding understanding with LLMs, leveraging LLMs to analyze and interpret embeddings. By organizing these efforts based on interaction patterns rather than specific downstream applications, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persisted in the pre-LLM era with pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.

Title: DECOR:Decomposition and Projection of Text Embeddings for Text-to-Image Customization

Authors: Geonhui Jang, Jin-Hwa Kim, Yong-Hyun Park, Junho Kim, Gayoung Lee, Yonghyun Jeong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09169
Pdf URL: https://arxiv.org/pdf/2412.09169
Copy Paste: [[2412.09169]] DECOR:Decomposition and Projection of Text Embeddings for Text-to-Image Customization(https://arxiv.org/abs/2412.09169)
Keywords: diffusion
Abstract: Text-to-image (T2I) models can effectively capture the content or style of reference images to perform high-quality customization. A representative technique for this is fine-tuning using low-rank adaptations (LoRA), which enables efficient model customization with reference images. However, fine-tuning with a limited number of reference images often leads to overfitting, resulting in issues such as prompt misalignment or content leakage. These issues prevent the model from accurately following the input prompt or generating undesired objects during inference. To address this problem, we examine the text embeddings that guide the diffusion model during inference. This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry and identify the cause of overfitting. Based on this, we propose DECOR, which projects text embeddings onto a vector space orthogonal to undesired token vectors, thereby reducing the influence of unwanted semantics in the text embeddings. Experimental results demonstrate that DECOR outperforms state-of-the-art customization models and achieves Pareto frontier performance across text and visual alignment evaluation metrics. Furthermore, it generates images more faithful to the input prompts, showcasing its effectiveness in addressing overfitting and enhancing text-to-image customization.

Title: ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks

Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Haoyu Wen, Wei Su, Boao Qian, Yuhang Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09173
Pdf URL: https://arxiv.org/pdf/2412.09173
Copy Paste: [[2412.09173]] ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks(https://arxiv.org/abs/2412.09173)
Keywords: interpretability, large language model, segmentation
Abstract: Following formatting instructions to generate well-structured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench involves a greater variety of tasks in terms of application scenes (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench is attached with a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in original LLaMA3 to 95.0% on caption segmentation task), while keep the general quality comparable (e.g., from 47.3 to 46.4 in F1 scores). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 scores). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.

Title: On the effectiveness of Rotation-Equivariance in U-Net: A Benchmark for Image Segmentation

Authors: Robin Ghyselinck, Valentin Delchevalerie, Bruno Dumas, Benoît Frénay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09182
Pdf URL: https://arxiv.org/pdf/2412.09182
Copy Paste: [[2412.09182]] On the effectiveness of Rotation-Equivariance in U-Net: A Benchmark for Image Segmentation(https://arxiv.org/abs/2412.09182)
Keywords: segmentation
Abstract: Numerous studies have recently focused on incorporating different variations of equivariance in Convolutional Neural Networks (CNNs). In particular, rotation-equivariance has gathered significant attention due to its relevance in many applications related to medical imaging, microscopic imaging, satellite imaging, industrial tasks, etc. While prior research has primarily focused on enhancing classification tasks with rotation equivariant CNNs, their impact on more complex architectures, such as U-Net for image segmentation, remains scarcely explored. Indeed, previous work interested in integrating rotation-equivariance into U-Net architecture have focused on solving specific applications with a limited scope. In contrast, this paper aims to provide a more exhaustive evaluation of rotation equivariant U-Net for image segmentation across a broader range of tasks. We benchmark their effectiveness against standard U-Net architectures, assessing improvements in terms of performance and sustainability (i.e., computational cost). Our evaluation focuses on datasets whose orientation of objects of interest is arbitrary in the image (e.g., Kvasir-SEG), but also on more standard segmentation datasets (such as COCO-Stuff) as to explore the wider applicability of rotation equivariance beyond tasks undoubtedly concerned by rotation equivariance. The main contribution of this work is to provide insights into the trade-offs and advantages of integrating rotation equivariance for segmentation tasks.

Title: RAD: Region-Aware Diffusion Models for Image Inpainting

Authors: Sora Kim, Sungho Suh, Minsik Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09191
Pdf URL: https://arxiv.org/pdf/2412.09191
Copy Paste: [[2412.09191]] RAD: Region-Aware Diffusion Models for Image Inpainting(https://arxiv.org/abs/2412.09191)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in image generation, with applications broadening across various domains. Inpainting is one such application that can benefit significantly from diffusion models. Existing methods either hijack the reverse process of a pretrained diffusion model or cast the problem into a larger framework, \ie, conditioned generation. However, these approaches often require nested loops in the generation process or additional components for conditioning. In this paper, we present region-aware diffusion models (RAD) for inpainting with a simple yet effective reformulation of the vanilla diffusion models. RAD utilizes a different noise schedule for each pixel, which allows local regions to be generated asynchronously while considering the global image context. A plain reverse process requires no additional components, enabling RAD to achieve inference time up to 100 times faster than the state-of-the-art approaches. Moreover, we employ low-rank adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models, reducing computational burdens in training as well. Experiments demonstrated that RAD provides state-of-the-art results both qualitatively and quantitatively, on the FFHQ, LSUN Bedroom, and ImageNet datasets.

Title: ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local Motion Deblurring

Authors: Zhongbao Yang, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09193
Pdf URL: https://arxiv.org/pdf/2412.09193
Copy Paste: [[2412.09193]] ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local Motion Deblurring(https://arxiv.org/abs/2412.09193)
Keywords: diffusion
Abstract: Removing blur caused by moving objects is challenging, as the moving objects are usually significantly blurry while the static background remains clear. Existing methods that rely on local blur detection often suffer from inaccuracies and cannot generate satisfactory results when focusing solely on blurred regions. To overcome these problems, we first design a context-based local blur detection module that incorporates additional contextual information to improve the identification of blurry regions. Considering that modern smartphones are equipped with cameras capable of providing short-exposure images, we develop a blur-aware guided image restoration method that utilizes sharp structural details from short-exposure images, facilitating accurate reconstruction of heavily blurred regions. Furthermore, to restore images realistically and visually-pleasant, we develop a short-exposure guided diffusion model that explores useful features from short-exposure images and blurred regions to better constrain the diffusion process. Finally, we formulate the above components into a simple yet effective network, named ExpRDiff. Experimental results show that ExpRDiff performs favorably against state-of-the-art methods.

Title: MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place Recognition

Authors: Qiwen Gu, Xufei Wang, Fenglin Zhang, Junqiao Zhao, Siyue Tao, Chen Ye, Tiantian Feng, Changjun Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09199
Pdf URL: https://arxiv.org/pdf/2412.09199
Copy Paste: [[2412.09199]] MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place Recognition(https://arxiv.org/abs/2412.09199)
Keywords: robust
Abstract: Visual Place Recognition (VPR) aims to robustly identify locations by leveraging image retrieval based on descriptors encoded from environmental images. However, drastic appearance changes of images captured from different viewpoints at the same location pose incoherent supervision signals for descriptor learning, which severely hinder the performance of VPR. Previous work proposes classifying images based on manually defined rules or ground truth labels for viewpoints, followed by descriptor training based on the classification results. However, not all datasets have ground truth labels of viewpoints and manually defined rules may be suboptimal, leading to degraded descriptor this http URL address these challenges, we introduce the mutual learning of viewpoint self-classification and VPR. Starting from coarse classification based on geographical coordinates, we progress to finer classification of viewpoints using simple clustering techniques. The dataset is partitioned in an unsupervised manner while simultaneously training a descriptor extractor for place recognition. Experimental results show that this approach almost perfectly partitions the dataset based on viewpoints, thus achieving mutually reinforcing effects. Our method even excels state-of-the-art (SOTA) methods that partition datasets using ground truth labels.

Title: CleanComedy: Creating Friendly Humor through Generative Techniques

Authors: Dmitry Vikhorev, Daria Galimzianova, Svetlana Gorovaia, Elizaveta Zhemchuzhina, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09203
Pdf URL: https://arxiv.org/pdf/2412.09203
Copy Paste: [[2412.09203]] CleanComedy: Creating Friendly Humor through Generative Techniques(https://arxiv.org/abs/2412.09203)
Keywords: robust, generative
Abstract: Humor generation is a challenging task in natural language processing due to limited resources and the quality of existing datasets. Available humor language resources often suffer from toxicity and duplication, limiting their effectiveness for training robust models. This paper proposes CleanComedy, a specialized, partially annotated toxicity-filtered corpus of English and Russian jokes collected from various sources. We study the effectiveness of our data filtering approach through a survey on humor and toxicity levels in various joke groups. In addition, we study advances in computer humor generation by comparing jokes written by humans with various groups of generative jokes, including our baseline models trained on the CleanComedy datasets.

Title: Enhancing Implicit Neural Representations via Symmetric Power Transformation

Authors: Weixiang Zhang, Shuzhao Xie, Chengwei Ren, Shijia Ge, Mingzi Wang, Zhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09213
Pdf URL: https://arxiv.org/pdf/2412.09213
Copy Paste: [[2412.09213]] Enhancing Implicit Neural Representations via Symmetric Power Transformation(https://arxiv.org/abs/2412.09213)
Keywords: robust
Abstract: We propose symmetric power transformation to enhance the capacity of Implicit Neural Representation~(INR) from the perspective of data transformation. Unlike prior work utilizing random permutation or index rearrangement, our method features a reversible operation that does not require additional storage consumption. Specifically, we first investigate the characteristics of data that can benefit the training of INR, proposing the Range-Defined Symmetric Hypothesis, which posits that specific range and symmetry can improve the expressive ability of INR. Based on this hypothesis, we propose a nonlinear symmetric power transformation to achieve both range-defined and symmetric properties simultaneously. We use the power coefficient to redistribute data to approximate symmetry within the target range. To improve the robustness of the transformation, we further design deviation-aware calibration and adaptive soft boundary to address issues of extreme deviation boosting and continuity breaking. Extensive experiments are conducted to verify the performance of the proposed method, demonstrating that our transformation can reliably improve INR compared with other data transformations. We also conduct 1D audio, 2D image and 3D video fitting tasks to demonstrate the effectiveness and applicability of our method.

Title: USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation

Authors: Wanjiang Weng, Hongsong Wang, Junbo He, Lei He, Guosen Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09220
Pdf URL: https://arxiv.org/pdf/2412.09220
Copy Paste: [[2412.09220]] USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation(https://arxiv.org/abs/2412.09220)
Keywords: extraction
Abstract: Contrastive learning has achieved great success in skeleton-based representation learning recently. However, the prevailing methods are predominantly negative-based, necessitating additional momentum encoder and memory bank to get negative samples, which increases the difficulty of model training. Furthermore, these methods primarily concentrate on learning a global representation for recognition and retrieval tasks, while overlooking the rich and detailed local representations that are crucial for dense prediction tasks. To alleviate these issues, we introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation, called USDRL, which employs feature decorrelation across temporal, spatial, and instance domains in a multi-grained manner to reduce redundancy among dimensions of the representations to maximize information extraction from features. Additionally, we design a Dense Spatio-Temporal Encoder (DSTE) to capture fine-grained action representations effectively, thereby enhancing the performance of dense prediction tasks. Comprehensive experiments, conducted on the benchmarks NTU-60, NTU-120, PKU-MMD I, and PKU-MMD II, across diverse downstream tasks including action recognition, action retrieval, and action detection, conclusively demonstrate that our approach significantly outperforms the current state-of-the-art (SOTA) approaches. Our code and models are available at this https URL.

Title: Building a Privacy Web with SPIDEr -- Secure Pipeline for Information De-Identification with End-to-End Encryption

Authors: Novoneel Chakraborty, Anshoo Tandon, Kailash Reddy, Kaushal Kirpekar, Bryan Paul Robert, Hari Dilip Kumar, Abhilash Venkatesh, Abhay Sharma
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2412.09222
Pdf URL: https://arxiv.org/pdf/2412.09222
Copy Paste: [[2412.09222]] Building a Privacy Web with SPIDEr -- Secure Pipeline for Information De-Identification with End-to-End Encryption(https://arxiv.org/abs/2412.09222)
Keywords: secure, privacy
Abstract: Data de-identification makes it possible to glean insights from data while preserving user privacy. The use of Trusted Execution Environments (TEEs) allow for the execution of de-identification applications on the cloud without the need for a user to trust the third-party application provider. In this paper, we present \textit{SPIDEr - Secure Pipeline for Information De-Identification with End-to-End Encryption}, our implementation of an end-to-end encrypted data de-identification pipeline. SPIDEr supports classical anonymisation techniques such as suppression, pseudonymisation, generalisation, and aggregation, as well as techniques that offer a formal privacy guarantee such as k-anonymisation and differential privacy. To enable scalability and improve performance on constrained TEE hardware, we enable batch processing of data for differential privacy computations. We present our design of the control flows for end-to-end secure execution of de-identification operations within a TEE. As part of the control flow for running SPIDEr within the TEE, we perform attestation, a process that verifies that the software binaries were properly instantiated on a known, trusted platform.

Title: DASK: Distribution Rehearsing via Adaptive Style Kernel Learning for Exemplar-Free Lifelong Person Re-Identification

Authors: Kunlun Xu, Chenghao Jiang, Peixi Xiong, Yuxin Peng, Jiahuan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09224
Pdf URL: https://arxiv.org/pdf/2412.09224
Copy Paste: [[2412.09224]] DASK: Distribution Rehearsing via Adaptive Style Kernel Learning for Exemplar-Free Lifelong Person Re-Identification(https://arxiv.org/abs/2412.09224)
Keywords: privacy
Abstract: Lifelong person re-identification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation during the new data learning, possessing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning mechanism that learns to transform arbitrary distribution data into the current data style at each learning step. To enhance the style transfer capacity of DRL, an Adaptive Kernel Prediction network is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses old distribution based on the new data via the old AKPNet model, achieving effective new-old knowledge accumulation under a joint knowledge consolidation scheme. Experimental results show our DASK outperforms the existing methods by 3.6%-6.8% and 4.5%-6.5% on anti-forgetting and generalization capacity, respectively. Our code is available at this https URL

Title: Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering

Authors: Sai Bhargav Rongali, Mohamad Hassan N C, Ankit Jha, Neha Bhargava, Saurabh Prasad, Biplab Banerjee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09230
Pdf URL: https://arxiv.org/pdf/2412.09230
Copy Paste: [[2412.09230]] Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering(https://arxiv.org/abs/2412.09230)
Keywords: transformer
Abstract: This paper tackles the intricate challenge of video question-answering (VideoQA). Despite notable progress, current methods fall short of effectively integrating questions with video frames and semantic object-level abstractions to create question-aware video representations. We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate multi-modal knowledge better and emphasize semantic visual concepts relevant to specific questions. LGQAVE moves beyond traditional ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely identifies the most relevant frames concerning the questions. It captures the dynamics of objects within these frames using distinct graphs, grounding them in question semantics with the miniGPT model. These graphs are processed by a question-aware dynamic graph transformer (Q-DGT), which refines the outputs to develop nuanced global and local video representations. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers. Extensive evaluations across multiple benchmarks demonstrate that LGQAVE significantly outperforms existing models in delivering accurate multi-choice and open-ended answers.

Title: Uplift modeling with continuous treatments: A predict-then-optimize approach

Authors: Simon De Vos, Christopher Bockel-Rickermann, Stefan Lessmann, Wouter Verbeke
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09232
Pdf URL: https://arxiv.org/pdf/2412.09232
Copy Paste: [[2412.09232]] Uplift modeling with continuous treatments: A predict-then-optimize approach(https://arxiv.org/abs/2412.09232)
Keywords: fair
Abstract: The goal of uplift modeling is to recommend actions that optimize specific outcomes by determining which entities should receive treatment. One common approach involves two steps: first, an inference step that estimates conditional average treatment effects (CATEs), and second, an optimization step that ranks entities based on their CATE values and assigns treatment to the top k within a given budget. While uplift modeling typically focuses on binary treatments, many real-world applications are characterized by continuous-valued treatments, i.e., a treatment dose. This paper presents a predict-then-optimize framework to allow for continuous treatments in uplift modeling. First, in the inference step, conditional average dose responses (CADRs) are estimated from data using causal machine learning techniques. Second, in the optimization step, we frame the assignment task of continuous treatments as a dose-allocation problem and solve it using integer linear programming (ILP). This approach allows decision-makers to efficiently and effectively allocate treatment doses while balancing resource availability, with the possibility of adding extra constraints like fairness considerations or adapting the objective function to take into account instance-dependent costs and benefits to maximize utility. The experiments compare several CADR estimators and illustrate the trade-offs between policy value and fairness, as well as the impact of an adapted objective function. This showcases the framework's advantages and flexibility across diverse applications in healthcare, lending, and human resource management. All code is available on this http URL.

Title: VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation

Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09240
Pdf URL: https://arxiv.org/pdf/2412.09240
Copy Paste: [[2412.09240]] VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation(https://arxiv.org/abs/2412.09240)
Keywords: robust, segmentation
Abstract: Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine-grained concepts, while synthetic data-based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these enhancements into a UDA framework by employing distillation to stabilize training and cross-domain mixed sampling to boost adaptability without compromising generalization. The resulting UDA-FROVSS framework is the first UDA approach to effectively adapt across domains without requiring shared categories.

Title: Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative LLMs

Authors: Asli Umay Ozturk, Recep Firat Cekinel, Asli Umay Ozturk
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09247
Pdf URL: https://arxiv.org/pdf/2412.09247
Copy Paste: [[2412.09247]] Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative LLMs(https://arxiv.org/abs/2412.09247)
Keywords: robust, explainability, generative, large language model
Abstract: Satire detection is essential for accurately extracting opinions from textual data and combating misinformation online. However, the lack of diverse corpora for satire leads to the problem of stylistic bias which impacts the models' detection performances. This study proposes a debiasing approach for satire detection, focusing on reducing biases in training data by utilizing generative large language models. The approach is evaluated in both cross-domain (irony detection) and cross-lingual (English) settings. Results show that the debiasing method enhances the robustness and generalizability of the models for satire and irony detection tasks in Turkish and English. However, its impact on causal language models, such as Llama-3.1, is limited. Additionally, this work curates and presents the Turkish Satirical News Dataset with detailed human annotations, with case studies on classification, debiasing, and explainability.

Title: GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Authors: Abdessalam Ed-dib, Zhanibek Datbayev, Amine Mohamed Aboussalah
Subjects: cs.LG, math.GT, stat.ML
Abstract URL: https://arxiv.org/abs/2412.09250
Pdf URL: https://arxiv.org/pdf/2412.09250
Copy Paste: [[2412.09250]] GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning(https://arxiv.org/abs/2412.09250)
Keywords: large language model
Abstract: Fine-tuning large language models (LLMs) is computationally intensive because it requires updating all parameters. Low-Rank Adaptation (LoRA) improves efficiency by modifying only a subset of weights but introduces a trade-off between expressivity and computational cost: lower ranks reduce resources but limit expressiveness, while higher ranks enhance expressivity at increased cost. Despite recent advances in adaptive LoRA techniques, existing methods fail to provide a theoretical basis for optimizing the trade-off between model performance and efficiency. We propose Geometric Low-Rank Adaptation (GeLoRA), a novel framework that computes the intrinsic dimensionality of hidden state representations to adaptively select LoRA ranks. We demonstrate that the intrinsic dimension provides a lower bound for the optimal rank of LoRA matrices, allowing for a principled selection that balances efficiency and expressivity. GeLoRA dynamically adjusts the rank for each layer based on the intrinsic dimensionality of its input and output representations, recognizing that not all model parameters equally impact fine-tuning. Empirical validation on multiple tasks shows that GeLoRA consistently outperforms recent baselines within the same parameter budget.

Title: When Can Memorization Improve Fairness?

Authors: Bob Pepin, Christian Igel, Raghavendra Selvan
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2412.09254
Pdf URL: https://arxiv.org/pdf/2412.09254
Copy Paste: [[2412.09254]] When Can Memorization Improve Fairness?(https://arxiv.org/abs/2412.09254)
Keywords: fair
Abstract: We study to which extent additive fairness metrics (statistical parity, equal opportunity and equalized odds) can be influenced in a multi-class classification problem by memorizing a subset of the population. We give explicit expressions for the bias resulting from memorization in terms of the label and group membership distribution of the memorized dataset and the classifier bias on the unmemorized dataset. We also characterize the memorized datasets that eliminate the bias for all three metrics considered. Finally we provide upper and lower bounds on the total probability mass in the memorized dataset that is necessary for the complete elimination of these biases.

Title: FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection

Authors: Ke Li, Di Wang, Zhangyuan Hu, Shaofeng Li, Weiping Ni, Lin Zhao, Quan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09258
Pdf URL: https://arxiv.org/pdf/2412.09258
Copy Paste: [[2412.09258]] FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection(https://arxiv.org/abs/2412.09258)
Keywords: extraction
Abstract: Infrared-visible object detection (IVOD) seeks to harness the complementary information in infrared and visible images, thereby enhancing the performance of detectors in complex environments. However, existing methods often neglect the frequency characteristics of complementary information, such as the abundant high-frequency details in visible images and the valuable low-frequency thermal information in infrared images, thus constraining detection performance. To solve this problem, we introduce a novel Frequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which effectively captures the unique frequency representations of complementary information across multimodal visual spaces. Specifically, we propose a feature decomposition encoder, wherein the high-frequency unit (HFU) utilizes discrete cosine transform to capture representative high-frequency features, while the low-frequency unit (LFU) employs dynamic receptive fields to model the multi-scale context of diverse objects. Next, we adopt a parameter-free complementary strengths strategy to enhance multimodal features through seamless inter-frequency recoupling. Furthermore, we innovatively design a multimodal reconstruction mechanism that recovers image details lost during feature extraction, further leveraging the complementary information from infrared and visible images to enhance overall representational capacity. Extensive experiments demonstrate that FD2-Net outperforms state-of-the-art (SOTA) models across various IVOD benchmarks, i.e. LLVIP (96.2% mAP), FLIR (82.9% mAP), and M3FD (83.5% mAP).

Title: Multi-client Functional Encryption for Set Intersection with Non-monotonic Access Structures in Federated Learning

Authors: Ruyuan Zhang, Jinguang Han
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.09259
Pdf URL: https://arxiv.org/pdf/2412.09259
Copy Paste: [[2412.09259]] Multi-client Functional Encryption for Set Intersection with Non-monotonic Access Structures in Federated Learning(https://arxiv.org/abs/2412.09259)
Keywords: security, attack, federate
Abstract: Federated learning (FL) based on cloud servers is a distributed machine learning framework that involves an aggregator and multiple clients, which allows multiple clients to collaborate in training a shared model without exchanging data. Considering the confidentiality of training data, several schemes employing functional encryption (FE) have been presented. However, existing schemes cannot express complex access control policies. In this paper, to realize more flexible and fine-grained access control, we propose a multi-client functional encryption scheme for set intersection with non-monotonic access structures (MCFE-SI-NAS), where multiple clients co-exist and encrypt independently without interaction. All ciphertexts are associated with an label, which can resist "mix-and-match" attacks. Aggregator can aggregate ciphertexts, but cannot know anything about the plaintexts. We first formalize the definition and security model for the MCFE-SI-NAS scheme and build a concrete construction based on asymmetric prime-order pairings. The security of our scheme is formally proven. Finally, we implement our MCFE-SI-NAS scheme and provide its efficiency analysis.

Title: LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

Authors: Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, Weiwei Xing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09262
Pdf URL: https://arxiv.org/pdf/2412.09262
Copy Paste: [[2412.09262]] LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync(https://arxiv.org/abs/2412.09262)
Keywords: diffusion
Abstract: We present LatentSync, an end-to-end lip sync framework based on audio conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that the diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames. Furthermore, we observe the commonly encountered SyncNet convergence issue and conduct comprehensive empirical studies, identifying key factors affecting SyncNet convergence in terms of model architecture, training hyperparameters, and data preprocessing methods. We significantly improve the accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet. Based on the above innovations, our method outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.

Title: Towards Understanding the Robustness of LLM-based Evaluations under Perturbations

Authors: Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09269
Pdf URL: https://arxiv.org/pdf/2412.09269
Copy Paste: [[2412.09269]] Towards Understanding the Robustness of LLM-based Evaluations under Perturbations(https://arxiv.org/abs/2412.09269)
Keywords: robust, large language model
Abstract: Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations and significant improvements are required for their standalone use as reliable evaluators for subjective metrics.

Title: Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Authors: Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, Yehui Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09278
Pdf URL: https://arxiv.org/pdf/2412.09278
Copy Paste: [[2412.09278]] Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine(https://arxiv.org/abs/2412.09278)
Keywords: large language model
Abstract: In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at this https URL.

Title: Learning to Solve Domain-Specific Calculation Problems with Knowledge-Intensive Programs Generator

Authors: Chengyuan Liu, Shihang Wang, Lizhi Qing, Jun Lin, Ji Zhang, Fei Wu, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09280
Pdf URL: https://arxiv.org/pdf/2412.09280
Copy Paste: [[2412.09280]] Learning to Solve Domain-Specific Calculation Problems with Knowledge-Intensive Programs Generator(https://arxiv.org/abs/2412.09280)
Keywords: large language model
Abstract: Domain Large Language Models (LLMs) are developed for domain-specific tasks based on general LLMs. But it still requires professional knowledge to facilitate the expertise for some domain-specific tasks. In this paper, we investigate into knowledge-intensive calculation problems. We find that the math problems to be challenging for LLMs, when involving complex domain-specific rules and knowledge documents, rather than simple formulations of terminologies. Therefore, we propose a pipeline to solve the domain-specific calculation problems with Knowledge-Intensive Programs Generator more effectively, named as KIPG. It generates knowledge-intensive programs according to the domain-specific documents. For each query, key variables are extracted, then outcomes which are dependent on domain knowledge are calculated with the programs. By iterative preference alignment, the code generator learns to improve the logic consistency with the domain knowledge. Taking legal domain as an example, we have conducted experiments to prove the effectiveness of our pipeline, and extensive analysis on the modules. We also find that the code generator is also adaptable to other domains, without training on the new knowledge.

Title: CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs

Authors: Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09282
Pdf URL: https://arxiv.org/pdf/2412.09282
Copy Paste: [[2412.09282]] CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs(https://arxiv.org/abs/2412.09282)
Keywords: large language model
Abstract: Powerful large language models (LLMs) are increasingly expected to be deployed with lower computational costs, enabling their capabilities on resource-constrained devices. Post-training quantization (PTQ) has emerged as a star approach to achieve this ambition, with best methods compressing weights to less than 2 bit on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only minimal additional bits. This state-of-the-art extreme compression method achieves its results through two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging multiple codebooks to relax the constraint of critical channels. With our method, we demonstrate a 38.9% improvement over the current strongest sub-2-bit PTQ baseline, enabling nearer lossless 1-bit compression. Furthermore, our approach offers flexible customization of quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.

Title: Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices

Authors: Thanaphon Suwannaphong, Ferdian Jovan, Ian Craddock, Ryan McConville
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.09289
Pdf URL: https://arxiv.org/pdf/2412.09289
Copy Paste: [[2412.09289]] Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices(https://arxiv.org/abs/2412.09289)
Keywords: privacy, transformer
Abstract: This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices, specifically for on-device indoor localisation. Typical approaches for indoor localisation rely on centralised remote processing of data transmitted from lower powered devices such as wearables. However, there are several benefits for moving this to the edge device itself, including increased battery life, enhanced privacy, reduced latency and lowered operational costs, all of which are key for common applications such as health monitoring. The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size while maintaining high predictive performance. We base our work on a large state-of-the-art transformer-based model and seek to deploy it within low-power MCUs. We also propose a state-space-based architecture using Mamba as a more compact alternative to the transformer. Our results show that the quantized transformer model performs well within a 64 KB RAM constraint, achieving an effective balance between model size and localisation precision. Additionally, the compact Mamba model has strong performance under even tighter constraints, such as a 32 KB of RAM, without the need for model compression, making it a viable option for more resource-limited environments. We demonstrate that, through our framework, it is feasible to deploy advanced indoor localisation models onto low-power MCUs with restricted memory limitations. The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring by providing accurate, real-time location data while minimizing power consumption, increasing data privacy, improving latency and reducing infrastructure costs.

Title: Transfer Learning of RSSI to Improve Indoor Localisation Performance

Authors: Thanaphon Suwannaphong, Ryan McConville, Ian Craddock
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09292
Pdf URL: https://arxiv.org/pdf/2412.09292
Copy Paste: [[2412.09292]] Transfer Learning of RSSI to Improve Indoor Localisation Performance(https://arxiv.org/abs/2412.09292)
Keywords: robust, generative
Abstract: With the growing demand for health monitoring systems, in-home localisation is essential for tracking patient conditions. The unique spatial characteristics of each house required annotated data for Bluetooth Low Energy (BLE) Received Signal Strength Indicator (RSSI)-based monitoring system. However, collecting annotated training data is time-consuming, particularly for patients with limited health conditions. To address this, we propose Conditional Generative Adversarial Networks (ConGAN)-based augmentation, combined with our transfer learning framework (T-ConGAN), to enable the transfer of generic RSSI information between different homes, even when data is collected using different experimental protocols. This enhances the performance and scalability of such intelligent systems by reducing the need for annotation in each home. We are the first to demonstrate that BLE RSSI data can be shared across different homes, and that shared information can improve the indoor localisation performance. Our T-ConGAN enhances the macro F1 score of room-level indoor localisation by up to 12.2%, with a remarkable 51% improvement in challenging areas such as stairways or outside spaces. This state-of-the-art RSSI augmentation model significantly enhances the robustness of in-home health monitoring systems.

Title: GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression

Authors: Ziqi Zhou, Weize Quan, Hailin Shi, Wei Li, Lili Wang, Dong-ming Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09296
Pdf URL: https://arxiv.org/pdf/2412.09296
Copy Paste: [[2412.09296]] GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression(https://arxiv.org/abs/2412.09296)
Keywords: robust, diffusion
Abstract: Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.

Title: Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation

Authors: Davor Vukadin, Petar Afrić, Marin Šilić, Goran Delač
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09311
Pdf URL: https://arxiv.org/pdf/2412.09311
Copy Paste: [[2412.09311]] Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation(https://arxiv.org/abs/2412.09311)
Keywords: robust, explainability, transformer
Abstract: Recent advancement in deep-neural network performance led to the development of new state-of-the-art approaches in numerous areas. However, the black-box nature of neural networks often prohibits their use in areas where model explainability and model transparency are crucial. Over the years, researchers proposed many algorithms to aid neural network understanding and provide additional information to the human expert. One of the most popular methods being Layer-Wise Relevance Propagation (LRP). This method assigns local relevance based on the pixel-wise decomposition of nonlinear classifiers. With the rise of attribution method research, there has emerged a pressing need to assess and evaluate their performance. Numerous metrics have been proposed, each assessing an individual property of attribution methods such as faithfulness, robustness or localization. Unfortunately, no single metric is deemed optimal for every case, and researchers often use several metrics to test the quality of the attribution maps. In this work, we address the shortcomings of the current LRP formulations and introduce a novel method for determining the relevance of input neurons through layer-wise relevance propagation. Furthermore, we apply this approach to the recently developed Vision Transformer architecture and evaluate its performance against existing methods on two image classification datasets, namely ImageNet and PascalVOC. Our results clearly demonstrate the advantage of our proposed method. Furthermore, we discuss the insufficiencies of current evaluation metrics for attribution-based explainability and propose a new evaluation metric that combines the notions of faithfulness, robustness and contrastiveness. We utilize this new metric to evaluate the performance of various attribution-based methods. Our code is available at: this https URL

Title: FAMNet: Frequency-aware Matching Network for Cross-domain Few-shot Medical Image Segmentation

Authors: Yuntian Bo, Yazhou Zhu, Lunbo Li, Haofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09319
Pdf URL: https://arxiv.org/pdf/2412.09319
Copy Paste: [[2412.09319]] FAMNet: Frequency-aware Matching Network for Cross-domain Few-shot Medical Image Segmentation(https://arxiv.org/abs/2412.09319)
Keywords: segmentation
Abstract: Existing few-shot medical image segmentation (FSMIS) models fail to address a practical issue in medical imaging: the domain shift caused by different imaging techniques, which limits the applicability to current FSMIS tasks. To overcome this limitation, we focus on the cross-domain few-shot medical image segmentation (CD-FSMIS) task, aiming to develop a generalized model capable of adapting to a broader range of medical image segmentation scenarios with limited labeled data from the novel target domain. Inspired by the characteristics of frequency domain similarity across different domains, we propose a Frequency-aware Matching Network (FAMNet), which includes two key components: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion (MSF) module. The FAM module tackles two problems during the meta-learning phase: 1) intra-domain variance caused by the inherent support-query bias, due to the different appearances of organs and lesions, and 2) inter-domain variance caused by different medical imaging techniques. Additionally, we design an MSF module to integrate the different frequency features decoupled by the FAM module, and further mitigate the impact of inter-domain variance on the model's segmentation performance. Combining these two modules, our FAMNet surpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation models on three cross-domain datasets, achieving state-of-the-art performance in the CD-FSMIS task.

Title: Are Conditional Latent Diffusion Models Effective for Image Restoration?

Authors: Yunchen Yuan, Junyuan Xiao, Xinjie Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09324
Pdf URL: https://arxiv.org/pdf/2412.09324
Copy Paste: [[2412.09324]] Are Conditional Latent Diffusion Models Effective for Image Restoration?(https://arxiv.org/abs/2412.09324)
Keywords: diffusion
Abstract: Recent advancements in image restoration increasingly employ conditional latent diffusion models (CLDMs). While these models have demonstrated notable performance improvements in recent years, this work questions their suitability for IR tasks. CLDMs excel in capturing high-level semantic correlations, making them effective for tasks like text-to-image generation with spatial conditioning. However, in IR, where the goal is to enhance image perceptual quality, these models face difficulty of modeling the relationship between degraded images and ground truth images using a low-level representation. To support our claims, we compare state-of-the-art CLDMs with traditional image restoration models through extensive experiments. Results reveal that despite the scaling advantages of CLDMs, they suffer from high distortion and semantic deviation, especially in cases with minimal degradation, where traditional methods outperform them. Additionally, we perform empirical studies to examine the impact of various CLDM design elements on their restoration performance. We hope this finding inspires a reexamination of current CLDM-based IR solutions, opening up more opportunities in this field.

Title: Auto-Regressive Moving Diffusion Models for Time Series Forecasting

Authors: Jiaxin Gao, Qinglong Cao, Yuntian Chen
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2412.09328
Pdf URL: https://arxiv.org/pdf/2412.09328
Copy Paste: [[2412.09328]] Auto-Regressive Moving Diffusion Models for Time Series Forecasting(https://arxiv.org/abs/2412.09328)
Keywords: diffusion
Abstract: Time series forecasting (TSF) is essential in various domains, and recent advancements in diffusion-based TSF models have shown considerable promise. However, these models typically adopt traditional diffusion patterns, treating TSF as a noise-based conditional generation task. This approach neglects the inherent continuous sequential nature of time series, leading to a fundamental misalignment between diffusion mechanisms and the TSF objective, thereby severely impairing performance. To bridge this misalignment, and inspired by the classic Auto-Regressive Moving Average (ARMA) theory, which views time series as continuous sequential progressions evolving from previous data points, we propose a novel Auto-Regressive Moving Diffusion (ARMD) model to first achieve the continuous sequential diffusion-based TSF. Unlike previous methods that start from white Gaussian noise, our model employs chain-based diffusion with priors, accurately modeling the evolution of time series and leveraging intermediate state information to improve forecasting accuracy and stability. Specifically, our approach reinterprets the diffusion process by considering future series as the initial state and historical series as the final state, with intermediate series generated using a sliding-based technique during the forward process. This design aligns the diffusion model's sampling procedure with the forecasting objective, resulting in an unconditional, continuous sequential diffusion TSF model. Extensive experiments conducted on seven widely used datasets demonstrate that our model achieves state-of-the-art performance, significantly outperforming existing diffusion-based TSF models. Our code is available on GitHub: this https URL.

Title: MaskTerial: A Foundation Model for Automated 2D Material Flake Detection

Authors: Jan-Lucas Uslu, Alexey Nekrasov, Alexander Hermans, Bernd Beschoten, Bastian Leibe, Lutz Waldecker, Christoph Stampfer
Subjects: cs.CV, cond-mat.mtrl-sci, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09333
Pdf URL: https://arxiv.org/pdf/2412.09333
Copy Paste: [[2412.09333]] MaskTerial: A Foundation Model for Automated 2D Material Flake Detection(https://arxiv.org/abs/2412.09333)
Keywords: segmentation
Abstract: The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often exhibit challenges in identifying low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator, that generates realistic microscopy images from unlabeled data. This results in a model that can to quickly adapt to new materials with as little as 5 to 10 images. Furthermore, an uncertainty estimation model is used to finally classify the predictions based on optical contrast. We evaluate our method on eight different datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.

Title: Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain

Authors: Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09341
Pdf URL: https://arxiv.org/pdf/2412.09341
Copy Paste: [[2412.09341]] Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain(https://arxiv.org/abs/2412.09341)
Keywords: privacy
Abstract: Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.

Title: DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

Authors: Hongxiang Li, Yaowei Li, Yuhang Yang, Junjie Cao, Zhihong Zhu, Xuxin Cheng, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09349
Pdf URL: https://arxiv.org/pdf/2412.09349
Copy Paste: [[2412.09349]] DisPose: Disentangling Pose Guidance for Controllable Human Image Animation(https://arxiv.org/abs/2412.09349)
Keywords: diffusion
Abstract: Controllable human image animation aims to generate videos from reference images using driving videos. Due to the limited control signals provided by sparse guidance (e.g., skeleton pose), recent works have attempted to introduce additional dense conditions (e.g., depth map) to ensure motion alignment. However, such strict dense guidance impairs the quality of the generated video when the body shape of the reference character differs significantly from that of the driving video. In this paper, we present DisPose to mine more generalizable and effective control signals without additional dense input, which disentangles the sparse skeleton pose in human image animation into motion field guidance and keypoint correspondence. Specifically, we generate a dense motion field from a sparse motion field and the reference image, which provides region-level dense guidance while maintaining the generalization of the sparse pose control. We also extract diffusion features corresponding to pose keypoints from the reference image, and then these point features are transferred to the target pose to provide distinct identity information. To seamlessly integrate into existing models, we propose a plug-and-play hybrid ControlNet that improves the quality and consistency of generated videos while freezing the existing model parameters. Extensive qualitative and quantitative experiments demonstrate the superiority of DisPose compared to current methods. Code: \hyperlink{this https URL}{this https URL}.

Title: Causal Graphical Models for Vision-Language Compositional Understanding

Authors: Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.09353
Pdf URL: https://arxiv.org/pdf/2412.09353
Copy Paste: [[2412.09353]] Causal Graphical Models for Vision-Language Compositional Understanding(https://arxiv.org/abs/2412.09353)
Keywords: generative
Abstract: Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Differently from standard autoregressive or parallel predictions, our decoder's generative process is partially-ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence discarding spurious correlations. Using extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all the state-of-the-art compositional approaches by a large margin, and it also improves over methods trained using much larger datasets.

Title: Word Sense Linking: Disambiguating Outside the Sandbox

Authors: Andrei Stefan Bejgu, Edoardo Barba, Luigi Procopio, Alberte Fernández-Castro, Roberto Navigli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09370
Pdf URL: https://arxiv.org/pdf/2412.09370
Copy Paste: [[2412.09370]] Word Sense Linking: Disambiguating Outside the Sandbox(https://arxiv.org/abs/2412.09370)
Keywords: transformer
Abstract: Word Sense Disambiguation (WSD) is the task of associating a word in a given context with its most suitable meaning among a set of possible candidates. While the task has recently witnessed renewed interest, with systems achieving performances above the estimated inter-annotator agreement, at the time of writing it still struggles to find downstream applications. We argue that one of the reasons behind this is the difficulty of applying WSD to plain text. Indeed, in the standard formulation, models work under the assumptions that a) all the spans to disambiguate have already been identified, and b) all the possible candidate senses of each span are provided, both of which are requirements that are far from trivial. In this work, we present a new task called Word Sense Linking (WSL) where, given an input text and a reference sense inventory, systems have to both identify which spans to disambiguate and then link them to their most suitable this http URL put forward a transformer-based architecture for the task and thoroughly evaluate both its performance and those of state-of-the-art WSD systems scaled to WSL, iteratively relaxing the assumptions of WSD. We hope that our work will foster easier integration of lexical semantics into downstream applications.

Title: A comprehensive interpretable machine learning framework for Mild Cognitive Impairment and Alzheimer's disease diagnosis

Authors: Maria Eleftheria Vlontzou, Maria Athanasiou, Kalliopi Dalakleidi, Ioanna Skampardoni, Christos Davatzikos, Konstantina Nikita
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09376
Pdf URL: https://arxiv.org/pdf/2412.09376
Copy Paste: [[2412.09376]] A comprehensive interpretable machine learning framework for Mild Cognitive Impairment and Alzheimer's disease diagnosis(https://arxiv.org/abs/2412.09376)
Keywords: robust, interpretability
Abstract: An interpretable machine learning (ML) framework is introduced to enhance the diagnosis of Mild Cognitive Impairment (MCI) and Alzheimer's disease (AD) by ensuring robustness of the ML models' interpretations. The dataset used comprises volumetric measurements from brain MRI and genetic data from healthy individuals and patients with MCI/AD, obtained through the Alzheimer's Disease Neuroimaging Initiative. The existing class imbalance is addressed by an ensemble learning approach, while various attribution-based and counterfactual-based interpretability methods are leveraged towards producing diverse explanations related to the pathophysiology of MCI/AD. A unification method combining SHAP with counterfactual explanations assesses the interpretability techniques' robustness. The best performing model yielded 87.5% balanced accuracy and 90.8% F1-score. The attribution-based interpretability methods highlighted significant volumetric and genetic features related to MCI/AD risk. The unification method provided useful insights regarding those features' necessity and sufficiency, further showcasing their significance in MCI/AD diagnosis.

Title: Diffusion Model with Representation Alignment for Protein Inverse Folding

Authors: Chenglin Wang, Yucheng Zhou, Zijie Zhai, Jianbing Shen, Kai Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09380
Pdf URL: https://arxiv.org/pdf/2412.09380
Copy Paste: [[2412.09380]] Diffusion Model with Representation Alignment for Protein Inverse Folding(https://arxiv.org/abs/2412.09380)
Keywords: diffusion
Abstract: Protein inverse folding is a fundamental problem in bioinformatics, aiming to recover the amino acid sequences from a given protein backbone structure. Despite the success of existing methods, they struggle to fully capture the intricate inter-residue relationships critical for accurate sequence prediction. We propose a novel method that leverages diffusion models with representation alignment (DMRA), which enhances diffusion-based inverse folding by (1) proposing a shared center that aggregates contextual information from the entire protein structure and selectively distributes it to each residue; and (2) aligning noisy hidden representations with clean semantic representations during the denoising process. This is achieved by predefined semantic representations for amino acid types and a representation alignment method that utilizes type embeddings as semantic feedback to normalize each residue. In experiments, we conduct extensive evaluations on the CATH4.2 dataset to demonstrate that DMRA outperforms leading methods, achieving state-of-the-art performance and exhibiting strong generalization capabilities on the TS50 and TS500 datasets.

Title: UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer

Authors: Delong Liu, Zhaohui Hou, Mingjie Zhan, Shihao Han, Zhicheng Zhao, Fei Su
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09389
Pdf URL: https://arxiv.org/pdf/2412.09389
Copy Paste: [[2412.09389]] UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer(https://arxiv.org/abs/2412.09389)
Keywords: diffusion
Abstract: Recently, diffusion-based video generation models have achieved significant success. However, existing models often suffer from issues like weak consistency and declining image quality over time. To overcome these challenges, inspired by aesthetic principles, we propose a non-invasive plug-in called Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. The UFO comprises a series of adaptive adapters with adjustable intensities, which can significantly enhance the consistency between the foreground and background of videos and improve image quality without altering the original model parameters when integrated. The training for UFO is simple, efficient, requires minimal resources, and supports stylized training. Its modular design allows for the combination of multiple UFOs, enabling the customization of personalized video generation models. Furthermore, the UFO also supports direct transferability across different models of the same specification without the need for specific retraining. The experimental results indicate that UFO effectively enhances video generation quality and demonstrates its superiority in public video generation benchmarks. The code will be publicly available at this https URL.

Title: MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease Recognition from Fundus Images

Authors: Lehan Wang, Chongchong Qi, Chubin Ou, Lin An, Mei Jin, Xiangbin Kong, Xiaomeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09402
Pdf URL: https://arxiv.org/pdf/2412.09402
Copy Paste: [[2412.09402]] MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease Recognition from Fundus Images(https://arxiv.org/abs/2412.09402)
Keywords: interpretability
Abstract: Existing multi-modal learning methods on fundus and OCT images mostly require both modalities to be available and strictly paired for training and testing, which appears less practical in clinical scenarios. To expand the scope of clinical applications, we formulate a novel setting, "OCT-enhanced disease recognition from fundus images", that allows for the use of unpaired multi-modal data during the training phase and relies on the widespread fundus photographs for testing. To benchmark this setting, we present the first large multi-modal multi-class dataset for eye disease diagnosis, MultiEYE, and propose an OCT-assisted Conceptual Distillation Approach (OCT-CoDA), which employs semantically rich concepts to extract disease-related knowledge from OCT images and leverage them into the fundus model. Specifically, we regard the image-concept relation as a link to distill useful knowledge from the OCT teacher model to the fundus student model, which considerably improves the diagnostic performance based on fundus images and formulates the cross-modal knowledge transfer into an explainable process. Through extensive experiments on the multi-disease classification task, our proposed OCT-CoDA demonstrates remarkable results and interpretability, showing great potential for clinical application. Our dataset and code are available at this https URL.

Title: Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

Authors: Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09416
Pdf URL: https://arxiv.org/pdf/2412.09416
Copy Paste: [[2412.09416]] Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors(https://arxiv.org/abs/2412.09416)
Keywords: large language model
Abstract: In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusion in the mathematical domain. We release MRBench -- a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, providing gold annotations for eight pedagogical dimensions. We assess reliability of the popular Prometheus2 LLM as an evaluator and analyze each tutor's pedagogical abilities, highlighting which LLMs are good tutors and which ones are more suitable as question-answering systems. We believe that the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track the progress in AI tutors' development.

Title: Towards Robust and Fair Vision Learning in Open-World Environments

Authors: Thanh-Dat Truong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09439
Pdf URL: https://arxiv.org/pdf/2412.09439
Copy Paste: [[2412.09439]] Towards Robust and Fair Vision Learning in Open-World Environments(https://arxiv.org/abs/2412.09439)
Keywords: robust, fair, transformer
Abstract: The dissertation presents four key contributions toward fairness and robustness in vision learning. First, to address the problem of large-scale data requirements, the dissertation presents a novel Fairness Domain Adaptation approach derived from two major novel research findings of Bijective Maximum Likelihood and Fairness Adaptation Learning. Second, to enable the capability of open-world modeling of vision learning, this dissertation presents a novel Open-world Fairness Continual Learning Framework. The success of this research direction is the result of two research lines, i.e., Fairness Continual Learning and Open-world Continual Learning. Third, since visual data are often captured from multiple camera views, robust vision learning methods should be capable of modeling invariant features across views. To achieve this desired goal, the research in this thesis will present a novel Geometry-based Cross-view Adaptation framework to learn robust feature representations across views. Finally, with the recent increase in large-scale videos and multimodal data, understanding the feature representations and improving the robustness of large-scale visual foundation models is critical. Therefore, this thesis will present novel Transformer-based approaches to improve the robust feature representations against multimodal and temporal data. Then, a novel Domain Generalization Approach will be presented to improve the robustness of visual foundation models. The research's theoretical analysis and experimental results have shown the effectiveness of the proposed approaches, demonstrating their superior performance compared to prior studies. The contributions in this dissertation have advanced the fairness and robustness of machine vision learning.

Title: MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning

Authors: Hai-Long Sun, Da-Wei Zhou, Hanbin Zhao, Le Gan, De-Chuan Zhan, Han-Jia Ye
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09441
Pdf URL: https://arxiv.org/pdf/2412.09441
Copy Paste: [[2412.09441]] MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning(https://arxiv.org/abs/2412.09441)
Keywords: robust
Abstract: Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones. Despite Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts. Existing work seeks to utilize lightweight components to adjust the PTM, while the forgetting phenomenon still comes from {\em parameter and retrieval} levels. Specifically, iterative updates of the model result in parameter drift, while mistakenly retrieving irrelevant modules leads to the mismatch during inference. To this end, we propose MOdel Surgery (MOS) to rescue the model from forgetting previous knowledge. By training task-specific adapters, we continually adjust the PTM to downstream tasks. To mitigate parameter-level forgetting, we present an adapter merging approach to learn task-specific adapters, which aims to bridge the gap between different components while reserve task-specific information. Besides, to address retrieval-level forgetting, we introduce a training-free self-refined adapter retrieval mechanism during inference, which leverages the model's inherent ability for better adapter retrieval. By jointly rectifying the model with those steps, MOS can robustly resist catastrophic forgetting in the learning process. Extensive experiments on seven benchmark datasets validate MOS's state-of-the-art performance. Code is available at: this https URL

Title: ATPrompt: Textual Prompt Learning with Embedded Attributes

Authors: Zheng Li, Yibing Song, Penghai Zhao, Ming-Ming Cheng, Xiang Li, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09442
Pdf URL: https://arxiv.org/pdf/2412.09442
Copy Paste: [[2412.09442]] ATPrompt: Textual Prompt Learning with Embedded Attributes(https://arxiv.org/abs/2412.09442)
Keywords: large language model
Abstract: Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.

Title: A Semi Black-Box Adversarial Bit-Flip Attack with Limited DNN Model Information

Authors: Behnam Ghavami, Mani Sadati, Mohammad Shahidzadeh, Lesley Shannon, Steve Wilton
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.09450
Pdf URL: https://arxiv.org/pdf/2412.09450
Copy Paste: [[2412.09450]] A Semi Black-Box Adversarial Bit-Flip Attack with Limited DNN Model Information(https://arxiv.org/abs/2412.09450)
Keywords: attack
Abstract: Despite the rising prevalence of deep neural networks (DNNs) in cyber-physical systems, their vulnerability to adversarial bit-flip attacks (BFAs) is a noteworthy concern. This paper proposes B3FA, a semi-black-box BFA-based parameter attack on DNNs, assuming the adversary has limited knowledge about the model. We consider practical scenarios often feature a more restricted threat model for real-world systems, contrasting with the typical BFA models that presuppose the adversary's full access to a network's inputs and parameters. The introduced bit-flip approach utilizes a magnitude-based ranking method and a statistical re-construction technique to identify the vulnerable bits. We demonstrate the effectiveness of B3FA on several DNN models in a semi-black-box setting. For example, B3FA could drop the accuracy of a MobileNetV2 from 69.84% to 9% with only 20 bit-flips in a real-world setting.

Title: The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Authors: Javier de la Rosa, Vladislav Mikhailov, Lemei Zhang, Freddy Wetjen, David Samuel, Peng Liu, Rolv-Arild Braaten, Petter Mæhlum, Magnus Breder Birkenes, Andrey Kutuzov, Tita Enstad, Svein Arne Brygfjeld, Jon Atle Gulla, Stephan Oepen, Erik Velldal, Wilfred Østgulen, Liljia Øvrelid, Aslak Sira Myhre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09460
Pdf URL: https://arxiv.org/pdf/2412.09460
Copy Paste: [[2412.09460]] The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective(https://arxiv.org/abs/2412.09460)
Keywords: generative, large language model
Abstract: The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

Title: OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs

Authors: Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09465
Pdf URL: https://arxiv.org/pdf/2412.09465
Copy Paste: [[2412.09465]] OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs(https://arxiv.org/abs/2412.09465)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model's single-step predictions from initial states match the teacher's predictions from a closer intermediate state. Through extensive experiments on challenging datasets including FFHQ (256$\times$256), DIV2K, and ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off. Code and pre-trained models are available at this https URL and this https URL, respectively.

Title: STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading

Authors: Yilei Zhao, Wentao Zhang, Tingran Yang, Yong Jiang, Fei Huang, Wei Yang Bryan Lim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09468
Pdf URL: https://arxiv.org/pdf/2412.09468
Copy Paste: [[2412.09468]] STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading(https://arxiv.org/abs/2412.09468)
Keywords: robust
Abstract: In financial trading, factor models are widely used to price assets and capture excess returns from mispricing. Recently, we have witnessed the rise of variational autoencoder-based latent factor models, which learn latent factors self-adaptively. While these models focus on modeling overall market conditions, they often fail to effectively capture the temporal patterns of individual stocks. Additionally, representing multiple factors as single values simplifies the model but limits its ability to capture complex relationships and dependencies. As a result, the learned factors are of low quality and lack diversity, reducing their effectiveness and robustness across different trading periods. To address these issues, we propose a Spatio-Temporal factOR Model based on dual vector quantized variational autoencoders, named STORM, which extracts features of stocks from temporal and spatial perspectives, then fuses and aligns these features at the fine-grained and semantic level, and represents the factors as multi-dimensional embeddings. The discrete codebooks cluster similar factor embeddings, ensuring orthogonality and diversity, which helps distinguish between different factors and enables factor selection in financial trading. To show the performance of the proposed factor model, we apply it to two downstream experiments: portfolio management on two stock datasets and individual trading tasks on six specific stocks. The extensive experiments demonstrate STORM's flexibility in adapting to downstream tasks and superior performance over baseline models.

Title: A Novel Ensemble-Based Deep Learning Model with Explainable AI for Accurate Kidney Disease Diagnosis

Authors: Md. Arifuzzaman, Iftekhar Ahmed, Md. Jalal Uddin Chowdhury, Shadman Sakib, Mohammad Shoaib Rahman, Md. Ebrahim Hossain, Shakib Absar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09472
Pdf URL: https://arxiv.org/pdf/2412.09472
Copy Paste: [[2412.09472]] A Novel Ensemble-Based Deep Learning Model with Explainable AI for Accurate Kidney Disease Diagnosis(https://arxiv.org/abs/2412.09472)
Keywords: transformer
Abstract: Chronic Kidney Disease (CKD) represents a significant global health challenge, characterized by the progressive decline in renal function, leading to the accumulation of waste products and disruptions in fluid balance within the body. Given its pervasive impact on public health, there is a pressing need for effective diagnostic tools to enable timely intervention. Our study delves into the application of cutting-edge transfer learning models for the early detection of CKD. Leveraging a comprehensive and publicly available dataset, we meticulously evaluate the performance of several state-of-the-art models, including EfficientNetV2, InceptionNetV2, MobileNetV2, and the Vision Transformer (ViT) technique. Remarkably, our analysis demonstrates superior accuracy rates, surpassing the 90% threshold with MobileNetV2 and achieving 91.5% accuracy with ViT. Moreover, to enhance predictive capabilities further, we integrate these individual methodologies through ensemble modeling, resulting in our ensemble model exhibiting a remarkable 96% accuracy in the early detection of CKD. This significant advancement holds immense promise for improving clinical outcomes and underscores the critical role of machine learning in addressing complex medical challenges.

Title: Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Authors: Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.09501
Pdf URL: https://arxiv.org/pdf/2412.09501
Copy Paste: [[2412.09501]] Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition(https://arxiv.org/abs/2412.09501)
Keywords: robust, large language model
Abstract: As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.

Title: Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction

Authors: Edvard Ghukasyan, Hrant Khachatrian, Rafayel Mkrtchyan, Theofanis P. Raptis
Subjects: cs.CV, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.09507
Pdf URL: https://arxiv.org/pdf/2412.09507
Copy Paste: [[2412.09507]] Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction(https://arxiv.org/abs/2412.09507)
Keywords: transformer
Abstract: Vision Transformers (ViTs) have demonstrated remarkable success in achieving state-of-the-art performance across various image-based tasks and beyond. In this study, we employ a ViT-based neural network to address the problem of indoor pathloss radio map prediction. The network's generalization ability is evaluated across diverse settings, including unseen buildings, frequencies, and antennas with varying radiation patterns. By leveraging extensive data augmentation techniques and pretrained DINOv2 weights, we achieve promising results, even under the most challenging scenarios.

Title: GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency

Authors: Dongyue Lu, Lingdong Kong, Tianxin Huang, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09511
Pdf URL: https://arxiv.org/pdf/2412.09511
Copy Paste: [[2412.09511]] GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency(https://arxiv.org/abs/2412.09511)
Keywords: robust
Abstract: Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models. To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions. Code and corruption datasets have been made publicly available.

Title: Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis

Authors: Shengxuming Zhang, Weihan Li, Tianhong Gao, Jiacong Hu, Haoming Luo, Mingli Song, Xiuming Zhang, Zunlei Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09521
Pdf URL: https://arxiv.org/pdf/2412.09521
Copy Paste: [[2412.09521]] Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis(https://arxiv.org/abs/2412.09521)
Keywords: extraction
Abstract: Pathological diagnosis is vital for determining disease characteristics, guiding treatment, and assessing prognosis, relying heavily on detailed, multi-scale analysis of high-resolution whole slide images (WSI). However, traditional pure vision models face challenges of redundant feature extraction, whereas existing large vision-language models (LVLMs) are limited by input resolution constraints, hindering their efficiency and accuracy. To overcome these issues, we propose two innovative strategies: the mixed task-guided feature enhancement, which directs feature extraction toward lesion-related details across scales, and the prompt-guided detail feature completion, which integrates coarse- and fine-grained features from WSI based on specific prompts without compromising inference speed. Leveraging a comprehensive dataset of 490,000 samples from diverse pathology tasks-including cancer detection, grading, vascular and neural invasion identification, and so on-we trained the pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that this model significantly outperforms existing methods in diagnostic accuracy and efficiency, offering an interactive, clinically aligned approach for auxiliary diagnosis in a wide range of pathology applications.

Title: Can Modern LLMs Act as Agent Cores in Radiology~Environments?

Authors: Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09529
Pdf URL: https://arxiv.org/pdf/2412.09529
Copy Paste: [[2412.09529]] Can Modern LLMs Act as Agent Cores in Radiology~Environments?(https://arxiv.org/abs/2412.09529)
Keywords: interpretability, large language model
Abstract: Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. Radiology, with its complex analytical requirements, is an ideal field for the application of these agents. This paper aims to investigate the pre-requisite question for building concrete radiology agents which is, `Can modern LLMs act as agent cores in radiology environments?' To investigate it, we introduce RadABench with three-fold contributions: First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems in real-world radiology practices effectively. All of our code and data are open-sourced in this https URL.

Title: Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

Authors: Paria Rashidinejad, Yuandong Tian
Subjects: cs.LG, cs.AI, math.OC, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2412.09544
Pdf URL: https://arxiv.org/pdf/2412.09544
Copy Paste: [[2412.09544]] Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking(https://arxiv.org/abs/2412.09544)
Keywords: robust
Abstract: Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain "stationary labels", resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking.

Title: SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing

Authors: Xueting Li, Ye Yuan, Shalini De Mello, Gilles Daviet, Jonathan Leaf, Miles Macklin, Jan Kautz, Umar Iqbal
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09545
Pdf URL: https://arxiv.org/pdf/2412.09545
Copy Paste: [[2412.09545]] SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing(https://arxiv.org/abs/2412.09545)
Keywords: diffusion, generative
Abstract: We introduce SimAvatar, a framework designed to generate simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing, and the human body using a unified geometry or produce hair and garments that are not easily adaptable for simulation within existing simulation pipelines. The primary challenge lies in representing the hair and garment geometry in a way that allows leveraging established prior knowledge from foundational image diffusion models (e.g., Stable Diffusion) while being simulation-ready using either physics or neural simulators. To address this task, we propose a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Specifically, we first employ three text-conditioned 3D generative models to generate garment mesh, body shape and hair strands from the given text prompt. To leverage prior knowledge from foundational diffusion models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair strands and learn the avatar appearance through optimization. To drive the avatar given a pose sequence, we first apply physics simulators onto the garment meshes and hair strands. We then transfer the motion onto 3D Gaussians through carefully designed mechanisms for each body part. As a result, our synthesized avatars have vivid texture and realistic dynamic motion. To the best of our knowledge, our method is the first to produce highly realistic, fully simulation-ready 3D avatars, surpassing the capabilities of current approaches.

Title: Exemplar Masking for Multimodal Incremental Learning

Authors: Yi-Lun Lee, Chen-Yu Lee, Wei-Chen Chiu, Yi-Hsuan Tsai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09549
Pdf URL: https://arxiv.org/pdf/2412.09549
Copy Paste: [[2412.09549]] Exemplar Masking for Multimodal Incremental Learning(https://arxiv.org/abs/2412.09549)
Keywords: robust, large language model
Abstract: Multimodal incremental learning needs to digest the information from multiple modalities while concurrently learning new knowledge without forgetting the previously learned information. There are numerous challenges for this task, mainly including the larger storage size of multimodal data in exemplar-based methods and the computational requirement of finetuning on huge multimodal models. In this paper, we leverage the parameter-efficient tuning scheme to reduce the burden of fine-tuning and propose the exemplar masking framework to efficiently replay old knowledge. Specifically, the non-important tokens are masked based on the attention weights and the correlation across different modalities, significantly reducing the storage size of an exemplar and consequently saving more exemplars under the same memory buffer. Moreover, we design a multimodal data augmentation technique to diversify exemplars for replaying prior knowledge. In experiments, we not only evaluate our method in existing multimodal datasets but also extend the ImageNet-R dataset to a multimodal dataset as a real-world application, where captions are generated by querying multimodal large language models (e.g., InstructBLIP). Extensive experiments show that our exemplar masking framework is more efficient and robust to catastrophic forgetting under the same limited memory buffer. Code is available at this https URL.

Title: Video Creation by Demonstration

Authors: Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09551
Pdf URL: https://arxiv.org/pdf/2412.09551
Copy Paste: [[2412.09551]] Video Creation by Demonstration(https://arxiv.org/abs/2412.09551)
Keywords: diffusion
Abstract: We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopts the form of implicit latent control for maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potentials towards interactive world simulation. Sampled video generation results are available at this https URL.

Title: Does Representation Matter? Exploring Intermediate Layers in Large Language Models

Authors: Oscar Skean, Md Rifat Arefin, Yann LeCun, Ravid Shwartz-Ziv
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09563
Pdf URL: https://arxiv.org/pdf/2412.09563
Copy Paste: [[2412.09563]] Does Representation Matter? Exploring Intermediate Layers in Large Language Models(https://arxiv.org/abs/2412.09563)
Keywords: transformer, large language model
Abstract: Understanding what defines a good representation in large language models (LLMs) is fundamental to both theoretical understanding and practical applications. In this paper, we investigate the quality of intermediate representations in various LLM architectures, including Transformers and State Space Models (SSMs). We find that intermediate layers often yield more informative representations for downstream tasks than the final layers. To measure the representation quality, we adapt and apply a suite of metrics - such as prompt entropy, curvature, and augmentation-invariance - originally proposed in other contexts. Our empirical study reveals significant architectural differences, how representations evolve throughout training, and how factors like input randomness and prompt length affect each layer. Notably, we observe a bimodal pattern in the entropy of some intermediate layers and consider potential explanations tied to training data. Overall, our results illuminate the internal mechanics of LLMs and guide strategies for architectural optimization and training.

Title: Obfuscated Activations Bypass LLM Latent-Space Defenses

Authors: Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09565
Pdf URL: https://arxiv.org/pdf/2412.09565
Copy Paste: [[2412.09565]] Obfuscated Activations Bypass LLM Latent-Space Defenses(https://arxiv.org/abs/2412.09565)
Keywords: defense, attack
Abstract: Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.

Title: JuStRank: Benchmarking LLM Judges for System Ranking

Authors: Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09569
Pdf URL: https://arxiv.org/pdf/2412.09569
Copy Paste: [[2412.09569]] JuStRank: Benchmarking LLM Judges for System Ranking(https://arxiv.org/abs/2412.09569)
Keywords: generative
Abstract: Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.

Title: DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction

Authors: Yu Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yang Li, Yassine Benajiba, Dan Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.09572
Pdf URL: https://arxiv.org/pdf/2412.09572
Copy Paste: [[2412.09572]] DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction(https://arxiv.org/abs/2412.09572)
Keywords: large language model
Abstract: Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the origin query with a wrong answer, yet respond correctly to varied questions from different perspectives about the same query, and vice versa. In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about the same original query. We further implement an abstention policy to withhold responses when uncertainty is high. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods. Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query under diverse varied questions even when knowing the correct answer.

Title: FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

Authors: Jiale Xu, Shenghua Gao, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09573
Pdf URL: https://arxiv.org/pdf/2412.09573
Copy Paste: [[2412.09573]] FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction(https://arxiv.org/abs/2412.09573)
Keywords: transformer
Abstract: Existing sparse-view reconstruction models heavily rely on accurate known camera poses. However, deriving camera extrinsics and intrinsics from sparse-view images presents significant challenges. In this work, we present FreeSplatter, a highly scalable, feed-forward reconstruction framework capable of generating high-quality 3D Gaussians from uncalibrated sparse-view images and recovering their camera parameters in mere seconds. FreeSplatter is built upon a streamlined transformer architecture, comprising sequential self-attention blocks that facilitate information exchange among multi-view image tokens and decode them into pixel-wise 3D Gaussian primitives. The predicted Gaussian primitives are situated in a unified reference frame, allowing for high-fidelity 3D modeling and instant camera parameter estimation using off-the-shelf solvers. To cater to both object-centric and scene-level reconstruction, we train two model variants of FreeSplatter on extensive datasets. In both scenarios, FreeSplatter outperforms state-of-the-art baselines in terms of reconstruction quality and pose estimation accuracy. Furthermore, we showcase FreeSplatter's potential in enhancing the productivity of downstream applications, such as text/image-to-3D content creation.

Title: Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Authors: Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09586
Pdf URL: https://arxiv.org/pdf/2412.09586
Copy Paste: [[2412.09586]] Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders(https://arxiv.org/abs/2412.09586)
Keywords: transformer
Abstract: We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: this http URL .

Title: Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

Authors: Zexin He, Tengfei Wang, Xin Huang, Xingang Pan, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09593
Pdf URL: https://arxiv.org/pdf/2412.09593
Copy Paste: [[2412.09593]] Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion(https://arxiv.org/abs/2412.09593)
Keywords: diffusion
Abstract: Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at this https URL.

Title: InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Authors: Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09596
Pdf URL: https://arxiv.org/pdf/2412.09596
Copy Paste: [[2412.09596]] InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions(https://arxiv.org/abs/2412.09596)
Keywords: large language model
Abstract: Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.

Title: LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors

Authors: Yabo Chen, Chen Yang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Wei Shen, Wenrui Dai, Hongkai Xiong, Qi Tian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.09597
Pdf URL: https://arxiv.org/pdf/2412.09597
Copy Paste: [[2412.09597]] LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors(https://arxiv.org/abs/2412.09597)
Keywords: robust, diffusion, generative
Abstract: Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. Then we use robust neural matching models, i.e. MASt3R, to calibrate the camera poses of generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on two challenging datasets, i.e. LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.

Title: Do Multimodal Large Language Models See Like Humans?

Authors: Jiaying Lin, Shuquan Ye, Rynson W.H. Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09603
Pdf URL: https://arxiv.org/pdf/2412.09603
Copy Paste: [[2412.09603]] Do Multimodal Large Language Models See Like Humans?(https://arxiv.org/abs/2412.09603)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.

Title: SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09604
Pdf URL: https://arxiv.org/pdf/2412.09604
Copy Paste: [[2412.09604]] SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding(https://arxiv.org/abs/2412.09604)
Keywords: large language model
Abstract: The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.

Title: Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

Authors: Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, Yuliang Xiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09606
Pdf URL: https://arxiv.org/pdf/2412.09606
Copy Paste: [[2412.09606]] Feat2GS: Probing Visual Foundation Models with Gaussian Splatting(https://arxiv.org/abs/2412.09606)
Keywords: fair
Abstract: Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness, and require 3D data as ground-truth, which limits the scale and diversity of their evaluation set. To address these issues, we introduce Feat2GS, which readout 3D Gaussians attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture via novel view synthesis, without requiring 3D data. Additionally, the disentanglement of 3DGS parameters - geometry ($\boldsymbol{x}, \alpha, \Sigma$) and texture ($\boldsymbol{c}$) - enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art across diverse datasets. This makes Feat2GS useful for probing VFMs, and as a simple-yet-effective baseline for novel-view synthesis. Code and data will be made available at this https URL.

Title: Spectral Image Tokenizer

Authors: Carlos Esteves, Mohammed Suhail, Ameesh Makadia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09607
Pdf URL: https://arxiv.org/pdf/2412.09607
Copy Paste: [[2412.09607]] Spectral Image Tokenizer(https://arxiv.org/abs/2412.09607)
Keywords: transformer
Abstract: Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster scan order, which is not ideal for autoregressive modeling. In this paper, we propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT), such that the sequence of tokens represents the image in a coarse-to-fine fashion. Our tokenizer brings several advantages: 1) it leverages that natural images are more compressible at high frequencies, 2) it can take and reconstruct images of different resolutions without retraining, 3) it improves the conditioning for next-token prediction -- instead of conditioning on a partial line-by-line reconstruction of the image, it takes a coarse reconstruction of the full image, 4) it enables partial decoding where the first few generated tokens can reconstruct a coarse version of the image, 5) it enables autoregressive models to be used for image upsampling. We evaluate the tokenizer reconstruction metrics as well as multiscale image generation, text-guided image upsampling and editing.

Title: FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Authors: Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09611
Pdf URL: https://arxiv.org/pdf/2412.09611
Copy Paste: [[2412.09611]] FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers(https://arxiv.org/abs/2412.09611)
Keywords: transformer
Abstract: Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation prevents the ability to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within the rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach, along with its disentanglement capabilities.

Title: Olympus: A Universal Task Router for Computer Vision Tasks

Authors: Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09612
Pdf URL: https://arxiv.org/pdf/2412.09612
Copy Paste: [[2412.09612]] Olympus: A Universal Task Router for Computer Vision Tasks(https://arxiv.org/abs/2412.09612)
Keywords: generative, large language model
Abstract: We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: this https URL

Title: Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

Authors: Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09614
Pdf URL: https://arxiv.org/pdf/2412.09614
Copy Paste: [[2412.09614]] Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG(https://arxiv.org/abs/2412.09614)
Keywords: diffusion
Abstract: We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability significantly improves upon the limitations of existing T2I models, which often struggle with the accurate depiction of complex or culturally specific subjects due to dataset constraints. Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG in enhancing T2I models, representing a significant advancement for producing high-fidelity, context-aware multi-faceted images.

Title: EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09618
Pdf URL: https://arxiv.org/pdf/2412.09618
Copy Paste: [[2412.09618]] EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM(https://arxiv.org/abs/2412.09618)
Keywords: robust, diffusion, large language model
Abstract: Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.

Title: SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Authors: Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S.-H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09619
Pdf URL: https://arxiv.org/pdf/2412.09619
Copy Paste: [[2412.09619]] SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training(https://arxiv.org/abs/2412.09619)
Keywords: diffusion
Abstract: Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable a few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).

Title: Learning Camera Movement Control from Real-World Drone Videos

Authors: Yunzhong Hou, Liang Zheng, Philip Torr
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.09620
Pdf URL: https://arxiv.org/pdf/2412.09620
Copy Paste: [[2412.09620]] Learning Camera Movement Control from Real-World Drone Videos(https://arxiv.org/abs/2412.09620)
Keywords: transformer
Abstract: This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos. Data and code are available at this http URL.

Title: LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

Authors: Enis Simsar, Thomas Hofmann, Federico Tombari, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09622
Pdf URL: https://arxiv.org/pdf/2412.09622
Copy Paste: [[2412.09622]] LoRACLR: Contrastive Adaptation for Customization of Diffusion Models(https://arxiv.org/abs/2412.09622)
Keywords: diffusion
Abstract: Recent advances in text-to-image customization have enabled high-fidelity, context-rich generation of personalized images, allowing specific concepts to appear in a variety of scenarios. However, current methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness. We present LoRACLR, a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model without additional individual fine-tuning. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. By enforcing distinct yet cohesive representations for each concept, LoRACLR enables efficient, scalable model composition for high-quality, multi-concept image synthesis. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.

Title: OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation

Authors: Weiqi Li, Shijie Zhao, Chong Mou, Xuhan Sheng, Zhenyu Zhang, Qian Wang, Junlin Li, Li Zhang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09623
Pdf URL: https://arxiv.org/pdf/2412.09623
Copy Paste: [[2412.09623]] OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation(https://arxiv.org/abs/2412.09623)
Keywords: diffusion
Abstract: As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at this https URL.

Title: GenEx: Generating an Explorable World

Authors: Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.09624
Pdf URL: https://arxiv.org/pdf/2412.09624
Copy Paste: [[2412.09624]] GenEx: Generating an Explorable World(https://arxiv.org/abs/2412.09624)
Keywords: robust, generative
Abstract: Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is rounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.

Title: Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors

Authors: Yue Feng, Vaibhav Sanjay, Spencer Lutz, Badour AlBahar, Songwei Ge, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09625
Pdf URL: https://arxiv.org/pdf/2412.09625
Copy Paste: [[2412.09625]] Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors(https://arxiv.org/abs/2412.09625)
Keywords: diffusion
Abstract: Automatically generating multiview illusions is a compelling challenge, where a single piece of visual content offers distinct interpretations from different viewing perspectives. Traditional methods, such as shadow art and wire art, create interesting 3D illusions but are limited to simple visual outputs (i.e., figure-ground or line drawing), restricting their artistic expressiveness and practical versatility. Recent diffusion-based illusion generation methods can generate more intricate designs but are confined to 2D images. In this work, we present a simple yet effective approach for creating 3D multiview illusions based on user-provided text prompts or images. Our method leverages a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering. When viewed from multiple angles, this produces different interpretations. We develop several techniques to improve the quality of the generated 3D multiview illusions. We demonstrate the effectiveness of our approach through extensive experiments and showcase illusion generation with diverse 3D forms.

Title: FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Authors: Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09626
Pdf URL: https://arxiv.org/pdf/2412.09626
Copy Paste: [[2412.09626]] FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion(https://arxiv.org/abs/2412.09626)
Keywords: diffusion
Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exhibit the untapped potential higher-resolution visual generation of pre-trained models. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8k-resolution images for the first time.

Title: Doe-1: Closed-Loop Autonomous Driving with Large World Model

Authors: Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09627
Pdf URL: https://arxiv.org/pdf/2412.09627
Copy Paste: [[2412.09627]] Doe-1: Closed-Loop Autonomous Driving with Large World Model(https://arxiv.org/abs/2412.09627)
Keywords: transformer
Abstract: End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: this https URL.