2025-03-19

Title: SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models

Authors: Chuan Qin, Xin Chen, Chengrui Wang, Pengmin Wu, Xi Chen, Yihang Cheng, Jingyi Zhao, Meng Xiao, Xiangchao Dong, Qingqing Long, Boya Pan, Han Wu, Chengzan Li, Yuanchun Zhou, Hui Xiong, Hengshu Zhu
Subjects: cs.LG, cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2503.13503
Pdf URL: https://arxiv.org/pdf/2503.13503
Copy Paste: [[2503.13503]] SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models(https://arxiv.org/abs/2503.13503)
Keywords: fair, explainability, large language model
Abstract: In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field. However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions (Quality, FAIRness, Explainability, and Compliance), which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for both Earth and Life Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators (Knowledge, Understanding, Reasoning, Multimodality, and Values), spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 20 representative open-source and closed-source LLMs. All the results are publicly available and can be accessed online at this http URL.

Title: CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception

Authors: Rujia Wang, Xiangbo Gao, Hao Xiang, Runsheng Xu, Zhengzhong Tu
Subjects: cs.LG, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.13504
Pdf URL: https://arxiv.org/pdf/2503.13504
Copy Paste: [[2503.13504]] CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception(https://arxiv.org/abs/2503.13504)
Keywords: transformer
Abstract: Multi-agent collaborative perception enhances each agent's perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird's-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.

Title: Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

Authors: Mari Ashiga, Wei Jie, Fan Wu, Vardan Voskanyan, Fateme Dinmohammadi, Paul Brookes, Jingzhi Gong, Zheng Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13505
Pdf URL: https://arxiv.org/pdf/2503.13505
Copy Paste: [[2503.13505]] Ensemble Learning for Large Language Models in Text and Code Generation: A Survey(https://arxiv.org/abs/2503.13505)
Keywords: privacy, transformer, generative, large language model
Abstract: Generative pretrained transformers (GPT) are the common large language models (LLMs) used for generating text from natural language inputs. However, the fixed properties of language parameters in individual LLMs can lead to inconsistencies in the generated outputs. This limitation also restricts the models' ability to represent diverse language patterns due to inherent biases. Moreover, many powerful LLMs are closed-source. This prevents organizations from integrating their data into these systems, raising concerns about data privacy and limiting industry applications. Inspired by the successful application of LLM ensemble models in text generation, recent literature has also investigated their potential in code generation. This article reviews these emerging LLM ensemble approaches. Our goal is to enhance readers' understanding of existing techniques and encourage further research and practical implementation, aiming to expand the real-world applications of LLM ensemble models in both text and code generation. We categorize these approaches into seven main methods: weight merging, knowledge fusion, mixture of experts, reward ensemble, output ensemble, routing, and cascading. From this list, we focus on four methods and models that show strong performance and potential for broader applications. We analyze their modeling steps, training methods, and output features to provide a clear understanding of their capabilities. Our findings highlight the benefits of LLM ensemble techniques. These include better representation of diversity, improved output quality, and greater flexibility in applications. This information offers valuable insights for selecting models for various real-world tasks involving text and code generation, and potentially applying methods to multimodal LLMs.

Title: The Role of Hyperparameters in Predictive Multiplicity

Authors: Mustafa Cavus, Katarzyna Woźnica, Przemysław Biecek
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.13506
Pdf URL: https://arxiv.org/pdf/2503.13506
Copy Paste: [[2503.13506]] The Role of Hyperparameters in Predictive Multiplicity(https://arxiv.org/abs/2503.13506)
Keywords: fair
Abstract: This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data - Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forests, and Extreme Gradient Boosting - we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
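
The notion of prediction discrepancy described above can be illustrated with a minimal sketch. It assumes discrepancy is measured as the fraction of inputs on which two models disagree, and takes the worst case over pairs of hyperparameter configurations; the abstract does not state the paper's exact metric, and the function names and prediction values below are hypothetical.

```python
from itertools import combinations


def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have equal length")
    if not preds_a:
        raise ValueError("prediction lists must be non-empty")
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return differing / len(preds_a)


def max_pairwise_discrepancy(predictions_by_config):
    """Largest disagreement rate over all pairs of configurations."""
    return max(
        disagreement_rate(predictions_by_config[a], predictions_by_config[b])
        for a, b in combinations(predictions_by_config, 2)
    )


# Hypothetical binary predictions on the same eight test inputs,
# produced by one algorithm under three hyperparameter settings.
predictions = {
    "gamma=0.01": [0, 1, 1, 0, 1, 0, 0, 1],
    "gamma=0.1": [0, 1, 0, 0, 1, 0, 1, 1],
    "gamma=1.0": [1, 1, 0, 0, 1, 1, 1, 1],
}

print(max_pairwise_discrepancy(predictions))
```

Two configurations can reach similar accuracy and still disagree on a sizeable share of individual inputs, which is the instability the abstract warns about.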

Title: NeurIPS 2023 LLM Efficiency Fine-tuning Competition

Authors: Mark Saroufim, Yotam Perlitz, Leshem Choshen, Luca Antiga, Greg Bowyer, Christian Puhrsch, Driss Guessous, Supriya Rao, Geeta Chauhan, Ashvini Kumar Jindal, Pawan Kumar Rajpoot, Ankur Parikh, Joe Isaacson, Weiwei Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13507
Pdf URL: https://arxiv.org/pdf/2503.13507
Copy Paste: [[2503.13507]] NeurIPS 2023 LLM Efficiency Fine-tuning Competition(https://arxiv.org/abs/2503.13507)
Keywords: robust, generative, large language model
Abstract: Our analysis of the NeurIPS 2023 large language model (LLM) fine-tuning competition revealed two trends: top-performing models exhibit significant overfitting on benchmark datasets, mirroring the broader issue of benchmark overfitting on popular leaderboards, and data curation is essential to obtaining a high-performing LLM. The competition, which consisted of two stages - an open evaluation stage with publicly available tasks and a closed evaluation stage with unseen tasks - allowed us to assess the generalizability of fine-tuned LLMs. Our results highlight the limitations of current benchmark-based evaluation schemes for generative models and demonstrate the need for more robust evaluation methods. Notably, the winning submissions utilized standard open-source libraries and focused primarily on data curation. To facilitate further research and promote reproducibility, we release all competition entries, Docker files, and evaluation infrastructure, providing a valuable resource for the community to explore fine-tuning, overfitting, and reproducibility in LLMs.

Title: It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

Authors: Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.13508
Pdf URL: https://arxiv.org/pdf/2503.13508
Copy Paste: [[2503.13508]] It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education(https://arxiv.org/abs/2503.13508)
Keywords: generative, large language model
Abstract: The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10^-5), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings of medical MCQ benchmarks, which overestimate the capabilities of LLMs in medicine, and, more broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.

Title: MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance

Authors: Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, Li Shen
Subjects: cs.LG, cs.AI, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.13509
Pdf URL: https://arxiv.org/pdf/2503.13509
Copy Paste: [[2503.13509]] MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance(https://arxiv.org/abs/2503.13509)
Keywords: privacy, large language model
Abstract: We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being.

Title: Prompt Sentiment: The Catalyst for LLM Change

Authors: Vishal Gandhi, Sagar Gandhi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13510
Pdf URL: https://arxiv.org/pdf/2503.13510
Copy Paste: [[2503.13510]] Prompt Sentiment: The Catalyst for LLM Change(https://arxiv.org/abs/2503.13510)
Keywords: fair, transformer, large language model
Abstract: The rise of large language models (LLMs) has revolutionized natural language processing (NLP), yet the influence of prompt sentiment, a latent affective characteristic of input text, remains underexplored. This study systematically examines how sentiment variations in prompts affect LLM-generated outputs in terms of coherence, factuality, and bias. Leveraging both lexicon-based and transformer-based sentiment analysis methods, we categorize prompts and evaluate responses from five leading LLMs: Claude, DeepSeek, GPT-4, Gemini, and LLaMA. Our analysis spans six AI-driven applications, including content generation, conversational AI, legal and financial analysis, healthcare AI, creative writing, and technical documentation. By transforming prompts, we assess their impact on output quality. Our findings reveal that prompt sentiment significantly influences model responses, with negative prompts often reducing factual accuracy and amplifying bias, while positive prompts tend to increase verbosity and sentiment propagation. These results highlight the importance of sentiment-aware prompt engineering for ensuring fair and reliable AI-generated content.

Title: RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning through RAG and Incremental Knowledge Graph Learning Integration

Authors: Hong Qing Yu (University of Derby), Frank McQuade (Bloc Digital)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13514
Pdf URL: https://arxiv.org/pdf/2503.13514
Copy Paste: [[2503.13514]] RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning through RAG and Incremental Knowledge Graph Learning Integration(https://arxiv.org/abs/2503.13514)
Keywords: explainability, large language model
Abstract: This paper presents RAG-KG-IL, a novel multi-agent hybrid framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) by integrating Retrieval-Augmented Generation (RAG) and Knowledge Graphs (KGs) with an Incremental Learning (IL) approach. Despite recent advancements, LLMs still face significant challenges in reasoning with structured data, handling dynamic knowledge evolution, and mitigating hallucinations, particularly in mission-critical domains. Our proposed RAG-KG-IL framework addresses these limitations by employing a multi-agent architecture that enables continuous knowledge updates, integrates structured knowledge, and incorporates autonomous agents for enhanced explainability and reasoning. The framework utilizes RAG to ensure the generated responses are grounded in verifiable information, while KGs provide structured domain knowledge for improved consistency and depth of understanding. The Incremental Learning approach allows for dynamic updates to the knowledge base without full retraining, significantly reducing computational overhead and improving the model's adaptability. We evaluate the framework using real-world case studies involving health-related queries, comparing it to state-of-the-art models like GPT-4o and a RAG-only baseline. Experimental results demonstrate that our approach significantly reduces hallucination rates and improves answer completeness and reasoning accuracy. The results underscore the potential of combining RAG, KGs, and multi-agent systems to create intelligent, adaptable systems capable of real-time knowledge integration and reasoning in complex domains.

Title: CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

Authors: Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13517
Pdf URL: https://arxiv.org/pdf/2503.13517
Copy Paste: [[2503.13517]] CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning(https://arxiv.org/abs/2503.13517)
Keywords: extraction, large language model
Abstract: Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistent high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are in this https URL

Title: Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results

Authors: Peter Fettke, Constantin Houy
Subjects: cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2503.13520
Pdf URL: https://arxiv.org/pdf/2503.13520
Copy Paste: [[2503.13520]] Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results(https://arxiv.org/abs/2503.13520)
Keywords: large language model
Abstract: Large language models (LLMs) have revolutionized the processing of natural language. Although the first benchmarks of the process modeling abilities of LLMs are promising, it is currently under debate to what extent an LLM can generate good process models. In this contribution, we argue that evaluating the process modeling abilities of LLMs is far from trivial. Hence, available evaluation results must be interpreted carefully. For example, even in a simple scenario, not only the quality of a model should be taken into account, but also the cost and time needed for generation. Thus, an LLM does not generate one optimal solution, but a set of Pareto-optimal variants. Moreover, several further challenges have to be taken into account, e.g., conceptualization of quality, validation of results, generalizability, and data leakage. We discuss these challenges in detail and outline future experiments to tackle them scientifically.
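
The point about Pareto-optimal variants can be made concrete with a small sketch. It assumes each generated process model is scored on three criteria (quality, monetary cost, and generation time); the `Candidate` type, the criteria scales, and the sample values are all hypothetical and not taken from the paper.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float  # higher is better
    cost: float     # lower is better
    seconds: float  # lower is better


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost and a.seconds <= b.seconds
    strictly_better = a.quality > b.quality or a.cost < b.cost or a.seconds < b.seconds
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical generation runs for the same modeling task.
candidates = [
    Candidate("A", quality=0.90, cost=0.50, seconds=40.0),
    Candidate("B", quality=0.80, cost=0.10, seconds=5.0),
    Candidate("C", quality=0.75, cost=0.20, seconds=8.0),  # dominated by B
    Candidate("D", quality=0.95, cost=1.20, seconds=90.0),
]

for c in pareto_front(candidates):
    print(c.name, c.quality, c.cost, c.seconds)
```

Here A, B, and D all survive: none is best on every criterion, so ranking them requires a stated preference between quality, cost, and time rather than a single score.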

Title: Agent-Enhanced Large Language Models for Researching Political Institutions

Authors: Joseph R. Loffredo, Suyeol Yun
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.13524
Pdf URL: https://arxiv.org/pdf/2503.13524
Copy Paste: [[2503.13524]] Agent-Enhanced Large Language Models for Researching Political Institutions(https://arxiv.org/abs/2503.13524)
Keywords: large language model
Abstract: The applications of Large Language Models (LLMs) in political science are rapidly expanding. This paper demonstrates how LLMs, when augmented with predefined functions and specialized tools, can serve as dynamic agents capable of streamlining tasks such as data collection, preprocessing, and analysis. Central to this approach is agentic retrieval-augmented generation (Agentic RAG), which equips LLMs with action-calling capabilities for interaction with external knowledge bases. Beyond information retrieval, LLM agents may incorporate modular tools for tasks like document summarization, transcript coding, qualitative variable classification, and statistical modeling. To demonstrate the potential of this approach, we introduce CongressRA, an LLM agent designed to support scholars studying the U.S. Congress. Through this example, we highlight how LLM agents can reduce the costs of replicating, testing, and extending empirical research using the domain-specific data that drives the study of political institutions.

Title: Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms

Authors: Xiaojian Li, Yongkang Leng, Ruiqing Ding, Hangjie Mo, Shanlin Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13530
Pdf URL: https://arxiv.org/pdf/2503.13530
Copy Paste: [[2503.13530]] Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms(https://arxiv.org/abs/2503.13530)
Keywords: extraction, interpretability, large language model
Abstract: The human-like reasoning capabilities exhibited by Large Language Models (LLMs) challenge the traditional neural network theory's understanding of the flexibility of fixed-parameter systems. This paper proposes the "Cognitive Activation" theory, revealing the essence of LLMs' reasoning mechanisms from the perspective of dynamic systems: the model's reasoning ability stems from a chaotic process of dynamic information extraction in the parameter space. By introducing the Quasi-Lyapunov Exponent (QLE), we quantitatively analyze the chaotic characteristics of the model at different layers. Experiments show that the model's information accumulation follows a nonlinear exponential law, and the Multilayer Perceptron (MLP) accounts for a higher proportion in the final output than the attention mechanism. Further experiments indicate that minor initial value perturbations will have a substantial impact on the model's reasoning ability, confirming the theoretical analysis that large language models are chaotic systems. This research provides a chaos theory framework for the interpretability of LLMs' reasoning and reveals potential pathways for balancing creativity and reliability in model design.

Title: Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution

Authors: Jin Kim, Byunghwee Lee, Taekho You, Jinhyuk Yun
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13531
Pdf URL: https://arxiv.org/pdf/2503.13531
Copy Paste: [[2503.13531]] Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution(https://arxiv.org/abs/2503.13531)
Keywords: diffusion, generative
Abstract: The rise of multimodal generative AI is transforming the intersection of technology and art, offering deeper insights into large-scale artwork. Although its creative capabilities have been widely explored, its potential to represent artwork in latent spaces remains underexamined. We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings by extracting two types of latent information with the model: formal aspects (e.g., colors) and contextual aspects (e.g., subject). Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Additionally, using contextual keywords extracted from paintings, we show how artistic expression evolves alongside societal changes. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks, highlighting the significance of mutual interaction between society and art. This study demonstrates how multimodal AI expands traditional formal analysis by integrating temporal, cultural, and historical contexts.

Title: FedTilt: Towards Multi-Level Fairness-Preserving and Robust Federated Learning

Authors: Binghui Zhang, Luis Mares De La Cruz, Binghui Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13537
Pdf URL: https://arxiv.org/pdf/2503.13537
Copy Paste: [[2503.13537]] FedTilt: Towards Multi-Level Fairness-Preserving and Robust Federated Learning(https://arxiv.org/abs/2503.13537)
Keywords: privacy, robust, federate, fair
Abstract: Federated Learning (FL) is an emerging decentralized learning paradigm that can partly address the privacy concern that cannot be handled by traditional centralized and distributed learning. Further, to make FL practical, it is also necessary to consider constraints such as fairness and robustness. However, existing robust FL methods often produce unfair models, and existing fair FL methods only consider one-level (client) fairness and are not robust to persistent outliers (i.e., outliers injected into each training round) that are common in real-world FL settings. We propose FedTilt, a novel FL method that can preserve multi-level fairness and be robust to outliers. In particular, we consider two common levels of fairness, i.e., client fairness -- uniformity of performance across clients, and client data fairness -- uniformity of performance across different classes of data within a client. FedTilt is inspired by the recently proposed tilted empirical risk minimization, which introduces tilt hyperparameters that can be flexibly tuned. Theoretically, we show how tuning tilt values can achieve the two-level fairness and mitigate the persistent outliers, and derive the convergence condition of FedTilt as well. Empirically, our evaluation results on a suite of realistic federated datasets in diverse settings show the effectiveness and flexibility of the FedTilt framework and its superiority over state-of-the-art methods.
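
For readers unfamiliar with tilted empirical risk minimization, the sketch below computes the standard single-level tilted loss, (1/t) * log(mean(exp(t * loss_i))). It is only an illustration of the general technique: the abstract does not give FedTilt's actual multi-level objective, and the function name and sample losses are hypothetical.

```python
import math


def tilted_loss(losses, t):
    """Tilted aggregate of per-sample losses: (1/t) * log(mean(exp(t * l)))."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if t == 0:
        # The tilted loss reduces to the ordinary mean as t approaches 0.
        return sum(losses) / len(losses)
    scaled = [t * l for l in losses]
    m = max(scaled)  # subtract the maximum for a numerically stable log-sum-exp
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / t


# Hypothetical per-client losses; the last client performs much worse.
losses = [0.1, 0.2, 3.0]

print(tilted_loss(losses, 0))    # plain average
print(tilted_loss(losses, 50))   # positive tilt: pulled toward the worst loss
print(tilted_loss(losses, -50))  # negative tilt: pulled toward the best loss
```

A positive tilt emphasizes the worst-performing terms, which encourages uniform performance, while a negative tilt discounts unusually large losses, which is how tilting can suppress outliers.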

Title: MSCMHMST: A traffic flow prediction model based on Transformer

Authors: Weiyang Geng, Yiming Pan, Zhecong Xing, Dongyu Liu, Rui Liu, Yuan Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13540
Pdf URL: https://arxiv.org/pdf/2503.13540
Copy Paste: [[2503.13540]] MSCMHMST: A traffic flow prediction model based on Transformer(https://arxiv.org/abs/2503.13540)
Keywords: robust, transformer
Abstract: This study proposes a hybrid model based on Transformers, named MSCMHMST, aimed at addressing key challenges in traffic flow prediction. Traditional single-method approaches show limitations in traffic prediction tasks, whereas hybrid methods, by integrating the strengths of different models, can provide more accurate and robust predictions. The MSCMHMST model introduces a multi-head, multi-scale attention mechanism, allowing the model to process different parts of the data in parallel and learn its intrinsic representations from multiple perspectives, thereby enhancing the model's ability to handle complex situations. This mechanism enables the model to capture features at various scales effectively, understanding both short-term changes and long-term trends. Verified through experiments on the PeMS04/08 dataset with specific experimental settings, the MSCMHMST model demonstrated excellent robustness and accuracy in long, medium, and short-term traffic flow predictions. The results indicate that this model has significant potential, offering a new and effective solution for the field of traffic flow prediction.

Title: HAR-DoReMi: Optimizing Data Mixture for Self-Supervised Human Activity Recognition Across Heterogeneous IMU Datasets

Authors: Lulu Ban, Tao Zhu, Xiangqing Lu, Qi Qiu, Wenyong Han, Shuangjian Li, Liming Chen, Kevin I-Kai Wang, Mingxing Nie, Yaping Wan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13542
Pdf URL: https://arxiv.org/pdf/2503.13542
Copy Paste: [[2503.13542]] HAR-DoReMi: Optimizing Data Mixture for Self-Supervised Human Activity Recognition Across Heterogeneous IMU Datasets(https://arxiv.org/abs/2503.13542)
Keywords: large language model
Abstract: Cross-dataset Human Activity Recognition (HAR) suffers from limited model generalization, hindering its practical deployment. To address this critical challenge, inspired by the success of DoReMi in Large Language Models (LLMs), we introduce a data mixture optimization strategy for pre-training HAR models, aiming to improve the recognition performance across heterogeneous datasets. However, directly applying DoReMi to the HAR field encounters new challenges due to the continuous, multi-channel and intrinsically heterogeneous characteristics of IMU sensor data. To overcome these limitations, we propose a novel framework, HAR-DoReMi, which introduces a masked reconstruction task based on Mean Squared Error (MSE) loss. By replacing the discrete language sequence prediction task of the original DoReMi framework, which relies on the Negative Log-Likelihood (NLL) loss, the proposed framework is inherently better suited to handling the continuous and multi-channel characteristics of IMU data. In addition, HAR-DoReMi integrates the Mahony fusion algorithm into the self-supervised HAR pre-training, aiming to mitigate the heterogeneity of varying sensor orientation. This is achieved by estimating the sensor orientation within each dataset and facilitating alignment with a unified coordinate system, thereby improving the cross-dataset generalization ability of the HAR model. Experimental evaluation on multiple cross-dataset HAR transfer tasks demonstrates that HAR-DoReMi improves accuracy by an average of 6.51% over the current state-of-the-art method while using only approximately 30% to 50% of the data. These results confirm the effectiveness of HAR-DoReMi in improving the generalization and data efficiency of pre-training HAR models, underscoring its significant potential to facilitate the practical deployment of HAR technology.
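
A masked reconstruction objective with MSE loss can be sketched as follows. This is a generic illustration, not the HAR-DoReMi implementation: it assumes whole time steps are masked and that the loss is averaged only over hidden positions, neither of which the abstract specifies, and the helper names and sample window are hypothetical.

```python
import random


def random_time_mask(num_steps, mask_ratio, rng):
    """Pick which time steps to hide from the model (True means hidden)."""
    num_masked = max(1, round(num_steps * mask_ratio))
    hidden = set(rng.sample(range(num_steps), num_masked))
    return [t in hidden for t in range(num_steps)]


def masked_mse(original, reconstructed, mask):
    """Mean squared error over the masked time steps only.

    original and reconstructed are [time][channel] nested lists of floats.
    """
    total, count = 0.0, 0
    for x_t, y_t, hidden in zip(original, reconstructed, mask):
        if hidden:
            for x, y in zip(x_t, y_t):
                total += (x - y) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask hides no time steps")
    return total / count


# Hypothetical 4-step, 3-channel IMU window and a model's reconstruction of it.
window = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [3.0, 1.0, 0.0]]
reconstruction = [[0.0, 1.0, 2.0], [1.5, 1.0, 0.5], [2.0, 0.0, 1.0], [3.0, 2.0, 0.0]]
mask = [False, True, False, True]

print(masked_mse(window, reconstruction, mask))
print(random_time_mask(4, 0.5, random.Random(0)))
```

Because the target is a real-valued signal rather than a discrete token, a squared-error loss applies directly to each channel, which is the motivation the abstract gives for moving away from NLL.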

Title: Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning

Authors: Xinghao Wu, Jianwei Niu, Xuefeng Liu, Guogang Zhu, Jiayuan Zhang, Shaojie Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13543
Pdf URL: https://arxiv.org/pdf/2503.13543
Copy Paste: [[2503.13543]] Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning(https://arxiv.org/abs/2503.13543)
Keywords: federate, large language model
Abstract: Federated Prototype Learning (FedPL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedPL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedPL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.
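
Note: the generic prototype step that FedPL methods build on (per-class feature centers computed locally, then combined on the server) might be sketched as follows. This shows only the baseline idea described in the abstract, not FedTSP itself; the unweighted server average and all function names are assumptions.

```python
def local_prototypes(features, labels):
    """Per-class mean feature vector on one client."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def global_prototypes(client_protos):
    """Average each class prototype over the clients that hold that class.

    Assumption: a plain unweighted mean; real systems may weight by sample count.
    """
    grouped = {}
    for protos in client_protos:
        for y, p in protos.items():
            grouped.setdefault(y, []).append(p)
    return {
        y: [sum(col) / len(ps) for col in zip(*ps)]
        for y, ps in grouped.items()
    }


client_a = local_prototypes([[0, 0], [2, 2], [1, 3]], ["cat", "cat", "dog"])
client_b = local_prototypes([[3, 3]], ["cat"])
server = global_prototypes([client_a, client_b])
```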

Title: Semi-Decision-Focused Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization

Authors: Juhyeong Kim
Subjects: cs.LG, q-fin.CP, q-fin.PM
Abstract URL: https://arxiv.org/abs/2503.13544
Pdf URL: https://arxiv.org/pdf/2503.13544
Copy Paste: [[2503.13544]] Semi-Decision-Focused Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization(https://arxiv.org/abs/2503.13544)
Keywords: robust
Abstract: I propose Semi-Decision-Focused Learning, a practical adaptation of Decision-Focused Learning for portfolio optimization. Rather than directly optimizing complex financial metrics, I employ simple target portfolios (Max-Sortino or One-Hot) and train models with a convex, cross-entropy loss. I further incorporate Deep Ensemble methods to reduce variance and stabilize performance. Experiments on two universes (one upward-trending and another range-bound) show consistent outperformance over baseline portfolios, demonstrating the effectiveness and robustness of my approach. Code is available at this https URL
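
Note: training against a simple target portfolio with a cross-entropy loss could look roughly like the sketch below. The abstract does not define how the One-Hot target is built, so placing all weight on the asset with the highest realized return is an assumption here, as are the function names.

```python
import math


def softmax(scores):
    """Turn raw model scores into portfolio weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def one_hot_target(future_returns):
    """Assumed One-Hot target: all weight on the best-performing asset."""
    best = max(range(len(future_returns)), key=lambda i: future_returns[i])
    return [1.0 if i == best else 0.0 for i in range(len(future_returns))]


def cross_entropy(target_weights, scores):
    """Cross-entropy between target weights and softmax(scores)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target_weights, probs) if t > 0)


target = one_hot_target([0.01, 0.05, -0.02])
uniform_loss = cross_entropy(target, [0.0, 0.0, 0.0])
confident_loss = cross_entropy(target, [0.0, 5.0, 0.0])
```

The loss is convex in the scores, so a model that scores the target asset higher always receives a lower loss, which avoids differentiating through a financial metric directly.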

Title: CNCast: Leveraging 3D Swin Transformer and DiT for Enhanced Regional Weather Forecasting

Authors: Hongli Liang (1), Yuanting Zhang (1), Qingye Meng (1), Shuangshuang He (1), Xingyuan Yuan (1) ((1) ColorfulClouds Technology Co., Ltd)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13546
Pdf URL: https://arxiv.org/pdf/2503.13546
Copy Paste: [[2503.13546]] CNCast: Leveraging 3D Swin Transformer and DiT for Enhanced Regional Weather Forecasting(https://arxiv.org/abs/2503.13546)
Keywords: diffusion, transformer
Abstract: This study introduces a cutting-edge regional weather forecasting model based on the SwinTransformer 3D architecture. This model is specifically designed to deliver precise hourly weather predictions ranging from 1 hour to 5 days, significantly improving the reliability and practicality of short-term weather forecasts. Our model has demonstrated generally superior performance when compared to Pangu, a well-established global model. The evaluation indicates that our model excels in predicting most weather variables, highlighting its potential as a more effective alternative in the field of limited area modeling. A noteworthy feature of this model is the integration of enhanced boundary conditions, inspired by traditional numerical weather prediction (NWP) techniques. This integration has substantially improved the model's predictive accuracy. Additionally, the model includes an innovative approach for diagnosing hourly total precipitation at a high spatial resolution of approximately 5 kilometers. This is achieved through a latent diffusion model, offering an alternative method for generating high-resolution precipitation data.

Title: Fuzzy Rule-based Differentiable Representation Learning

Authors: Wei Zhang, Zhaohong Deng, Guanjin Wang, Kup-Sze Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13548
Pdf URL: https://arxiv.org/pdf/2503.13548
Copy Paste: [[2503.13548]] Fuzzy Rule-based Differentiable Representation Learning(https://arxiv.org/abs/2503.13548)
Keywords: robust, extraction, interpretability
Abstract: Representation learning has emerged as a crucial focus in machine and deep learning, involving the extraction of meaningful and useful features and patterns from the input data, thereby enhancing the performance of various downstream tasks such as classification, clustering, and prediction. Current mainstream representation learning methods primarily rely on non-linear data mining techniques such as kernel methods and deep neural networks to extract abstract knowledge from complex datasets. However, most of these methods are black-box, lacking transparency and interpretability in the learning process, which constrains their practical utility. To this end, this paper introduces a novel representation learning method grounded in an interpretable fuzzy rule-based model. Specifically, it is built upon the Takagi-Sugeno-Kang fuzzy system (TSK-FS) to initially map input data to a high-dimensional fuzzy feature space through the antecedent part of the TSK-FS. Subsequently, a novel differentiable optimization method is proposed for learning the consequent part, which preserves the model's interpretability and transparency while further exploring the nonlinear relationships within the data. This optimization method retains the essence of traditional optimization: certain parts of the process are parameterized, corresponding differentiable modules are constructed, and a deep optimization process is implemented. Consequently, this method not only enhances the model's performance but also ensures its interpretability. Moreover, a second-order geometry preservation method is introduced to further improve the robustness of the proposed method. Extensive experiments conducted on various benchmark datasets validate the superiority of the proposed method, highlighting its potential for advancing representation learning methodologies.

Title: Towards Privacy-Preserving Data-Driven Education: The Potential of Federated Learning

Authors: Mohammad Khalil, Ronas Shakya, Qinyi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13550
Pdf URL: https://arxiv.org/pdf/2503.13550
Copy Paste: [[2503.13550]] Towards Privacy-Preserving Data-Driven Education: The Potential of Federated Learning(https://arxiv.org/abs/2503.13550)
Keywords: privacy, protect, attack, federate
Abstract: The increasing adoption of data-driven applications in education, such as learning analytics and AI in education, has raised significant privacy and data protection concerns. While these challenges have been widely discussed in previous works, practical solutions remain limited. Federated learning has recently been discussed as a promising privacy-preserving technique, yet its application in education remains scarce. This paper presents an experimental evaluation of federated learning for educational data prediction, comparing its performance to traditional non-federated approaches. Our findings indicate that federated learning achieves comparable predictive accuracy. Furthermore, under adversarial attacks, federated learning demonstrates greater resilience compared to non-federated settings. In summary, our results reinforce the value of federated learning as a potential approach for balancing predictive performance and privacy in educational contexts.

Title: Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, Hailei Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13551
Pdf URL: https://arxiv.org/pdf/2503.13551
Copy Paste: [[2503.13551]] Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models(https://arxiv.org/abs/2503.13551)
Keywords: robust, large language model
Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate steps. In this paper, we propose a novel reward model approach, Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at fine-grained and coarse-grained levels. HRM performs better in assessing reasoning coherence and self-reflection, particularly when the previous reasoning step is incorrect. Furthermore, to address the inefficiency of autonomously generating PRM training data via Monte Carlo Tree Search (MCTS), we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC) based on node merging (combining two consecutive reasoning steps into one step) in the tree structure. This approach diversifies MCTS results for HRM with negligible computational overhead, enhancing label robustness by introducing noise. Empirical results on the PRM800K dataset demonstrate that HRM, in conjunction with HNC, achieves superior stability and reliability in evaluation compared to PRM. Furthermore, cross-domain evaluations on MATH500 and GSM8K confirm HRM's superior generalization and robustness across diverse reasoning tasks. The code for all experiments will be released at https://github.com/tengwang0318/hierarchial_reward_model.
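
Note: the node-merging idea behind HNC (combining two consecutive reasoning steps into one) can be illustrated on a flat reasoning path. This sketch is hypothetical: the paper operates on MCTS tree nodes and their labels, which are omitted here, and the function names are made up for illustration.

```python
def merge_consecutive(steps, i, sep=" "):
    """Return a copy of `steps` with steps[i] and steps[i + 1] merged into one."""
    if not 0 <= i < len(steps) - 1:
        raise IndexError("i must index a step that has a successor")
    return steps[:i] + [steps[i] + sep + steps[i + 1]] + steps[i + 2:]


def all_single_merges(steps):
    """Every path obtainable by merging exactly one adjacent pair of steps."""
    return [merge_consecutive(steps, i) for i in range(len(steps) - 1)]


path = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 7."]
augmented = all_single_merges(path)
```

A path with n steps yields n - 1 merged variants, each one step shorter, so the augmentation costs only list slicing and string concatenation.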

Title: MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG

Authors: Pingyu Wu, Daiheng Gao, Jing Tang, Huimin Chen, Wenbo Zhou, Weiming Zhang, Nenghai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13563
Pdf URL: https://arxiv.org/pdf/2503.13563
Copy Paste: [[2503.13563]] MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG(https://arxiv.org/abs/2503.13563)
Keywords: secure, security, protect, large language model
Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we propose the MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying protections prior to data access. Additionally, the system supports real-time multi-modal outputs, including text, images, audio, and video, seamlessly integrating into existing RAG architectures. Experimental results demonstrate that MES-RAG significantly improves both accuracy and recall, highlighting its effectiveness in advancing the security and utility of question-answering, increasing accuracy to 0.83 (+0.25) on a targeted task. Our code and data are available at this https URL.

Title: ExChanGeAI: An End-to-End Platform and Efficient Foundation Model for Electrocardiogram Analysis and Fine-tuning

Authors: Lucas Bickmann, Lucas Plagwitz, Antonius Büscher, Lars Eckardt, Julian Varghese
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13570
Pdf URL: https://arxiv.org/pdf/2503.13570
Copy Paste: [[2503.13570]] ExChanGeAI: An End-to-End Platform and Efficient Foundation Model for Electrocardiogram Analysis and Fine-tuning(https://arxiv.org/abs/2503.13570)
Keywords: privacy
Abstract: Electrocardiogram data, one of the most widely available types of biosignal data, has become increasingly valuable with the emergence of deep learning methods, providing novel insights into cardiovascular diseases and broader health conditions. However, the heterogeneity of electrocardiogram formats, limited access to deep learning model weights, and the intricate algorithmic steps needed to fine-tune effectively for one's own disease target labels result in complex workflows. In this work, we introduce ExChanGeAI, a web-based end-to-end platform that streamlines the reading of different formats, pre-processing, visualization and custom machine learning with local and privacy-preserving fine-tuning. ExChanGeAI is adaptable for use on personal computers and scalable to high-performance server environments. The platform offers state-of-the-art deep learning models for training from scratch, alongside our novel open-source electrocardiogram foundation model CardX, pre-trained on over one million electrocardiograms. Evaluation across three external validation sets, including an entirely new test set extracted from routine care, demonstrates the fine-tuning capabilities of ExChanGeAI. CardX outperformed the benchmark foundation model while requiring significantly fewer parameters and lower computational resources. The platform enables users to empirically determine the most suitable model for their specific tasks based on systematic this http URL code is available at this https URL .

Title: Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Authors: Kai Tong, Kang Pan, Xiao Zhang, Erli Meng, Run He, Yawen Cui, Nuoyan Guo, Huiping Zhuang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.13575
Pdf URL: https://arxiv.org/pdf/2503.13575
Copy Paste: [[2503.13575]] Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model(https://arxiv.org/abs/2503.13575)
Keywords: large language model
Abstract: Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, finetuning LLMs diminishes these general skills, and continual finetuning further causes severe degradation of accumulated knowledge. Recently, Continual Learning (CL) for LLMs has emerged, aiming to continually adapt LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. To address these issues, this paper proposes Analytic Subspace Routing (ASR). For each task, we isolate the learning within a subspace of deep layers' features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property of previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.
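
Note: the textbook Recursive Least Squares update shows why such a router can learn from a data stream without revisiting earlier samples. The sketch below is the standard scalar-output form, not the paper's router: the regularization value, the function names, and the toy data are assumptions.

```python
def rls_init(dim, reg=1e-6):
    """Start with zero weights and P = I / reg (inverse of the regularizer)."""
    P = [[(1.0 / reg if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    return [0.0] * dim, P


def rls_update(w, P, x, y):
    """Fold one sample (x, y) into the least-squares solution.

    Only w and P are kept between calls; past samples are never needed again.
    """
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    err = y - sum(w[i] * x[i] for i in range(n))
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # P is symmetric, so the row vector x^T P equals Px transposed.
    P_new = [[P[i][j] - gain[i] * Px[j] for j in range(n)] for i in range(n)]
    return w_new, P_new


# Stream three samples generated by y = 2*x0 - 1*x1.
w, P = rls_init(2)
for x, y in [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]:
    w, P = rls_update(w, P, x, y)
```

After the stream is consumed, `w` matches the ridge-regression solution over all samples seen so far, which is the sense in which the update does not forget earlier data.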

Title: A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models

Authors: Ziqiang Li, Jun Li, Lizhi Xiong, Zhangjie Fu, Zechao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13576
Pdf URL: https://arxiv.org/pdf/2503.13576
Copy Paste: [[2503.13576]] A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models(https://arxiv.org/abs/2503.13576)
Keywords: diffusion
Abstract: Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.

Title: Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization

Authors: Hao Li, Yubin Xiao, Ke Liang, Mengzhu Wang, Long Lan, Kenli Li, Xinwang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13617
Pdf URL: https://arxiv.org/pdf/2503.13617
Copy Paste: [[2503.13617]] Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization(https://arxiv.org/abs/2503.13617)
Keywords: diffusion, segmentation
Abstract: Single Domain Generalization (SDG) aims to train models with consistent performance across diverse scenarios using data from a single source. While latent diffusion models (LDMs) show promise in augmenting limited source data, we demonstrate that directly using synthetic data can be detrimental due to significant feature distribution discrepancies between synthetic and real target domains, leading to performance degradation. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, the Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, the Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on object detection and semantic segmentation tasks demonstrate that DRSF achieves substantial performance gains with only marginal computational overhead. Notably, DRSF's plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability in addressing diverse and real-world domain challenges.

Title: XChainDataGen: A Cross-Chain Dataset Generation Framework

Authors: André Augusto, André Vasconcelos, Miguel Correia, Luyao Zhang
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2503.13637
Pdf URL: https://arxiv.org/pdf/2503.13637
Copy Paste: [[2503.13637]] XChainDataGen: A Cross-Chain Dataset Generation Framework(https://arxiv.org/abs/2503.13637)
Keywords: security
Abstract: The number of blockchain interoperability protocols for transferring data and assets between blockchains has grown significantly. However, no open dataset of cross-chain transactions exists to study interoperability protocols in operation. There is also no tool to generate such datasets and make them available to the community. This paper proposes XChainDataGen, a tool to extract cross-chain data from blockchains and generate datasets of cross-chain transactions (cctxs). Using XChainDataGen, we extracted over 35 GB of data from five cross-chain protocols deployed on 11 blockchains in the last seven months of 2024, identifying 11,285,753 cctxs that moved over 28 billion USD in cross-chain token transfers. Using the data collected, we compare protocols and provide insights into their security, cost, and performance trade-offs. As examples, we highlight differences between protocols that require full finality on the source blockchain and those that only demand soft finality (security). We compare user costs, fee models, and the impact of variables such as the Ethereum gas price on protocol fees (cost). Finally, we produce the first analysis of the implications of EIP-7683 for cross-chain intents, which are increasingly popular and greatly improve the speed with which cctxs are processed (performance), thereby enhancing the user experience. The availability of XChainDataGen and this dataset allows various analyses, including trends in cross-chain activity, security assessments of interoperability protocols, and financial research on decentralized finance (DeFi) protocols.

Title: Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

Authors: Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13646
Pdf URL: https://arxiv.org/pdf/2503.13646
Copy Paste: [[2503.13646]] Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos(https://arxiv.org/abs/2503.13646)
Keywords: large language model
Abstract: Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at this https URL.

Title: Web Artifact Attacks Disrupt Vision Language Models

Authors: Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13652
Pdf URL: https://arxiv.org/pdf/2503.13652
Copy Paste: [[2503.13652]] Web Artifact Attacks Disrupt Vision Language Models(https://arxiv.org/abs/2503.13652)
Keywords: attack, robust
Abstract: Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.

Title: Pensez: Less Data, Better Reasoning -- Rethinking French LLM

Authors: Huy Hoang Ha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13661
Pdf URL: https://arxiv.org/pdf/2503.13661
Copy Paste: [[2503.13661]] Pensez: Less Data, Better Reasoning -- Rethinking French LLM(https://arxiv.org/abs/2503.13661)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, achieving strong performance in specialized domains like mathematical reasoning and non-English languages often requires extensive training on massive datasets. This paper investigates a contrasting approach: strategic fine-tuning on a small, high-quality, bilingual (English-French) dataset to enhance both the reasoning capabilities and French language proficiency of a large language model. Rather than relying on scale, we explore the hypothesis that targeted data curation and optimized training can achieve competitive, or even superior, performance. We demonstrate, through targeted supervised fine-tuning (SFT) on only 2,000 carefully selected samples, significant improvements in mathematical reasoning. Specifically, Pensez 7B exhibits an increase in accuracy over the base model of up to 20% on the AIME25 and a 12% increase on a French MATH level 5 benchmark. These results challenge the prevailing assumption that massive datasets are a prerequisite for strong reasoning performance in LLMs, highlighting the potential of strategic data curation and optimized fine-tuning for enhancing both specialized skills and multilingual capabilities. Our findings have implications for the efficient development of high-performing, multilingual LLMs, especially in resource-constrained scenarios.

Title: FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

Authors: Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13684
Pdf URL: https://arxiv.org/pdf/2503.13684
Copy Paste: [[2503.13684]] FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models(https://arxiv.org/abs/2503.13684)
Keywords: fair, diffusion
Abstract: Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in the training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demos are available on the anonymous website: this https URL

Title: Feature Extraction and Analysis for GPT-Generated Text

Authors: A. Selvioğlu, V. Adanova, M. Atagoziev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13687
Pdf URL: https://arxiv.org/pdf/2503.13687
Copy Paste: [[2503.13687]] Feature Extraction and Analysis for GPT-Generated Text(https://arxiv.org/abs/2503.13687)
Keywords: extraction
Abstract: With the rise of advanced natural language models like GPT, distinguishing between human-written and GPT-generated text has become increasingly challenging and crucial across various domains, including academia. The long-standing issue of plagiarism has grown more pressing, now compounded by concerns about the authenticity of information, as it is not always clear whether the presented facts are genuine or fabricated. In this paper, we present a comprehensive study of feature extraction and analysis for differentiating between human-written and GPT-generated text. By applying machine learning classifiers to these extracted features, we evaluate the significance of each feature in detection. Our results demonstrate that human and GPT-generated texts exhibit distinct writing styles, which can be effectively captured by our features. Given sufficiently long text, the two can be differentiated with high accuracy.

Title: Mitigating Spectral Bias in Neural Operators via High-Frequency Scaling for Physical Systems

Authors: Siavash Khodakarami, Vivek Oommen, Aniruddha Bora, George Em Karniadakis
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2503.13695
Pdf URL: https://arxiv.org/pdf/2503.13695
Copy Paste: [[2503.13695]] Mitigating Spectral Bias in Neural Operators via High-Frequency Scaling for Physical Systems(https://arxiv.org/abs/2503.13695)
Keywords: diffusion
Abstract: Neural operators have emerged as powerful surrogates for modeling complex physical problems. However, they suffer from spectral bias, making them oblivious to high-frequency modes, which are present in multiscale physical systems. Therefore, they tend to produce over-smoothed solutions, which is particularly problematic in modeling turbulence and for systems with intricate patterns and sharp gradients such as multi-phase flow systems. In this work, we introduce a new approach named high-frequency scaling (HFS) to mitigate spectral bias in convolution-based neural operators. By integrating HFS with proper variants of UNet neural operators, we demonstrate a higher prediction accuracy by mitigating spectral bias in single and two-phase flow problems. Unlike Fourier-based techniques, HFS is directly applied to the latent space, thus eliminating the computational cost associated with the Fourier transform. Additionally, we investigate alternative spectral bias mitigation through diffusion models conditioned on neural operators. While the diffusion model integrated with the standard neural operator may still suffer from significant errors, these errors are substantially reduced when the diffusion model is integrated with an HFS-enhanced neural operator.

Title: Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios

Authors: Iryna Repinetska, Anna Hilsmann, Peter Eisert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13710
Pdf URL: https://arxiv.org/pdf/2503.13710
Copy Paste: [[2503.13710]] Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios(https://arxiv.org/abs/2503.13710)
Keywords: robust
Abstract: Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as "floaters" that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree "inside-out" views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.

Title: SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint

Authors: Zhenlong Yuan, Zhidong Yang, Yujun Cai, Kuangxin Wu, Mufan Liu, Dapeng Zhang, Hao Jiang, Zhaoxin Li, Zhaoqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13721
Pdf URL: https://arxiv.org/pdf/2503.13721
Copy Paste: [[2503.13721]] SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint(https://arxiv.org/abs/2503.13721)
Keywords: robust, diffusion, segmentation
Abstract: Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and a multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance to guide patch deformation, followed by a multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid the potential inaccuracy of random initialization, we combine sparse points from LoFTR and the monocular depth map from DepthAnything V2 to restore a reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate the segmentation image with the monocular depth map to exploit inter-instance occlusion relationships, then further regard them as an occlusion map to implement two distinct edge constraints, thereby facilitating occlusion-aware patch deformation. Extensive results on the ETH3D, Tanks & Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

Title: Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Authors: Shristi Das Biswas, Efstathia Soufleri, Arani Roy, Kaushik Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13724
Pdf URL: https://arxiv.org/pdf/2503.13724
Copy Paste: [[2503.13724]] Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition(https://arxiv.org/abs/2503.13724)
Keywords: robust, transformer
Abstract: Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames, and P-frames (motion vectors and residuals) offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame -- P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$J/V) and fast inference ($16$V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.

Title: TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

Authors: Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad I. Morariu, Chitta Baral, Yezhou Yang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.13730
Pdf URL: https://arxiv.org/pdf/2503.13730
Copy Paste: [[2503.13730]] TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark(https://arxiv.org/abs/2503.13730)
Keywords: diffusion
Abstract: Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.

Title: CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings

Authors: Daniil Orel, Dilshod Azizov, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13733
Pdf URL: https://arxiv.org/pdf/2503.13733
Copy Paste: [[2503.13733]] CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings(https://arxiv.org/abs/2503.13733)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, these advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. To this end, we propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, and apply rigorous data quality checks, feature engineering, and a comparative evaluation of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting the authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Moreover, our extensive experiments show that our framework effectively distinguishes human- from LLM-written code and sets a new benchmark for this task.

Title: AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications

Authors: Haiying Shen, Tanmoy Sen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13737
Pdf URL: https://arxiv.org/pdf/2503.13737
Copy Paste: [[2503.13737]] AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications(https://arxiv.org/abs/2503.13737)
Keywords: large language model
Abstract: In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short prompts and long prompts and heterogeneous SLOs for iteration time. To improve throughput when handling long prompts, previous research introduces a chunking method, but has not addressed heterogeneous SLOs. To address the limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces four core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) Iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; (3) Multi-resource-aware batching, which selects queued requests to maximize the utilizations of both GPU compute resource and key-value cache (KVC). Trace-driven real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to the state-of-the-art approaches. It achieves performance near the Oracle, which optimally maximizes goodput.

Title: Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes

Authors: Keqi Chen, Vinkle Srivastav, Didier Mutter, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13739
Pdf URL: https://arxiv.org/pdf/2503.13739
Copy Paste: [[2503.13739]] Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes(https://arxiv.org/abs/2503.13739)
Keywords: robust
Abstract: Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at this https URL.

Title: FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Authors: Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour
Subjects: cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2503.13745
Pdf URL: https://arxiv.org/pdf/2503.13745
Copy Paste: [[2503.13745]] FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution(https://arxiv.org/abs/2503.13745)
Keywords: privacy, federate
Abstract: Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: this https URL
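
For readers unfamiliar with the metric, the following is a minimal sketch of how PSNR is conventionally computed and what a gain of 0.85 dB means in terms of mean squared error. It is not taken from the paper: the function names, the 8-bit peak value of 255, and the toy pixel values are illustrative assumptions.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB (max_val=255 assumes 8-bit pixels)."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / err)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

# Toy values, purely illustrative.
reference = [52, 55, 61, 66]
restored = [54, 55, 60, 63]
print(round(psnr(reference, restored), 2))
print(round(mse_ratio_for_gain(0.85), 3))
```

Because PSNR is logarithmic in the error, a 0.85 dB improvement corresponds to roughly 18% lower mean squared error at a fixed peak value.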

Title: Fast alignment of heterogeneous images in sliced Wasserstein distance

Authors: Yunpeng Shi, Amit Singer, Eric J. Verbeke
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2503.13756
Pdf URL: https://arxiv.org/pdf/2503.13756
Copy Paste: [[2503.13756]] Fast alignment of heterogeneous images in sliced Wasserstein distance(https://arxiv.org/abs/2503.13756)
Keywords: robust
Abstract: Many applications of computer vision rely on the alignment of similar but non-identical images. We present a fast algorithm for aligning heterogeneous images based on optimal transport. Our approach combines the speed of fast Fourier methods with the robustness of sliced probability metrics and allows us to efficiently compute the alignment between two $L \times L$ images using the sliced 2-Wasserstein distance in $O(L^2 \log L)$ operations. We show that our method is robust to translations, rotations and deformations in the images.
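
As background for the metric named above, here is a minimal sketch of the squared sliced 2-Wasserstein distance between two 2D point sets of equal size. It is not the paper's fast image-alignment algorithm: it uses plain sorting along a fixed set of evenly spaced directions, and the function name, the point-set setting, and the default number of angles are illustrative assumptions.

```python
import math

def sliced_w2_squared(points_a, points_b, n_angles=180):
    """Squared sliced 2-Wasserstein distance between two equal-size 2D point sets.

    For each direction, both sets are projected onto a line; the 1D
    2-Wasserstein distance between equal-size samples is obtained by
    matching sorted projections. The result averages over directions.
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("point sets must be non-empty and of equal size")
    total = 0.0
    for k in range(n_angles):
        theta = math.pi * k / n_angles  # evenly spaced directions in [0, pi)
        c, s = math.cos(theta), math.sin(theta)
        proj_a = sorted(c * x + s * y for x, y in points_a)
        proj_b = sorted(c * x + s * y for x, y in points_b)
        total += sum((p - q) ** 2 for p, q in zip(proj_a, proj_b)) / len(proj_a)
    return total / n_angles

cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in cloud]
print(sliced_w2_squared(cloud, cloud))
print(sliced_w2_squared(cloud, shifted))
```

For a pure translation by a vector t, each slice contributes the squared projection of t, so the average over evenly spaced directions is |t|^2 / 2. This makes translations a convenient sanity check for an implementation.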

Title: Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

Authors: Mohammad Partohaghighi, Roummel Marcia, YangQuan Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13764
Pdf URL: https://arxiv.org/pdf/2503.13764
Copy Paste: [[2503.13764]] Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems(https://arxiv.org/abs/2503.13764)
Keywords: robust
Abstract: Fractional-order stochastic gradient descent (FOSGD) leverages a fractional exponent to capture long-memory effects in optimization, yet its practical impact is often constrained by the difficulty of tuning and stabilizing this exponent. In this work, we introduce 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), a novel method that synergistically combines the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to automatically calibrate the fractional exponent in a data-driven manner. By continuously gauging model sensitivity and effective dimensionality, 2SED dynamically adjusts the exponent to curb erratic oscillations and enhance convergence rates. Theoretically, we demonstrate how this dimension-aware adaptation retains the benefits of fractional memory while averting the sluggish or unstable behaviors frequently observed in naive fractional SGD. Empirical evaluations across multiple benchmarks confirm that our 2SED-driven fractional exponent approach not only converges faster but also achieves more robust final performance, suggesting broad applicability for fractional-order methodologies in large-scale machine learning and related domains.

Title: Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion

Authors: Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13769
Pdf URL: https://arxiv.org/pdf/2503.13769
Copy Paste: [[2503.13769]] Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion(https://arxiv.org/abs/2503.13769)
Keywords: generative
Abstract: How can we effectively unlearn selected concepts from pre-trained generative foundation models without resorting to extensive retraining? This research introduces 'continual unlearning', a novel paradigm that enables the targeted, incremental removal of multiple specific concepts from foundational generative models. We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts while preserving the generation of related, non-targeted concepts and alleviating generalization erosion. For this, DUGE targets three losses: a cross-attention loss that steers the focus towards images devoid of the target concept; a prior-preservation loss that safeguards knowledge related to non-target concepts; and a regularization loss that prevents the model from suffering from generalization erosion. Experimental results demonstrate the ability of the proposed approach to exclude certain concepts without compromising the overall integrity and performance of the model. This offers a pragmatic solution for refining generative models, adeptly handling the intricacies of model training and concept management, and lowering the risks of copyright infringement, misuse of personal or licensed material, and replication of distinctive artistic styles. Importantly, it maintains the non-targeted concepts, thereby safeguarding the model's core capabilities and effectiveness.

Title: Mitigating KV Cache Competition to Enhance User Experience in LLM Inference

Authors: Haiying Shen, Tanmoy Sen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13773
Pdf URL: https://arxiv.org/pdf/2503.13773
Copy Paste: [[2503.13773]] Mitigating KV Cache Competition to Enhance User Experience in LLM Inference(https://arxiv.org/abs/2503.13773)
Keywords: large language model
Abstract: In Large Language Model (LLM) serving, the KV-cache (KVC) bottleneck causes high tail Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT), impairing user experience, particularly in time-sensitive applications. However, satisfying both TTFT and TBT service-level objectives (SLOs) is challenging. To address this, we propose a system, named CacheOPT, for mitigating KV Cache competition, based on key insights from our measurements and incorporating novel components. First, it estimates a request's output length, bounding the deviation with a high specified probability, adjusted based on the request arrival rate. Second, it allocates the estimated KVC demand to a request, and reuses other requests' allocated KVC to avoid preemptions while reducing waiting time. Third, it proactively allocates KVC before, rather than at, the time a request exhausts its allocation, and reserves KVC globally to prevent preemptions. Fourth, it chooses to preempt a request that has a long TBT SLO, a long remaining job time, and a short preemption time. Fifth, it selects the shortest-latency strategy between swapping and recomputation for preemptions. Experiments show that CacheOPT achieves up to 3.29$\times$ and 2.83$\times$ lower tail TBT and tail TTFT, 47% and 53% higher TTFT and TBT SLO attainments, and supports up to 1.58$\times$ higher request arrival rate than the state-of-the-art methods.

Title: 8-Calves Image dataset

Authors: Xuyang Fang, Sion Hannuna, Neill Campbell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13777
Pdf URL: https://arxiv.org/pdf/2503.13777
Copy Paste: [[2503.13777]] 8-Calves Image dataset(https://arxiv.org/abs/2503.13777)
Keywords: transformer
Abstract: We introduce the 8-Calves dataset, a benchmark for evaluating object detection and identity classification in occlusion-rich, temporally consistent environments. The dataset comprises a 1-hour video (67,760 frames) of eight Holstein Friesian calves in a barn, with ground truth bounding boxes and identities, alongside 900 static frames for detection tasks. Each calf exhibits a unique coat pattern, enabling precise identity distinction. For cow detection, we fine-tuned 28 models (25 YOLO variants, 3 transformers) on 600 frames, testing on the full video. Results reveal that smaller YOLO models (e.g., YOLOv9c) outperform larger counterparts despite potential bias from a YOLOv8m-based labeling pipeline. For identity classification, embeddings from 23 pretrained vision models (ResNet, ConvNextV2, ViTs) were evaluated via linear classifiers and KNN. Modern architectures like ConvNextV2 excelled, while larger models frequently overfit, highlighting inefficiencies in scaling. Key findings include: (1) Minimal, targeted augmentations (e.g., rotation) outperform complex strategies on simpler datasets; (2) Pretraining strategies (e.g., BEiT, DinoV2) significantly boost identity recognition; (3) Temporal continuity and natural motion patterns offer unique challenges absent in synthetic or domain-specific benchmarks. The dataset's controlled design and extended sequences (1 hour vs. prior 10-minute benchmarks) make it a pragmatic tool for stress-testing occlusion handling, temporal consistency, and efficiency. The link to the dataset is this https URL.

Title: Using 3D reconstruction from image motion to predict total leaf area in dwarf tomato plants

Authors: Dmitrii Usenko, David Helman, Chen Giladi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13778
Pdf URL: https://arxiv.org/pdf/2503.13778
Copy Paste: [[2503.13778]] Using 3D reconstruction from image motion to predict total leaf area in dwarf tomato plants(https://arxiv.org/abs/2503.13778)
Keywords: robust
Abstract: Accurate estimation of total leaf area (TLA) is crucial for evaluating plant growth, photosynthetic activity, and transpiration. However, it remains challenging for bushy plants like dwarf tomatoes due to their complex canopies. Traditional methods are often labor-intensive, damaging to plants, or limited in capturing canopy complexity. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars (Mohamed, Hahms Gelbe Topftomate, and Red Robin) grown under controlled greenhouse conditions. Two experiments (spring-summer and autumn-winter) included 73 plants, yielding 418 TLA measurements via an "onion" approach. High-resolution videos were recorded, and 500 frames per plant were used for 3D reconstruction. Point clouds were processed using four algorithms (Alpha Shape, Marching Cubes, Poisson's, Ball Pivoting), and meshes were evaluated with seven regression models: Multivariable Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, Random Forest, Extreme Gradient Boosting, and Multilayer Perceptron. The Alpha Shape reconstruction ($\alpha = 3$) with Extreme Gradient Boosting achieved the best performance ($R^2 = 0.80$, $\mathrm{MAE} = 489\ \mathrm{cm}^2$). Cross-experiment validation showed robust results ($R^2 = 0.56$, $\mathrm{MAE} = 579\ \mathrm{cm}^2$). Feature importance analysis identified height, width, and surface area as key predictors. This scalable, automated TLA estimation method is suited for urban farming and precision agriculture, offering applications in automated pruning, resource efficiency, and sustainable food production. The approach demonstrated robustness across variable environmental conditions and canopy structures.

Title: Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13792
Pdf URL: https://arxiv.org/pdf/2503.13792
Copy Paste: [[2503.13792]] Identifying and Mitigating Position Bias of Multi-image Vision-Language Models(https://arxiv.org/abs/2503.13792)
Keywords: robust
Abstract: The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
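
The interpolation idea described above can be illustrated with a small sketch. This is not the authors' implementation: it treats each position as a single image-level token, interpolates the attention weights directly, and uses a hypothetical mixing parameter `sigma`; the actual SoFA formulation may differ in all of these respects.

```python
import math

def row_softmax(row, allowed):
    """Softmax over the allowed entries of a row; masked entries get weight 0."""
    peak = max(v for v, ok in zip(row, allowed) if ok)
    exps = [math.exp(v - peak) if ok else 0.0 for v, ok in zip(row, allowed)]
    norm = sum(exps)
    return [e / norm for e in exps]

def causal_attention(scores):
    """Position i attends only to positions j <= i."""
    n = len(scores)
    return [row_softmax(scores[i], [j <= i for j in range(n)]) for i in range(n)]

def bidirectional_attention(scores):
    """Every position attends to every position."""
    n = len(scores)
    return [row_softmax(scores[i], [True] * n) for i in range(n)]

def interpolated_attention(scores, sigma):
    """Convex combination: sigma=0 is purely causal, sigma=1 purely bidirectional."""
    causal = causal_attention(scores)
    full = bidirectional_attention(scores)
    return [
        [(1.0 - sigma) * c + sigma * f for c, f in zip(c_row, f_row)]
        for c_row, f_row in zip(causal, full)
    ]

# Toy 3x3 score matrix: three "images", one token each (illustrative values).
scores = [[0.2, 1.0, -0.5], [0.7, 0.1, 0.3], [-0.4, 0.9, 0.6]]
for row in interpolated_attention(scores, sigma=0.5):
    print([round(w, 3) for w in row])
```

Since each row is a convex combination of two probability distributions, the interpolated weights still sum to one, while earlier positions gain partial access to later ones.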

Title: LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Authors: Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Dimitris N. Metaxas
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13794
Pdf URL: https://arxiv.org/pdf/2503.13794
Copy Paste: [[2503.13794]] LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation(https://arxiv.org/abs/2503.13794)
Keywords: large language model
Abstract: Large foundation models trained on large-scale visual-text data can significantly enhance Open Vocabulary Object Detection (OVD) through data generation. However, this may lead to biased synthetic data and overfitting to specific configurations. The biases of manually curated data generation can be sidestepped by directly leveraging the hidden states of Large Language Models (LLMs), an option that is surprisingly rarely explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge transfer from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We demonstrate that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations that are beneficial to grounding tasks. Experiments show that our adaptation strategy significantly enhances the performance on complex free-form text queries while remaining the same on plain categories. With our adaptation, Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% on Omnilabel, at the overhead of 8.7% more GFLOPs. Qwen2-0.5B with a larger vision encoder can further boost the performance by 6.22%. We further validate our design by ablating on varied adapter architectures, sizes of LLMs, and which layers to add adaptation.

Title: AI-Powered Prediction of Nanoparticle Pharmacokinetics: A Multi-View Learning Approach

Authors: Amirhossein Khakpour, Lucia Florescu, Richard Tilley, Haibo Jiang, K. Swaminathan Iyer, Gustavo Carneiro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13798
Pdf URL: https://arxiv.org/pdf/2503.13798
Copy Paste: [[2503.13798]] AI-Powered Prediction of Nanoparticle Pharmacokinetics: A Multi-View Learning Approach(https://arxiv.org/abs/2503.13798)
Keywords: robust, interpretability
Abstract: The clinical translation of nanoparticle-based treatments remains limited due to the unpredictability of nanoparticle (NP) pharmacokinetics: how NPs distribute, accumulate, and clear from the body. Predicting these behaviours is challenging due to complex biological interactions and the difficulty of obtaining high-quality experimental datasets. Existing AI-driven approaches rely heavily on data-driven learning but fail to integrate crucial knowledge about NP properties and biodistribution mechanisms. We introduce a multi-view deep learning framework that enhances pharmacokinetic predictions by incorporating prior knowledge of key NP properties such as size and charge into a cross-attention mechanism, enabling context-aware feature selection and improving generalization despite small datasets. To further enhance prediction robustness, we employ an ensemble learning approach, combining deep learning with XGBoost (XGB) and Random Forest (RF), which significantly outperforms existing AI models. Our interpretability analysis reveals key physicochemical properties driving NP biodistribution, providing biologically meaningful insights into possible mechanisms governing NP behaviour in vivo rather than a black-box model. Furthermore, by bridging machine learning with physiologically based pharmacokinetic (PBPK) modelling, this work lays the foundation for data-efficient AI-driven drug discovery and precision nanomedicine.

Title: SMILE: a Scale-aware Multiple Instance Learning Method for Multicenter STAS Lung Cancer Histopathology Diagnosis

Authors: Liangrui Pan, Xiaoyu Li, Yutao Dou, Qiya Song, Jiadi Luo, Qingchun Liang, Shaoliang Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13799
Pdf URL: https://arxiv.org/pdf/2503.13799
Copy Paste: [[2503.13799]] SMILE: a Scale-aware Multiple Instance Learning Method for Multicenter STAS Lung Cancer Histopathology Diagnosis(https://arxiv.org/abs/2503.13799)
Keywords: interpretability
Abstract: Spread through air spaces (STAS) represents a newly identified aggressive pattern in lung cancer, which is known to be associated with adverse prognostic factors and complex pathological features. Pathologists currently rely on time-consuming manual assessments, which are highly subjective and prone to variation. This highlights the urgent need for automated and precise diagnostic solutions. We collected 2,970 lung cancer tissue slides from multiple centers, re-diagnosed them, and constructed and publicly released three lung cancer STAS datasets: STAS CSU (hospital), STAS TCGA, and STAS CPTAC. All STAS datasets provide corresponding pathological feature diagnoses and related clinical data. To address the biased, sparse, and heterogeneous nature of STAS, we propose a scale-aware multiple instance learning (SMILE) method for STAS diagnosis of lung cancer. By introducing a scale-adaptive attention mechanism, SMILE can adaptively adjust high-attention instances, reducing over-reliance on local regions and promoting consistent detection of STAS lesions. Extensive experiments show that SMILE achieved competitive diagnostic results on STAS CSU, diagnosing 251 and 319 STAS samples in CPTAC and TCGA, respectively, surpassing the clinical average AUC. The 11 open baseline results are the first to be established for STAS research, laying the foundation for the future expansion, interpretability, and clinical integration of computational pathology technologies. The datasets and code are available at this https URL.

Title: Text-Guided Image Invariant Feature Learning for Robust Image Watermarking

Authors: Muhammad Ahtesham, Xin Zhong
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.13805
Pdf URL: https://arxiv.org/pdf/2503.13805
Copy Paste: [[2503.13805]] Text-Guided Image Invariant Feature Learning for Robust Image Watermarking(https://arxiv.org/abs/2503.13805)
Keywords: robust, extraction, watermark
Abstract: Ensuring robustness in image watermarking is crucial for maintaining content integrity under diverse transformations. Recent self-supervised learning (SSL) approaches, such as DINO, have been leveraged for watermarking but primarily focus on general feature representation rather than explicitly learning invariant features. In this work, we propose a novel text-guided invariant feature learning framework for robust image watermarking. Our approach leverages CLIP's multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. We evaluate the proposed method across multiple datasets, demonstrating superior robustness against various image transformations. Compared to state-of-the-art SSL methods, our model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. These results highlight the efficacy of our method in learning invariant representations tailored for robust deep learning-based watermarking.
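The feature consistency test mentioned above compares features by cosine similarity. As a minimal sketch of that metric only (not the paper's evaluation code), the function below compares two feature vectors, for example the features of a clean image and of a distorted copy; the function name and the plain-list representation are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

A value near 1 means the feature direction barely moved under the distortion; a value near 0 means the two features are close to orthogonal.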

Title: Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering

Authors: Wenjie Zhang, Ziyang Zhang, Mengnan He, Jiancheng Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13806
Pdf URL: https://arxiv.org/pdf/2503.13806
Copy Paste: [[2503.13806]] Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering(https://arxiv.org/abs/2503.13806)
Keywords: segmentation
Abstract: Accurate segmentation is essential for effective treatment planning and disease monitoring. Existing medical image segmentation methods predominantly rely on uni-modal visual inputs, such as images or videos, requiring labor-intensive manual annotations. Additionally, medical imaging techniques capture multiple intertwined organs within a single scan, further complicating segmentation accuracy. To address these challenges, MedSAM, a large-scale medical segmentation model based on the Segment Anything Model (SAM), was developed to enhance segmentation accuracy by integrating image features with user-provided prompts. While MedSAM has demonstrated strong performance across various medical segmentation tasks, it primarily relies on geometric prompts (e.g., points and bounding boxes) and lacks support for text-based prompts, which could help specify subtle or ambiguous anatomical structures. To overcome these limitations, we propose the Organ-aware Multi-scale Text-guided Medical Image Segmentation Model (OMT-SAM) for multi-organ segmentation. Our approach introduces CLIP encoders as a novel image-text prompt encoder, operating with the geometric prompt encoder to provide informative contextual guidance. We pair descriptive textual prompts with corresponding images, processing them through pre-trained CLIP encoders and a cross-attention mechanism to generate fused image-text embeddings. Additionally, we extract multi-scale visual features from MedSAM, capturing fine-grained anatomical details at different levels of granularity. We evaluate OMT-SAM on the FLARE 2021 dataset, benchmarking its performance against existing segmentation methods. Empirical results demonstrate that OMT-SAM achieves a mean Dice Similarity Coefficient of 0.937, outperforming MedSAM (0.893) and other segmentation models, highlighting its superior capability in handling complex medical image segmentation tasks.
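The Dice Similarity Coefficient reported above is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it on flat label lists and averages over organs; the helper names, the flat-list layout, and the empty-mask convention are assumptions for illustration, not the paper's evaluation script.

```python
def dice_coefficient(pred, target):
    """Dice = 2*|A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, organ_ids):
    """Average the per-organ Dice over flat integer label maps."""
    scores = [
        dice_coefficient(
            [p == k for p in pred_labels],
            [t == k for t in target_labels],
        )
        for k in organ_ids
    ]
    return sum(scores) / len(scores)
```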

Title: FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification

Authors: Jinping Wang, Weiwei Song, Hao Chen, Jinchang Ren, Huimin Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13814
Pdf URL: https://arxiv.org/pdf/2503.13814
Copy Paste: [[2503.13814]] FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification(https://arxiv.org/abs/2503.13814)
Keywords: diffusion
Abstract: World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at this https URL.

Title: MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Authors: Zhixuan Liu, Haokun Zhu, Rui Chen, Jonathan Francis, Soonmin Hwang, Ji Zhang, Jean Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13816
Pdf URL: https://arxiv.org/pdf/2503.13816
Copy Paste: [[2503.13816]] MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments(https://arxiv.org/abs/2503.13816)
Keywords: privacy, diffusion
Abstract: We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids error accumulation common in sequential or single-room constraint in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Project page is available at: this https URL

Title: Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection

Authors: Chunlei Li, Yilei Shi, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13828
Pdf URL: https://arxiv.org/pdf/2503.13828
Copy Paste: [[2503.13828]] Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection(https://arxiv.org/abs/2503.13828)
Keywords: generative
Abstract: Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at this https URL.

Title: SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Authors: Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Junyong Noh
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13836
Pdf URL: https://arxiv.org/pdf/2503.13836
Copy Paste: [[2503.13836]] SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing(https://arxiv.org/abs/2503.13836)
Keywords: diffusion
Abstract: Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at project page.

Title: Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations

Authors: Rui Yang, Jiayi Tong, Haoyuan Wang, Hui Huang, Ziyang Hu, Peiyu Li, Nan Liu, Christopher J. Lindsell, Michael J. Pencina, Yong Chen, Chuan Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13857
Pdf URL: https://arxiv.org/pdf/2503.13857
Copy Paste: [[2503.13857]] Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations(https://arxiv.org/abs/2503.13857)
Keywords: extraction, large language model
Abstract: Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for binary outcome and a survival cure model that predicts both binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings. Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate systematic incorporation of preprint articles in evidence-based medicine, supporting researchers in more effective evaluation and utilization of preprint resources.
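The concordance index quoted above measures how often a model ranks two subjects in the same order as their observed event times. A minimal sketch of Harrell's version for right-censored data follows, assuming that the event is publication, that a higher risk score means earlier expected publication, and that risk ties count as half concordant. The abstract does not say which variant was used, so this is illustrative only.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked in the right order.

    A pair (i, j) is comparable when subject i had the event (events[i] is
    truthy) and did so strictly before time j. It is concordant when the
    model gave i the higher risk; ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.658 and 0.667 in context.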

Title: Less is More: Improving Motion Diffusion Models with Sparse Keyframes

Authors: Jinseok Bae, Inwoo Hwang, Young Yoon Lee, Ziyu Guo, Joseph Liu, Yizhak Ben-Shabat, Young Min Kim, Mubbasir Kapadia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13859
Pdf URL: https://arxiv.org/pdf/2503.13859
Copy Paste: [[2503.13859]] Less is More: Improving Motion Diffusion Models with Sparse Keyframes(https://arxiv.org/abs/2503.13859)
Keywords: robust, diffusion, generative
Abstract: Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.

Title: Robust3D-CIL: Robust Class-Incremental Learning for 3D Perception

Authors: Jinge Ma, Jiangpeng He, Fengqing Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13869
Pdf URL: https://arxiv.org/pdf/2503.13869
Copy Paste: [[2503.13869]] Robust3D-CIL: Robust Class-Incremental Learning for 3D Perception(https://arxiv.org/abs/2503.13869)
Keywords: robust
Abstract: 3D perception plays a crucial role in real-world applications such as autonomous driving, robotics, and AR/VR. In practical scenarios, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch incurs prohibitive costs. Therefore, adopting class-incremental learning (CIL) becomes particularly essential. However, real-world 3D point cloud data often include corrupted samples, which poses significant challenges for existing CIL methods and leads to more severe forgetting on corrupted data. To address these challenges, we consider the scenario in which a CIL model can be updated using point clouds with unknown corruption to better simulate real-world conditions. Inspired by Farthest Point Sampling, we propose a novel exemplar selection strategy that effectively preserves intra-class diversity when selecting replay exemplars, mitigating forgetting induced by data corruption. Furthermore, we introduce a point cloud downsampling-based replay method to utilize the limited replay buffer memory more efficiently, thereby further enhancing the model's continual learning ability. Extensive experiments demonstrate that our method improves the performance of replay-based CIL baselines by 2% to 11%, proving its effectiveness and promising potential for real-world 3D applications.
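The exemplar selection above is described as inspired by Farthest Point Sampling (FPS). The sketch below is plain greedy FPS over a list of coordinate tuples, not the paper's selection strategy; the function name, the squared Euclidean distance, and the fixed starting index are assumptions for illustration.

```python
def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each one farthest from those already picked."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start]
    # min_dist[i] is the squared distance from point i to its nearest selected point.
    min_dist = [sq_dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Because each new pick maximizes the distance to the current set, the selected subset spreads out over the data, which is the property that helps preserve diversity in a small replay buffer.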

Title: Empirical Calibration and Metric Differential Privacy in Language Models

Authors: Pedro Faustini, Natasha Fernandes, Annabelle McIver, Mark Dras
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13872
Pdf URL: https://arxiv.org/pdf/2503.13872
Copy Paste: [[2503.13872]] Empirical Calibration and Metric Differential Privacy in Language Models(https://arxiv.org/abs/2503.13872)
Keywords: privacy, attack, membership infer
Abstract: NLP models trained with differential privacy (DP) usually adopt the DP-SGD framework, and privacy guarantees are often reported in terms of the privacy budget $\epsilon$. However, $\epsilon$ does not have any intrinsic meaning, and it is generally not possible to compare across variants of the framework. Work in image processing has therefore explored how to empirically calibrate noise across frameworks using Membership Inference Attacks (MIAs). However, this kind of calibration has not been established for NLP. In this paper, we show that MIAs offer little help in calibrating privacy, whereas reconstruction attacks are more useful. As a use case, we define a novel kind of directional privacy based on the von Mises-Fisher (VMF) distribution, a metric DP mechanism that perturbs angular distance rather than adding (isotropic) Gaussian noise, and apply this to NLP architectures. We show that, even though formal guarantees are incomparable, empirical privacy calibration reveals that each mechanism has different areas of strength with respect to utility-privacy trade-offs.

Title: Multi-label feature selection based on binary hashing learning and dynamic graph constraints

Authors: Cong Guo, Changqin Huang, Wenhua Zhou, Xiaodi Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13874
Pdf URL: https://arxiv.org/pdf/2503.13874
Copy Paste: [[2503.13874]] Multi-label feature selection based on binary hashing learning and dynamic graph constraints(https://arxiv.org/abs/2503.13874)
Keywords: robust
Abstract: Multi-label learning poses significant challenges in extracting reliable supervisory signals from the label space. Existing approaches often employ continuous pseudo-labels to replace binary labels, improving supervisory information representation. However, these methods can introduce noise from irrelevant labels and lead to unreliable graph structures. To overcome these limitations, this study introduces a novel multi-label feature selection method called Binary Hashing and Dynamic Graph Constraint (BHDG), the first method to integrate binary hashing into multi-label learning. BHDG utilizes low-dimensional binary hashing codes as pseudo-labels to reduce noise and improve representation robustness. A dynamically constrained sample projection space is constructed based on the graph structure of these binary pseudo-labels, enhancing the reliability of the dynamic graph. To further enhance pseudo-label quality, BHDG incorporates label graph constraints and inner product minimization within the sample space. Additionally, an $l_{2,1}$-norm regularization term is added to the objective function to facilitate the feature selection process. The augmented Lagrangian multiplier (ALM) method is employed to optimize binary variables effectively. Comprehensive experiments on 10 benchmark datasets demonstrate that BHDG outperforms ten state-of-the-art methods across six evaluation metrics. BHDG achieves the highest overall performance ranking, surpassing the next-best method by an average of at least 2.7 ranks per metric, underscoring its effectiveness and robustness in multi-label feature selection.
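The $l_{2,1}$-norm regularizer mentioned above is the sum of the Euclidean norms of a matrix's rows, which pushes whole rows toward zero. A minimal sketch follows, assuming the common feature-selection layout in which row i of a weight matrix W belongs to feature i; the function names are hypothetical and this is not BHDG's objective or optimizer.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of W (a list of rows)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """l_{2,1} norm: sum over rows of the row-wise l_2 norm."""
    return sum(row_norms(W))


def rank_features(W):
    """Feature indices ordered by decreasing row norm (assumed importance)."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
```

Rows driven to zero by the penalty correspond to features that can be dropped, which is why ranking by row norm is a typical last step in such methods.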

Title: MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation

Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13881
Pdf URL: https://arxiv.org/pdf/2503.13881
Copy Paste: [[2503.13881]] MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation(https://arxiv.org/abs/2503.13881)
Keywords: large language model, segmentation
Abstract: The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV", there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.

Title: MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

Authors: Zhengsheng Guo, Linwei Zheng, Xinyang Chen, Xuefeng Bai, Kehai Chen, Min Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13882
Pdf URL: https://arxiv.org/pdf/2503.13882
Copy Paste: [[2503.13882]] MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments(https://arxiv.org/abs/2503.13882)
Keywords: large language model
Abstract: While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture of knowledge paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Different from previous methods that only use manual evaluation, we pioneered the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.

Title: YOLO-LLTS: Real-Time Low-Light Traffic Sign Detection via Prior-Guided Enhancement and Multi-Branch Feature Interaction

Authors: Ziyu Lin, Yunfan Wu, Yuhang Ma, Junzhou Chen, Ronghui Zhang, Jiaming Wu, Guodong Yin, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13883
Pdf URL: https://arxiv.org/pdf/2503.13883
Copy Paste: [[2503.13883]] YOLO-LLTS: Real-Time Low-Light Traffic Sign Detection via Prior-Guided Enhancement and Multi-Branch Feature Interaction(https://arxiv.org/abs/2503.13883)
Keywords: extraction
Abstract: Detecting traffic signs effectively under low-light conditions remains a significant challenge. To address this issue, we propose YOLO-LLTS, an end-to-end real-time traffic sign detection algorithm specifically designed for low-light environments. Firstly, we introduce the High-Resolution Feature Map for Small Object Detection (HRFM-TOD) module to address indistinct small-object features in low-light scenarios. By leveraging high-resolution feature maps, HRFM-TOD effectively mitigates the feature dilution problem encountered in conventional PANet frameworks, thereby enhancing both detection accuracy and inference speed. Secondly, we develop the Multi-branch Feature Interaction Attention (MFIA) module, which facilitates deep feature interaction across multiple receptive fields in both channel and spatial dimensions, significantly improving the model's information extraction capabilities. Finally, we propose the Prior-Guided Enhancement Module (PGFE) to tackle common image quality challenges in low-light environments, such as noise, low contrast, and blurriness. This module employs prior knowledge to enrich image details and enhance visibility, substantially boosting detection performance. To support this research, we construct a novel dataset, the Chinese Nighttime Traffic Sign Sample Set (CNTSSS), covering diverse nighttime scenarios, including urban, highway, and rural environments under varying weather conditions. Experimental evaluations demonstrate that YOLO-LLTS achieves state-of-the-art performance, outperforming the previous best methods by 2.7% mAP50 and 1.6% mAP50:95 on TT100K-night, 1.3% mAP50 and 1.9% mAP50:95 on CNTSSS, and achieving superior results on the CCTSDB2021 dataset. Moreover, deployment experiments on edge devices confirm the real-time applicability and effectiveness of our proposed approach.
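The mAP50 and mAP50:95 figures above rest on intersection-over-union (IoU). Under the COCO convention, mAP50 counts a detection as correct when its IoU with a ground-truth box is at least 0.5, and mAP50:95 averages over thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below covers only the IoU computation and the threshold list, not a full average-precision evaluation; the `(x1, y1, x2, y2)` box layout and the names are assumptions for illustration.

```python
# IoU thresholds averaged by mAP50:95 (COCO convention).
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Whether a prediction matches a ground-truth box at the given IoU threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```

Small objects such as distant traffic signs are sensitive to this measure, because a shift of a few pixels can move the IoU below the threshold.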

Title: COMM:Concentrated Margin Maximization for Robust Document-Level Relation Extraction

Authors: Zhichao Duan, Tengyu Pan, Zhenyu Li, Xiuxing Li, Jianyong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13885
Pdf URL: https://arxiv.org/pdf/2503.13885
Copy Paste: [[2503.13885]] COMM:Concentrated Margin Maximization for Robust Document-Level Relation Extraction(https://arxiv.org/abs/2503.13885)
Keywords: robust, extraction
Abstract: Document-level relation extraction (DocRE) is the process of identifying and extracting relations between entities that span multiple sentences within a document. Due to its realistic settings, DocRE has garnered increasing research attention in recent years. Previous research has mostly focused on developing sophisticated encoding models to better capture the intricate patterns between entity pairs. While these advancements are undoubtedly crucial, an even more foundational challenge lies in the data itself. The complexity inherent in DocRE makes the labeling process prone to errors, compounded by the extreme sparsity of positive relation samples, which is driven by both the limited availability of positive instances and the broad diversity of positive relation types. These factors can lead to biased optimization processes, further complicating the task of accurate relation extraction. Recognizing these challenges, we have developed a robust framework called COMM to better solve DocRE. COMM operates by initially employing an instance-aware reasoning method to dynamically capture pertinent information of entity pairs within the document and extract relational features. Following this, COMM takes into account the distribution of relations and the difficulty of samples to dynamically adjust the margins between prediction logits and the decision threshold, a process we call Concentrated Margin Maximization. In this way, COMM not only enhances the extraction of relevant relational features but also boosts DocRE performance by addressing the specific challenges posed by the data. Extensive experiments and analysis demonstrate the versatility and effectiveness of COMM, especially its robustness when trained on low-quality data (achieves >10% performance gains).

Title: Exploiting Inherent Class Label: Towards Robust Scribble Supervised Semantic Segmentation

Authors: Xinliang Zhang, Lei Zhu, Shuang Zeng, Hangzhou He, Ourui Fu, Zhengjian Yao, Zhaoheng Xie, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13895
Pdf URL: https://arxiv.org/pdf/2503.13895
Copy Paste: [[2503.13895]] Exploiting Inherent Class Label: Towards Robust Scribble Supervised Semantic Segmentation(https://arxiv.org/abs/2503.13895)
Keywords: robust, segmentation
Abstract: Scribble-based weakly supervised semantic segmentation leverages only a few annotated pixels as labels to train a segmentation model, presenting significant potential for reducing the human labor involved in the annotation process. This approach faces two primary challenges: first, the sparsity of scribble annotations can lead to inconsistent predictions due to limited supervision; second, the variability in scribble annotations, reflecting differing human annotator preferences, can prevent the model from consistently capturing the discriminative regions of objects, potentially leading to unstable predictions. To address these issues, we propose a holistic framework, the class-driven scribble promotion network, for robust scribble-supervised semantic segmentation. This framework not only utilizes the provided scribble annotations but also leverages their associated class labels to generate reliable pseudo-labels. Within the network, we introduce a localization rectification module to mitigate noisy labels and a distance perception module to identify reliable regions surrounding scribble annotations and pseudo-labels. In addition, we introduce new large-scale benchmarks, ScribbleCOCO and ScribbleCityscapes, accompanied by a scribble simulation algorithm that enables evaluation across varying scribble styles. Our method demonstrates competitive performance in both accuracy and robustness, underscoring its superiority over existing approaches. The datasets and the codes will be made publicly available.

Title: TGBFormer: Transformer-GraphFormer Blender Network for Video Object Detection

Authors: Qiang Qi, Xiao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13903
Pdf URL: https://arxiv.org/pdf/2503.13903
Copy Paste: [[2503.13903]] TGBFormer: Transformer-GraphFormer Blender Network for Video Object Detection(https://arxiv.org/abs/2503.13903)
Keywords: transformer
Abstract: Video object detection has made significant progress in recent years thanks to convolutional neural networks (CNNs) and vision transformers (ViTs). Typically, CNNs excel at capturing local features but struggle to model global representations. Conversely, ViTs are adept at capturing long-range global features but face challenges in representing local feature details. Off-the-shelf video object detection methods solely rely on CNNs or ViTs to conduct feature aggregation, which hampers their capability to simultaneously leverage global and local information, thereby resulting in limited detection performance. In this paper, we propose a Transformer-GraphFormer Blender Network (TGBFormer) for video object detection, with three key technical improvements to fully exploit the advantages of transformers and graph convolutional networks while compensating for their limitations. First, we develop a spatial-temporal transformer module to aggregate global contextual information, constituting global representations with long-range feature dependencies. Second, we introduce a spatial-temporal GraphFormer module that utilizes local spatial and temporal relationships to aggregate features, generating new local representations that are complementary to the transformer outputs. Third, we design a global-local feature blender module to adaptively couple transformer-based global representations and GraphFormer-based local representations. Extensive experiments demonstrate that our TGBFormer establishes new state-of-the-art results on the ImageNet VID dataset. Particularly, our TGBFormer achieves 86.5% mAP while running at around 41.0 FPS on a single Tesla A100 GPU.

Title: Quantification of Uncertainties in Probabilistic Deep Neural Network by Implementing Boosting of Variational Inference

Authors: Pavia Bera, Sanjukta Bhanja
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.13909
Pdf URL: https://arxiv.org/pdf/2503.13909
Copy Paste: [[2503.13909]] Quantification of Uncertainties in Probabilistic Deep Neural Network by Implementing Boosting of Variational Inference(https://arxiv.org/abs/2503.13909)
Keywords: interpretability
Abstract: Modern neural network architectures have achieved remarkable accuracies but remain highly dependent on their training data, often lacking interpretability in their learned mappings. While effective on large datasets, they tend to overfit on smaller ones. Probabilistic neural networks, such as those utilizing variational inference, address this limitation by incorporating uncertainty estimation through weight distributions rather than point estimates. However, standard variational inference often relies on a single-density approximation, which can lead to poor posterior estimates and hinder model performance. We propose Boosted Bayesian Neural Networks (BBNN), a novel approach that enhances neural network weight distribution approximations using Boosting Variational Inference (BVI). By iteratively constructing a mixture of densities, BVI expands the approximating family, enabling a more expressive posterior that leads to improved generalization and uncertainty estimation. While this approach increases computational complexity, it significantly enhances accuracy, an essential tradeoff, particularly in high-stakes applications such as medical diagnostics, where false negatives can have severe consequences. Our experimental results demonstrate that BBNN achieves ~5% higher accuracy compared to conventional neural networks while providing superior uncertainty quantification. This improvement highlights the effectiveness of leveraging a mixture-based variational family to better approximate the posterior distribution, ultimately advancing probabilistic deep learning.
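Editor's sketch (not from the paper): the "iteratively constructing a mixture of densities" step can be illustrated with the convex-combination update commonly used in boosting variational inference, q_new = (1 - gamma) * q_old + gamma * h. The function names, the choice of one-dimensional Gaussian components, and the hand-picked gamma are all illustrative assumptions. In an actual BVI procedure the new component and its weight would be chosen by optimizing a variational objective, which is not shown here.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def add_component(mixture, gamma, mean, std):
    """Return the mixture (1 - gamma) * q_old + gamma * N(mean, std^2).

    `mixture` is a list of (weight, mean, std) tuples whose weights sum to 1.
    gamma is fixed by hand here (assumption); BVI would optimize it.
    """
    rescaled = [((1.0 - gamma) * w, m, s) for (w, m, s) in mixture]
    return rescaled + [(gamma, mean, std)]


def mixture_pdf(x, mixture):
    """Density of the Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, s) for (w, m, s) in mixture)


# Start from a single-density approximation, then add one component.
q = [(1.0, 0.0, 1.0)]
q = add_component(q, gamma=0.3, mean=2.0, std=0.5)
```

Because each update is a convex combination, the weights keep summing to 1, so the result remains a valid density after every boosting step.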

Title: PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds

Authors: Barza Nisar, Steven L. Waslander
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13914
Pdf URL: https://arxiv.org/pdf/2503.13914
Copy Paste: [[2503.13914]] PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds(https://arxiv.org/abs/2503.13914)
Keywords: segmentation
Abstract: Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. Our approach defines a self-supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pretrained model, our lightweight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation (using up to 10 times fewer labels), as well as on 3D object detection. Our code will be released on this https URL.

Title: Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels

Authors: Yujia Tong, Yuze Wang, Jingling Yuan, Chuang Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13917
Pdf URL: https://arxiv.org/pdf/2503.13917
Copy Paste: [[2503.13917]] Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels(https://arxiv.org/abs/2503.13917)
Keywords: privacy, robust
Abstract: Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation, yet raises critical challenges for implementing machine unlearning (MU) under data privacy regulations. Existing MU methods designed for full-precision models fail to address two fundamental limitations in quantized networks: 1) Noise amplification from label mismatch during data processing, and 2) Gradient imbalance between forgotten and retained data during training. These issues are exacerbated by quantized models' constrained parameter space and discrete optimization. We propose Q-MUL, the first dedicated unlearning framework for quantized models. Our method introduces two key innovations: 1) Similar Labels assignment replaces random labels with semantically consistent alternatives to minimize noise injection, and 2) Adaptive Gradient Reweighting dynamically aligns parameter update contributions from forgotten and retained data. Through systematic analysis of quantized model vulnerabilities, we establish theoretical foundations for these mechanisms. Extensive evaluations on benchmark datasets demonstrate Q-MUL's superiority over existing approaches.

Title: ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models

Authors: Alexey Karev, Dong Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13923
Pdf URL: https://arxiv.org/pdf/2503.13923
Copy Paste: [[2503.13923]] ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models(https://arxiv.org/abs/2503.13923)
Keywords: generative, large language model
Abstract: Large language models (LLMs) have been one of the most important discoveries in machine learning in recent years. LLM-based artificial intelligence (AI) assistants, such as ChatGPT, have consistently attracted attention from researchers, investors, and the general public, driving the rapid growth of this industry. With the frequent introduction of new LLMs to the market, it becomes increasingly difficult to differentiate between them, creating a demand for new LLM comparison methods. In this research, the Consistency-focused Similarity Comparison Framework (ConSCompF) for generative large language models is proposed. It compares texts generated by two LLMs and produces a similarity score, indicating the overall degree of similarity between their responses. The main advantage of this framework is that it can operate on a small amount of unlabeled data, such as chatbot instruction prompts, and does not require LLM developers to disclose any information about their product. To evaluate the efficacy of ConSCompF, two experiments aimed at identifying similarities between multiple LLMs are conducted. Additionally, these experiments examine the correlation between the similarity scores generated by ConSCompF and the differences in the outputs produced by other benchmarking techniques, such as ROUGE-L. Finally, a series of experiments is conducted to evaluate the performance of ConSCompF in a few-shot LLM comparison scenario. The proposed framework can be used for calculating similarity matrices of multiple LLMs, which can be effectively visualized using principal component analysis (PCA). The ConSCompF output may provide useful insights into data that might have been used during LLM training and help detect possible investment fraud attempts.

Title: Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning

Authors: Da Kuang, Guanwen Qiu, Junhyong Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13925
Pdf URL: https://arxiv.org/pdf/2503.13925
Copy Paste: [[2503.13925]] Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning(https://arxiv.org/abs/2503.13925)
Keywords: transformer
Abstract: How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells grow, divide, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell lineage division and differentiation histories, providing an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation. Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. In contrast, modern single-cell technologies can measure high-content molecular profiles (e.g., transcriptomes) in a wide range of biological systems. Here, we introduce CellTreeQM, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating lineage reconstruction as a tree-metric learning problem, we have systematically explored supervised, weakly supervised, and unsupervised training settings and present a Lineage Reconstruction Benchmark to facilitate comprehensive evaluation of our learning method. We benchmarked the method on (1) synthetic data modeled via Brownian motion with independent noise and spurious signals and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships in challenging animal models. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.
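Editor's sketch (not from the paper): one standard way to characterize a tree metric is the four-point condition, which says that for any four points the two largest of the three pairwise-sum combinations are equal. The code below measures how far a distance matrix is from satisfying it. Whether CellTreeQM uses exactly this condition in its training objective is an assumption; the function names and toy matrices are hypothetical.

```python
from itertools import combinations


def four_point_gap(d, i, j, k, l):
    """Gap between the two largest pairwise sums for points i, j, k, l.

    The gap is 0 when the four-point condition holds for this quadruple.
    """
    sums = sorted([
        d[i][j] + d[k][l],
        d[i][k] + d[j][l],
        d[i][l] + d[j][k],
    ])
    return sums[2] - sums[1]


def max_four_point_gap(d):
    """Largest four-point gap over all quadruples of a distance matrix."""
    n = len(d)
    return max(
        (four_point_gap(d, *quad) for quad in combinations(range(n), 4)),
        default=0,
    )


# Distances between points on a path (a simple tree): the gap is zero.
positions = [0, 1, 3, 6]
path = [[abs(a - b) for b in positions] for a in positions]

# Shortest-path distances on a 4-cycle, which is not a tree: the gap is positive.
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
```

A learned embedding whose pairwise distances have a small maximum gap is close to a tree metric, which is the geometric property a tree-reconstruction method needs downstream.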

Title: Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation

Authors: Huan Ren, Wenfei Yang, Xiang Liu, Shifeng Zhang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13926
Pdf URL: https://arxiv.org/pdf/2503.13926
Copy Paste: [[2503.13926]] Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation(https://arxiv.org/abs/2503.13926)
Keywords: robust, extraction
Abstract: Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. First, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Second, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point clouds. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, which can promote the precision of correspondence prediction. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.

Title: Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

Authors: Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Xiaofeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13939
Pdf URL: https://arxiv.org/pdf/2503.13939
Copy Paste: [[2503.13939]] Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models(https://arxiv.org/abs/2503.13939)
Keywords: robust
Abstract: Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored. Medical reasoning tasks demand robust image analysis and well-justified answers, posing challenges due to the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance. We introduce Med-R1, a framework exploring reinforcement learning (RL) to enhance VLMs' generalizability and trustworthiness in medical reasoning. Leveraging the DeepSeek strategy, we employ Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Unlike supervised fine-tuning (SFT), which often overfits and lacks generalization, RL fosters robust and diverse reasoning. Med-R1 is evaluated across eight medical imaging modalities: CT, MRI, Ultrasound, Dermoscopy, Fundus Photography, Optical Coherence Tomography (OCT), Microscopy, and X-ray Imaging. Compared to its base model, Qwen2-VL-2B, Med-R1 achieves a 29.94% accuracy improvement and outperforms Qwen2-VL-72B, which has 36 times more parameters. Tested across five question types (modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attribute analysis), Med-R1 demonstrates superior generalization, exceeding Qwen2-VL-2B by 32.06% and surpassing Qwen2-VL-72B in question-type generalization. These findings show that RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. With interpretable reasoning outputs, Med-R1 represents a promising step toward generalizable, trustworthy, and clinically viable medical VLMs.
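Editor's sketch (not from the paper): GRPO, as it is commonly described, scores each sampled response relative to the other responses drawn for the same prompt by normalizing rewards with the group mean and standard deviation. The function name, the use of the population standard deviation, and the small epsilon are illustrative assumptions rather than details of Med-R1's training setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards to zero mean and roughly unit scale.

    `rewards` holds the scalar rewards of several responses sampled for the
    same prompt. eps (assumption) avoids division by zero when all rewards
    in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.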

Title: Multi-Modal Self-Supervised Semantic Communication

Authors: Hang Zhao, Hongru Li, Dongfang Xu, Shenghui Song, Khaled B. Letaief
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.13940
Pdf URL: https://arxiv.org/pdf/2503.13940
Copy Paste: [[2503.13940]] Multi-Modal Self-Supervised Semantic Communication(https://arxiv.org/abs/2503.13940)
Keywords: extraction
Abstract: Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.

Title: Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization

Authors: Long Tang, Dengpan Ye, Sirun Chen, Xiuwen Shi, Yunna Lv, Ziyi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13945
Pdf URL: https://arxiv.org/pdf/2503.13945
Copy Paste: [[2503.13945]] Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization(https://arxiv.org/abs/2503.13945)
Keywords: privacy, attack, diffusion
Abstract: The fine-tuning technique for text-to-image diffusion models facilitates image customization but risks privacy breaches and opinion manipulation. Current research focuses on prompt- or image-level adversarial attacks for anti-customization, yet it overlooks the correlation between these two levels and the relationship between internal modules and inputs. This hinders anti-customization performance in practical threat scenarios. We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization, which, for the first time, integrates the adversarial prompt-level attack into the generation process of image-level adversarial examples. In stage 1, we generate prompt-level adversarial vectors to guide the subsequent image-level attack. In stage 2, besides conducting the end-to-end attack on the UNet model, we disrupt its self- and cross-attention modules, aiming to break the correlations between image pixels and align the cross-attention results computed using instance prompts and adversarial prompt vectors within the images. Furthermore, we introduce a local random timestep gradient ensemble strategy, which updates adversarial perturbations by integrating random gradients from multiple segmented timesets. Experimental results on various mainstream facial datasets demonstrate 10%-30% improvements in cross-prompt, keyword mismatch, cross-model, and cross-mechanism anti-customization with DADiff compared to existing methods.

Title: FrustumFusionNets: A Three-Dimensional Object Detection Network Based on Tractor Road Scene

Authors: Lili Yang, Mengshuai Chang, Xiao Guo, Yuxin Feng, Yiwen Mei, Caicong Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13951
Pdf URL: https://arxiv.org/pdf/2503.13951
Copy Paste: [[2503.13951]] FrustumFusionNets: A Three-Dimensional Object Detection Network Based on Tractor Road Scene(https://arxiv.org/abs/2503.13951)
Keywords: extraction
Abstract: To address the issues of the existing frustum-based methods' underutilization of image information in road three-dimensional object detection as well as the lack of research on agricultural scenes, we constructed an object detection dataset using an 80-line Light Detection And Ranging (LiDAR) and a camera in a complex tractor road scene and proposed a new network called FrustumFusionNets (FFNets). Initially, we utilize the results of image-based two-dimensional object detection to narrow down the search region in the three-dimensional space of the point cloud. Next, we introduce a Gaussian mask to enhance the point cloud information. Then, we extract the features from the frustum point cloud and the crop image using the point cloud feature extraction pipeline and the image feature extraction pipeline, respectively. Finally, we concatenate and fuse the data features from both modalities to achieve three-dimensional object detection. Experiments demonstrate that on the constructed test set of tractor road data, the FrustumFusionNetv2 achieves 82.28% and 95.68% accuracy in the three-dimensional object detection of the two main road objects, cars and people, respectively. This performance is 1.83% and 2.33% better than the original model. It offers a hybrid fusion-based multi-object, high-precision, real-time three-dimensional object detection technique for unmanned agricultural machines in tractor road scenarios. On the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) Benchmark Suite validation set, the FrustumFusionNetv2 also demonstrates significant superiority in detecting road pedestrian objects compared with other frustum-based three-dimensional object detection methods.

Title: SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model

Authors: Xinqing Li, Ruiqi Song, Qingyu Xie, Ye Wu, Nanxin Zeng, Yunfeng Ai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13952
Pdf URL: https://arxiv.org/pdf/2503.13952
Copy Paste: [[2503.13952]] SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model(https://arxiv.org/abs/2503.13952)
Keywords: robust, generative
Abstract: With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on a world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels can be collected for any scene and serve as the conditions for data generation in the world model. The result is a novel data generation pipeline that combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that these generated images significantly improve the performance of downstream perception models. Finally, we explore the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at this https URL.

Title: Improving LLM Video Understanding with 16 Frames Per Second

Authors: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13956
Pdf URL: https://arxiv.org/pdf/2503.13956
Copy Paste: [[2503.13956]] Improving LLM Video Understanding with 16 Frames Per Second(https://arxiv.org/abs/2503.13956)
Keywords: large language model
Abstract: Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of at most 2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.

Title: DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation

Authors: Mu Chen, Liulei Li, Wenguan Wang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13957
Pdf URL: https://arxiv.org/pdf/2503.13957
Copy Paste: [[2503.13957]] DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation(https://arxiv.org/abs/2503.13957)
Keywords: diffusion
Abstract: Leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images via denoising a latent feature embedding, we unify the decoding of three tasks (object classification, bounding box regression, and graph generation) using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding which clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.

Title: Survey of Adversarial Robustness in Multimodal Large Language Models

Authors: Chengze Jiang, Zhuangzhuang Wang, Minjing Dong, Jie Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13962
Pdf URL: https://arxiv.org/pdf/2503.13962
Copy Paste: [[2503.13962]] Survey of Adversarial Robustness in Multimodal Large Language Models(https://arxiv.org/abs/2503.13962)
Keywords: attack, robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance in artificial intelligence by facilitating integrated understanding across diverse modalities, including text, images, video, audio, and speech. However, their deployment in real-world applications raises significant concerns about adversarial vulnerabilities that could compromise their safety and reliability. Unlike unimodal models, MLLMs face unique challenges due to the interdependencies among modalities, making them susceptible to modality-specific threats and cross-modal adversarial manipulations. This paper reviews the adversarial robustness of MLLMs, covering different modalities. We begin with an overview of MLLMs and a taxonomy of adversarial attacks tailored to each modality. Next, we review key datasets and evaluation metrics used to assess the robustness of MLLMs. After that, we provide an in-depth review of attacks targeting MLLMs across different modalities. Our survey also identifies critical challenges and suggests promising future research directions.

Title: MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Authors: Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13964
Pdf URL: https://arxiv.org/pdf/2503.13964
Copy Paste: [[2503.13964]] MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding(https://arxiv.org/abs/2503.13964)
Keywords: robust, large language model
Abstract: Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at this https URL.

Title: FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks

Authors: Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Longteng Guo, Zhihua Wei, Jing Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.13966
Pdf URL: https://arxiv.org/pdf/2503.13966
Copy Paste: [[2503.13966]] FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks(https://arxiv.org/abs/2503.13966)
Keywords: robust, large language model
Abstract: The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all the previous methods to a large extent.

Title: SoccerSynth Field: enhancing field detection with synthetic data from virtual soccer simulator

Authors: HaoBin Qin, Jiale Fang, Keisuke Fujii
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13969
Pdf URL: https://arxiv.org/pdf/2503.13969
Copy Paste: [[2503.13969]] SoccerSynth Field: enhancing field detection with synthetic data from virtual soccer simulator(https://arxiv.org/abs/2503.13969)
Keywords: robust
Abstract: Field detection in team sports is an essential task in sports video analysis. However, collecting large-scale and diverse real-world datasets for training detection models is often costly and time-consuming. Synthetic datasets, which allow controlled variability in lighting, textures, and camera angles, are a promising alternative for addressing these problems. This study addresses the challenges of high costs and difficulties in collecting real-world datasets by investigating the effectiveness of pretraining models using synthetic datasets. In this paper, we propose using a synthetic dataset (SoccerSynth-Field) for soccer field detection. A synthetic soccer field dataset was created to pretrain models, and the performance of these models was compared with models trained on real-world datasets. The results demonstrate that models pretrained on the synthetic dataset exhibit superior performance in detecting soccer fields. This highlights the effectiveness of synthetic data in enhancing model robustness and accuracy, offering a cost-effective and scalable solution for advancing sports field detection.

Title: Empowering LLMs in Decision Games through Algorithmic Data Synthesis

Authors: Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13980
Pdf URL: https://arxiv.org/pdf/2503.13980
Copy Paste: [[2503.13980]] Empowering LLMs in Decision Games through Algorithmic Data Synthesis(https://arxiv.org/abs/2503.13980)
Keywords: large language model
Abstract: Large Language Models (LLMs) have exhibited impressive capabilities across numerous domains, yet they often struggle with complex reasoning and decision-making tasks. Decision-making games, which inherently require multifaceted reasoning logic, serve as ideal sandboxes for evaluating and enhancing the reasoning abilities of LLMs. In this work, we first explore whether LLMs can master complex decision-making games through targeted post-training. To this end, we design data synthesis strategies and curate extensive offline datasets from two classic games, Doudizhu and Go. We further develop a suite of techniques to effectively incorporate this data into LLM training, resulting in two novel agents: Mastermind-Dou and Mastermind-Go. Our experimental results demonstrate that these Mastermind LLMs achieve competitive performance in their respective games. Additionally, we explore whether integrating decision-making data can enhance the general reasoning abilities of LLMs. Our findings suggest that such post-training improves certain aspects of reasoning, providing valuable insights for optimizing LLM data collection and synthesis strategies.

Title: A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios

Authors: Huy-Hoang Bui, Bach-Thuan Bui, Quang-Vinh Tran, Yasuyuki Fujii, Joo-Ho Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13982
Pdf URL: https://arxiv.org/pdf/2503.13982
Copy Paste: [[2503.13982]] A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios(https://arxiv.org/abs/2503.13982)
Keywords: transformer
Abstract: Visual localization is considered one of the crucial parts of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven to be accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which we argue misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture, called A-SCoRe, an attention-based model that leverages attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kinds of environments and conditions and to achieve the level of flexibility that is important for mobile robots. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are public in our repository: this https URL.

Title: SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Authors: Jiankang Wang, Zhihan zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13983
Pdf URL: https://arxiv.org/pdf/2503.13983
Copy Paste: [[2503.13983]] SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability(https://arxiv.org/abs/2503.13983)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information of each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a corresponding connection between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLM to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released.

Title: DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection

Authors: Jaewoo Song, Daemin Park, Kanghyun Baek, Sangyub Lee, Jooyoung Choi, Eunji Kim, Sungroh Yoon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13985
Pdf URL: https://arxiv.org/pdf/2503.13985
Copy Paste: [[2503.13985]] DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection(https://arxiv.org/abs/2503.13985)
Keywords: diffusion
Abstract: Developing effective visual inspection models remains challenging due to the scarcity of defect data. While image generation models have been used to synthesize defect images, producing highly realistic defects remains difficult. We propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. It leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions incorporating defect, object, and attention terms. It enables precise capture of detailed, localized defect features and their seamless integration into defect-free objects. Additionally, our Low-Fidelity Selection method further enhances the defect sample quality. Experiments show that DefectFill generates high-quality defect images, enabling visual inspection models to achieve state-of-the-art performance on the MVTec AD dataset.

Title: Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks

Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya, Nataliia Komleva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13988
Pdf URL: https://arxiv.org/pdf/2503.13988
Copy Paste: [[2503.13988]] Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks(https://arxiv.org/abs/2503.13988)
Keywords: robust, interpretability, large language model
Abstract: Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive training in low-resource settings with inaccessible infrastructure. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions resulted in a modest test score improvement of up to 17.4% on complex matching tasks and 1.6% overall compared to tuning on answer letters alone, offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model due to guiding the model to recall and apply domain-relevant information. Contrasting the obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude highlights that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models. This research also evaluates how merging the quantized adapter with the base model influences the generation quality. Source code and tuned models are available at this https URL.

Title: TarPro: Targeted Protection against Malicious Image Editing

Authors: Kaixin Shen, Ruijie Quan, Jiaxu Miao, Jun Xiao, Yi Yang
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.13994
Pdf URL: https://arxiv.org/pdf/2503.13994
Copy Paste: [[2503.13994]] TarPro: Targeted Protection against Malicious Image Editing(https://arxiv.org/abs/2503.13994)
Keywords: secure, protect, robust
Abstract: The rapid advancement of image editing techniques has raised concerns about their misuse for generating Not-Safe-for-Work (NSFW) content. This necessitates a targeted protection mechanism that blocks malicious edits while preserving normal editability. However, existing protection methods fail to achieve this balance, as they indiscriminately disrupt all edits while still allowing some harmful content to be generated. To address this, we propose TarPro, a targeted protection framework that prevents malicious edits while maintaining benign modifications. TarPro achieves this through a semantic-aware constraint that only disrupts malicious content and a lightweight perturbation generator that produces a more stable, imperceptible, and robust perturbation for image protection. Extensive experiments demonstrate that TarPro surpasses existing methods, achieving a high protection efficacy while ensuring minimal impact on normal edits. Our results highlight TarPro as a practical solution for secure and controlled image editing.

Title: Multimodal Feature-Driven Deep Learning for the Prediction of Duck Body Dimensions and Weight

Authors: Yi Xiao, Qiannan Han, Guiping Liang, Hongyan Zhang, Song Wang, Zhihao Xu, Weican Wan, Chuang Li, Guitao Jiang, Wenbo Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14001
Pdf URL: https://arxiv.org/pdf/2503.14001
Copy Paste: [[2503.14001]] Multimodal Feature-Driven Deep Learning for the Prediction of Duck Body Dimensions and Weight(https://arxiv.org/abs/2503.14001)
Keywords: robust, transformer
Abstract: Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data (2D RGB images from different views, depth images, and 3D point clouds) for the non-invasive estimation of duck body dimensions and weight. A dataset of 1,023 Linwu ducks, comprising over 5,000 samples with diverse postures and conditions, was collected to support model training. The proposed method innovatively employs PointNet++ to extract key feature points from point clouds, extracts and computes corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 6.33% and an R^2 of 0.953 across eight morphometric parameters, demonstrating strong predictive capability. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the necessity for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry with far-reaching practical significance.
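
The abstract above reports two regression metrics, MAPE and R^2. As a reading aid, here is a minimal sketch of their standard textbook definitions in plain Python. The function names `mape` and `r2_score` are chosen here for illustration, and the paper's own evaluation code may aggregate across the eight parameters differently.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no true value is zero)."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not data from the paper.
measured = [2.0, 4.0]
predicted = [1.0, 5.0]
print(mape(measured, predicted))
print(r2_score(measured, predicted))
```

A lower MAPE and an R^2 closer to 1 both indicate better agreement between predicted and measured values.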

Title: MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling

Authors: Damian Boborzi, Phillip Mueller, Jonas Emrich, Dominik Schmid, Sebastian Mueller, Lars Mikelsons
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14002
Pdf URL: https://arxiv.org/pdf/2503.14002
Copy Paste: [[2503.14002]] MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling(https://arxiv.org/abs/2503.14002)
Keywords: generative
Abstract: Generative models have recently made remarkable progress in the field of 3D objects. However, their practical application in fields like engineering remains limited since they fail to deliver the accuracy, quality, and controllability needed for domain-specific tasks. Fine-tuning large generative models is a promising perspective for making these models available in these fields. Creating high-quality, domain-specific 3D datasets is crucial for fine-tuning large generative models, yet the data filtering and annotation process remains a significant bottleneck. We present MeshFleet, a filtered and annotated 3D vehicle dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. Our approach proposes a pipeline for automated data filtering based on a quality classifier. This classifier is trained on a manually labeled subset of Objaverse, incorporating DINOv2 and SigLIP embeddings, refined through caption-based analysis and uncertainty estimation. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D, highlighting the importance of targeted data selection for domain-specific 3D generative modeling.

Title: Predicting Human Choice Between Textually Described Lotteries

Authors: Eyal Marantz, Ori Plonsky
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14004
Pdf URL: https://arxiv.org/pdf/2503.14004
Copy Paste: [[2503.14004]] Predicting Human Choice Between Textually Described Lotteries(https://arxiv.org/abs/2503.14004)
Keywords: large language model
Abstract: Predicting human decision-making under risk and uncertainty is a long-standing challenge in cognitive science, economics, and AI. While prior research has focused on numerically described lotteries, real-world decisions often rely on textual descriptions. This study conducts the first large-scale exploration of human decision-making in such tasks using a large dataset of one-shot binary choices between textually described lotteries. We evaluate multiple computational approaches, including fine-tuning Large Language Models (LLMs), leveraging embeddings, and integrating behavioral theories of choice under risk. Our results show that fine-tuned LLMs, specifically RoBERTa and GPT-4o, outperform hybrid models that incorporate behavioral theory, challenging established methods in numerical settings. These findings highlight fundamental differences in how textual and numerical information influence decision-making and underscore the need for new modeling strategies to bridge this gap.

Title: Securing Automated Insulin Delivery Systems: A Review of Security Threats and Protectives Strategies

Authors: Yuchen Niu, Siew-Kei Lam
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.14006
Pdf URL: https://arxiv.org/pdf/2503.14006
Copy Paste: [[2503.14006]] Securing Automated Insulin Delivery Systems: A Review of Security Threats and Protectives Strategies(https://arxiv.org/abs/2503.14006)
Keywords: secure, security, protect, attack
Abstract: Automated insulin delivery (AID) systems have emerged as a significant technological advancement in diabetes care. These systems integrate a continuous glucose monitor, an insulin pump, and control algorithms to automate insulin delivery, reducing the burden of self-management and offering enhanced glucose control. However, the increasing reliance on wireless connectivity and software control has exposed AID systems to critical security risks that could result in life-threatening treatment errors. This review first presents a comprehensive examination of the security landscape, covering technical vulnerabilities, legal frameworks, and commercial product considerations, and an analysis of existing research on attack vectors, defence mechanisms, as well as evaluation methods and resources for AID systems. Despite recent advancements, several open challenges remain in achieving secure AID systems, particularly in standardising security evaluation frameworks and developing comprehensive, lightweight, and adaptive defence strategies. Because AID systems are among the most widely adopted and extensively studied physiologic closed-loop control systems, this review serves as a valuable reference for understanding security challenges and solutions applicable to analogous medical systems.

Title: LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection

Authors: Wei Lu, Si-Bao Chen, Hui-Dong Li, Qing-Ling Shu, Chris H. Q. Ding, Jin Tang, Bin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14012
Pdf URL: https://arxiv.org/pdf/2503.14012
Copy Paste: [[2503.14012]] LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection(https://arxiv.org/abs/2503.14012)
Keywords: robust, fair
Abstract: Remote sensing object detection (RSOD) faces formidable challenges in complex visual environments. Aerial and satellite images inherently suffer from limitations such as low spatial resolution, sensor noise, blurred objects, low-light degradation, and partial occlusions. These degradation factors collectively compromise the feature discriminability in detection models, resulting in three key issues: (1) reduced contrast that hampers foreground-background separation, (2) structural discontinuities in edge representations, and (3) ambiguous feature responses caused by variations in illumination. These collectively weaken model robustness and deployment feasibility. To address these challenges, we propose LEGNet, a lightweight network that incorporates a novel edge-Gaussian aggregation (EGA) module specifically designed for low-quality remote sensing images. Our key innovation lies in the synergistic integration of Scharr operator-based edge priors with uncertainty-aware Gaussian modeling: (a) The orientation-aware Scharr filters preserve high-frequency edge details with rotational invariance; (b) The uncertainty-aware Gaussian layers probabilistically refine low-confidence features through variance estimation. This design enables precision enhancement while maintaining architectural simplicity. Comprehensive evaluations across four RSOD benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0) and a UAV-view dataset (VisDrone2019) demonstrate significant improvements. LEGNet achieves state-of-the-art performance across five benchmark datasets while ensuring computational efficiency, making it well-suited for deployment on resource-constrained edge devices in real-world remote sensing applications. The code is available at this https URL.
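
The abstract above builds its edge priors on the Scharr operator. The following is a minimal sketch of the standard 3x3 Scharr kernels applied to a small grayscale image in plain Python. It illustrates only the classical filter. It is not the paper's EGA module or its orientation-aware variant, and the function name `scharr_magnitude` is hypothetical.

```python
import math

# Standard 3x3 Scharr kernels for horizontal (x) and vertical (y) derivatives.
SCHARR_X = [[-3, 0, 3],
            [-10, 0, 10],
            [-3, 0, 3]]
SCHARR_Y = [[-3, -10, -3],
            [0, 0, 0],
            [3, 10, 3]]


def scharr_magnitude(image):
    """Gradient magnitude over the valid (unpadded) region of a 2D list of numbers."""
    height, width = len(image), len(image[0])
    result = []
    for row in range(1, height - 1):
        out_row = []
        for col in range(1, width - 1):
            gx = gy = 0.0
            # Cross-correlate the 3x3 neighbourhood with both kernels.
            for dr in range(3):
                for dc in range(3):
                    pixel = image[row + dr - 1][col + dc - 1]
                    gx += SCHARR_X[dr][dc] * pixel
                    gy += SCHARR_Y[dr][dc] * pixel
            out_row.append(math.hypot(gx, gy))
        result.append(out_row)
    return result


# A vertical step edge: dark on the left, bright on the right.
step_edge = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 1, 1]]
print(scharr_magnitude(step_edge))
```

Flat regions produce a zero response, while pixels next to the step produce a strong one, which is the property an edge prior relies on.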

Title: Boosting Semi-Supervised Medical Image Segmentation via Masked Image Consistency and Discrepancy Learning

Authors: Pengcheng Zhou, Lantian Zhang, Wei Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14013
Pdf URL: https://arxiv.org/pdf/2503.14013
Copy Paste: [[2503.14013]] Boosting Semi-Supervised Medical Image Segmentation via Masked Image Consistency and Discrepancy Learning(https://arxiv.org/abs/2503.14013)
Keywords: robust, segmentation
Abstract: Semi-supervised learning is of great significance in medical image segmentation by exploiting unlabeled data. Among its strategies, the co-training framework is prominent. However, previous co-training studies predominantly concentrate on network initialization variances and pseudo-label generation, while overlooking the equilibrium between information interchange and model diversity preservation. In this paper, we propose the Masked Image Consistency and Discrepancy Learning (MICD) framework with three key modules. The Masked Cross Pseudo Consistency (MCPC) module enriches context perception and small sample learning via pseudo-labeling across masked-input branches. The Cross Feature Consistency (CFC) module fortifies information exchange and model robustness by ensuring decoder feature consistency. The Cross Model Discrepancy (CMD) module utilizes EMA teacher networks to oversee outputs and preserve branch diversity. Together, these modules address existing limitations by focusing on fine-grained local information and maintaining diversity in a heterogeneous framework. Experiments on two public medical image datasets, AMOS and Synapse, demonstrate that our approach outperforms state-of-the-art methods.

Title: MP-GUI: Modality Perception with MLLMs for GUI Understanding

Authors: Ziwei Wang, Weizhi Chen, Leyang Yang, Sheng Zhou, Shengchu Zhao, Hanbei Zhan, Jiongchao Jin, Liangcheng Li, Zirui Shao, Jiajun Bu
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.14021
Pdf URL: https://arxiv.org/pdf/2503.14021
Copy Paste: [[2503.14021]] MP-GUI: Modality Perception with MLLMs for GUI Understanding(https://arxiv.org/abs/2503.14021)
Keywords: privacy, large language model
Abstract: Graphical user interfaces (GUIs) have become integral to modern society, making GUI understanding crucial for human-centric systems. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs), although already proficient in processing graphical and textual components, struggle with GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers that extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, using a spatial structure refinement strategy, and adaptively combines them via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatic data collection. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.

Title: Synthetic Data Generation Using Large Language Models: Advances in Text and Code

Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14023
Pdf URL: https://arxiv.org/pdf/2503.14023
Copy Paste: [[2503.14023]] Synthetic Data Generation Using Large Language Models: Advances in Text and Code(https://arxiv.org/abs/2503.14023)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code, emphasizing prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We show how these methods enrich low-resource tasks such as classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair, by enabling automated verification of functional correctness. Alongside potential benefits like cost-effectiveness, broad coverage, and controllable diversity, we address challenges such as factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification. Proposed mitigations include filtering and weighting outputs and reinforcement learning with execution feedback for code. We conclude with open research directions like automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, highlighting the importance of LLM-generated synthetic data in advancing AI while emphasizing ethical and quality safeguards.

Title: Uncertainty-Aware Global-View Reconstruction for Multi-View Multi-Label Feature Selection

Authors: Pingting Hao, Kunpeng Liu, Wanfu Gao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14024
Pdf URL: https://arxiv.org/pdf/2503.14024
Copy Paste: [[2503.14024]] Uncertainty-Aware Global-View Reconstruction for Multi-View Multi-Label Feature Selection(https://arxiv.org/abs/2503.14024)
Keywords: segmentation
Abstract: In recent years, multi-view multi-label learning (MVML) has gained popularity due to its close resemblance to real-world scenarios. However, the challenge of selecting informative features to ensure both performance and efficiency remains a significant question in MVML. Existing methods often extract information separately from the consistency part and the complementary part, which may result in noise due to unclear segmentation. In this paper, we propose a unified model constructed from the perspective of global-view reconstruction. Additionally, while feature selection methods can discern the importance of features, they typically overlook the uncertainty of samples, which is prevalent in realistic scenarios. To address this, we incorporate the perception of sample uncertainty during the reconstruction process to enhance trustworthiness. Thus, the global-view is reconstructed through the graph structure between samples, sample confidence, and the view relationship. The accurate mapping is established between the reconstructed view and the label matrix. Experimental results demonstrate the superior performance of our method on multi-view datasets.

Title: Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting

Authors: Runsong Zhu, Shi Qiu, Zhengzhe Liu, Ka-Hei Hui, Qianyi Wu, Pheng-Ann Heng, Chi-Wing Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14029
Pdf URL: https://arxiv.org/pdf/2503.14029
Copy Paste: [[2503.14029]] Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting(https://arxiv.org/abs/2503.14029)
Keywords: robust, segmentation
Abstract: Lifting multi-view 2D instance segmentation to a radiance field has proven to be effective to enhance 3D understanding. Existing methods rely on direct matching for end-to-end lifting, yielding inferior results; or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift, that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, achieving effective codebook learning is non-trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results show that our Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at this https URL.

Title: A Revisit to the Decoder for Camouflaged Object Detection

Authors: Seung Woo Ko, Joopyo Hong, Suyoung Kim, Seungjai Bang, Sungzoon Cho, Nojun Kwak, Hyung-Sin Kim, Joonseok Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14035
Pdf URL: https://arxiv.org/pdf/2503.14035
Copy Paste: [[2503.14035]] A Revisit to the Decoder for Camouflaged Object Detection(https://arxiv.org/abs/2503.14035)
Keywords: segmentation
Abstract: Camouflaged object detection (COD) aims to generate a fine-grained segmentation map of camouflaged objects hidden in their background. Due to the hidden nature of camouflaged objects, it is essential for the decoder to be tailored to effectively extract proper features of camouflaged objects and extra-carefully generate their complex boundaries. In this paper, we propose a novel architecture that augments the prevalent decoding strategy in COD with Enrich Decoder and Retouch Decoder, which help to generate a fine-grained segmentation map. Specifically, the Enrich Decoder amplifies the channels of features that are important for COD using channel-wise attention. The Retouch Decoder further refines the segmentation maps by spatially attending to important pixels, such as the boundary regions. With extensive experiments, we demonstrate that our model, ENTO, shows superior performance using various encoders, with the two novel components playing their unique roles that are mutually complementary.

Title: Intra and Inter Parser-Prompted Transformers for Effective Image Restoration

Authors: Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14037
Pdf URL: https://arxiv.org/pdf/2503.14037
Copy Paste: [[2503.14037]] Intra and Inter Parser-Prompted Transformers for Effective Image Restoration(https://arxiv.org/abs/2503.14037)
Keywords: transformer
Abstract: We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. The IntraPPA re-considers cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. Conversely, the InterPPA initially fuses restoration features with those of the parser, followed by formulating these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration within pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.

Title: Learning on LLM Output Signatures for gray-box LLM Behavior Analysis

Authors: Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14043
Pdf URL: https://arxiv.org/pdf/2503.14043
Copy Paste: [[2503.14043]] Learning on LLM Output Signatures for gray-box LLM Behavior Analysis(https://arxiv.org/abs/2503.14043)
Keywords: transformer, large language model
Abstract: Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited, particularly in detecting data contamination and hallucinations. While recently proposed probing techniques provide insights through activation analysis, they require "white-box" access to model internals, often unavailable. Current "gray-box" approaches typically analyze only the probability of the actual tokens in the sequence with simple task-specific heuristics. Importantly, these methods overlook the rich information contained in the full token distribution at each processing step. To address these limitations, we propose that gray-box analysis should leverage the complete observable output of LLMs, consisting of both the previously used token probabilities as well as the complete token distribution sequences - a unified data type we term LOS (LLM Output Signature). To this end, we develop a transformer-based approach to process LOS that theoretically guarantees approximation of existing techniques while enabling more nuanced analysis. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings, significantly outperforming existing baselines. Furthermore, it demonstrates strong transfer capabilities across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. Our code is available at: this https URL.

Title: ON-Traffic: An Operator Learning Framework for Online Traffic Flow Estimation and Uncertainty Quantification from Lagrangian Sensors

Authors: Jake Rap, Amritam Das
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2503.14053
Pdf URL: https://arxiv.org/pdf/2503.14053
Copy Paste: [[2503.14053]] ON-Traffic: An Operator Learning Framework for Online Traffic Flow Estimation and Uncertainty Quantification from Lagrangian Sensors(https://arxiv.org/abs/2503.14053)
Keywords: robust
Abstract: Accurate traffic flow estimation and prediction are critical for the efficient management of transportation systems, particularly under increasing urbanization. Traditional methods relying on static sensors often suffer from limited spatial coverage, while probe vehicles provide richer, albeit sparse and irregular data. This work introduces ON-Traffic, a novel deep operator Network and a receding horizon learning-based framework tailored for online estimation of spatio-temporal traffic state along with quantified uncertainty by using measurements from moving probe vehicles and downstream boundary inputs. Our framework is evaluated in both numerical and simulation datasets, showcasing its ability to handle irregular, sparse input data, adapt to time-shifted scenarios, and provide well-calibrated uncertainty estimates. The results demonstrate that the model captures complex traffic phenomena, including shockwaves and congestion propagation, while maintaining robustness to noise and sensor dropout. These advancements present a significant step toward online, adaptive traffic management systems.

Title: Fast Autoregressive Video Generation with Diagonal Decoding

Authors: Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14070
Pdf URL: https://arxiv.org/pdf/2503.14070
Copy Paste: [[2503.14070]] Fast Autoregressive Video Generation with Diagonal Decoding(https://arxiv.org/abs/2503.14070)
Keywords: transformer, generative
Abstract: Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
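
The diagonal ordering described in the abstract can be illustrated with a toy schedule for a single frame. This is only a sketch under assumptions: the function name `diagonal_schedule` and the rule "tokens with the same row + column index share a step" are illustrative, and the paper's actual DiagD algorithm (including its overlap across consecutive frames) may differ.

```python
def diagonal_schedule(height, width):
    """Group token positions (row, col) of one frame by anti-diagonal.

    Assumption for illustration: all tokens with the same row + col
    value are decoded together in one parallel step.
    """
    steps = [[] for _ in range(height + width - 1)]
    for row in range(height):
        for col in range(width):
            steps[row + col].append((row, col))
    return steps


# A 3 x 4 token grid: sequential decoding would take 12 steps,
# while this schedule needs height + width - 1 steps.
schedule = diagonal_schedule(3, 4)
for step, positions in enumerate(schedule):
    print(step, positions)
```

In this toy setting the step count grows with height + width rather than height * width, which is the kind of saving a diagonal order makes possible.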

Title: Theoretical Foundation of Flow-Based Time Series Generation: Provable Approximation, Generalization, and Efficiency

Authors: Jiangxuan Long, Zhao Song, Chiwun Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14076
Pdf URL: https://arxiv.org/pdf/2503.14076
Copy Paste: [[2503.14076]] Theoretical Foundation of Flow-Based Time Series Generation: Provable Approximation, Generalization, and Efficiency(https://arxiv.org/abs/2503.14076)
Keywords: diffusion, transformer, generative
Abstract: Recent studies suggest utilizing generative models instead of traditional auto-regressive algorithms for time series forecasting (TSF) tasks. These non-auto-regressive approaches involving different generative methods, including GAN, Diffusion, and Flow Matching for time series, have empirically demonstrated high-quality generation capability and accuracy. However, we still lack an appropriate understanding of how they achieve approximation and generalization. This paper presents the first theoretical framework from the perspective of flow-based generative models to address this gap in knowledge. In particular, we provide our insights with strict guarantees from three perspectives: $\textbf{Approximation}$, $\textbf{Generalization}$ and $\textbf{Efficiency}$. In detail, our analysis makes the following contributions: $\bullet$ By assuming a general data model, the fitting of the flow-based generative models is confirmed to converge to arbitrary error under the universal approximation of Diffusion Transformer (DiT). $\bullet$ By introducing a polynomial-based regularization for flow matching, the generalization error can be bounded owing to the generalization of polynomial approximation. $\bullet$ Treating the sampling for generation as an optimization process, we demonstrate its fast convergence under standard first-order gradient descent updates of some objective.

Title: Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia

Authors: Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, Detlef Stolten
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14090
Pdf URL: https://arxiv.org/pdf/2503.14090
Copy Paste: [[2503.14090]] Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia(https://arxiv.org/abs/2503.14090)
Keywords: extraction
Abstract: To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38,738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.

Title: Condensing Action Segmentation Datasets via Generative Network Inversion

Authors: Guodong Ding, Rongyu Chen, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14112
Pdf URL: https://arxiv.org/pdf/2503.14112
Copy Paste: [[2503.14112]] Condensing Action Segmentation Datasets via Generative Network Inversion(https://arxiv.org/abs/2503.14112)
Keywords: generative, segmentation
Abstract: This work presents the first condensation approach for procedural video datasets used in temporal action segmentation (TAS). We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, significantly reducing storage across the temporal and channel dimensions. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets and achieving competitive performance. Specifically, on the Breakfast dataset, our approach reduces storage by over 500$\times$ while retaining 83% of the performance compared to training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state-of-the-art.

Title: SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models

Authors: Subhadeep Koley, Tapas Kumar Dutta, Aneeshan Sain, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14129
Pdf URL: https://arxiv.org/pdf/2503.14129
Copy Paste: [[2503.14129]] SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models(https://arxiv.org/abs/2503.14129)
Keywords: diffusion, segmentation
Abstract: While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.

Title: CARE: A QLoRA-Fine Tuned Multi-Domain Chatbot With Fast Learning On Minimal Hardware

Authors: Ankit Dutta, Nabarup Ghosh, Ankush Chatterjee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14136
Pdf URL: https://arxiv.org/pdf/2503.14136
Copy Paste: [[2503.14136]] CARE: A QLoRA-Fine Tuned Multi-Domain Chatbot With Fast Learning On Minimal Hardware(https://arxiv.org/abs/2503.14136)
Keywords: large language model
Abstract: Large Language models have demonstrated excellent domain-specific question-answering capabilities when finetuned with a particular dataset of that specific domain. However, fine-tuning the models requires a significant amount of training time and a considerable amount of hardware. In this work, we propose CARE (Customer Assistance and Response Engine), a lightweight model made by fine-tuning Phi3.5-mini on very minimal hardware and data, designed to handle queries primarily across three domains: telecommunications support, medical support, and banking support. For telecommunications and banking, the chatbot addresses issues and problems faced by customers regularly in the above-mentioned domains. In the medical domain, CARE provides preliminary support by offering basic diagnoses and medical suggestions that a user might take before consulting a healthcare professional. Since CARE is built on Phi3.5-mini, it can be used even on mobile devices, increasing its usability. Our research also shows that CARE performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions.

Title: Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Authors: Zining Wang, Tongkun Guan, Pei Fu, Chen Duan, Qianyi Jiang, Zhentao Guo, Shan Guo, Junfeng Luo, Wei Shen, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14140
Pdf URL: https://arxiv.org/pdf/2503.14140
Copy Paste: [[2503.14140]] Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding(https://arxiv.org/abs/2503.14140)
Keywords: large language model
Abstract: Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at this https URL.

Title: Comparative and Interpretative Analysis of CNN and Transformer Models in Predicting Wildfire Spread Using Remote Sensing Data

Authors: Yihang Zhou, Ruige Kong, Zhengsen Xu, Linlin Xu, Sibo Cheng
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14150
Pdf URL: https://arxiv.org/pdf/2503.14150
Copy Paste: [[2503.14150]] Comparative and Interpretative Analysis of CNN and Transformer Models in Predicting Wildfire Spread Using Remote Sensing Data(https://arxiv.org/abs/2503.14150)
Keywords: interpretability, explainability, transformer
Abstract: Facing the escalating threat of global wildfires, numerous computer vision techniques using remote sensing data have been applied in this area. However, the selection of deep learning methods for wildfire prediction remains uncertain due to the lack of comparative analysis in a quantitative and explainable manner, crucial for improving prevention measures and refining models. This study aims to thoroughly compare the performance, efficiency, and explainability of four prevalent deep learning architectures: Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet. Employing a real-world dataset that includes nearly a decade of remote sensing data from California, U.S., these models predict the spread of wildfires for the following day. Through detailed quantitative comparison analysis, we discovered that Transformer-based Swin-UNet and UNet generally outperform Autoencoder and ResNet, particularly due to the advanced attention mechanisms in Transformer-based Swin-UNet and the efficient use of skip connections in both UNet and Transformer-based Swin-UNet, which contribute to superior predictive accuracy and model interpretability. We then applied XAI techniques to all four models, which not only enhances the clarity and trustworthiness of the models but also promotes focused improvements in wildfire prediction capabilities. The XAI analysis reveals that UNet and Transformer-based Swin-UNet are able to focus on critical features such as 'Previous Fire Mask', 'Drought', and 'Vegetation' more effectively than the other two models, while also maintaining balanced attention to the remaining features, leading to their superior performance. The insights from our thorough comparative analysis offer substantial implications for future model design and also provide guidance for model selection in different scenarios.

Title: Speculative Decoding for Verilog: Speed and Quality, All in One

Authors: Changran Xu, Yi Liu, Yunhao Zhou, Shan Huang, Ningyi Xu, Qiang Xu
Subjects: cs.LG, cs.AR, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14153
Pdf URL: https://arxiv.org/pdf/2503.14153
Copy Paste: [[2503.14153]] Speculative Decoding for Verilog: Speed and Quality, All in One(https://arxiv.org/abs/2503.14153)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.

Title: RBFIM: Perceptual Quality Assessment for Compressed Point Clouds Using Radial Basis Function Interpolation

Authors: Zhang Chen, Shuai Wan, Siyu Ren, Fuzheng Yang, Mengting Yu, Junhui Hou
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14154
Pdf URL: https://arxiv.org/pdf/2503.14154
Copy Paste: [[2503.14154]] RBFIM: Perceptual Quality Assessment for Compressed Point Clouds Using Radial Basis Function Interpolation(https://arxiv.org/abs/2503.14154)
Keywords: robust
Abstract: One of the main challenges in point cloud compression (PCC) is how to evaluate the perceived distortion so that the codec can be optimized for perceptual quality. Current standard practices in PCC highlight a primary issue: while single-feature metrics are widely used to assess compression distortion, the classic method of searching point-to-point nearest neighbors frequently fails to adequately build precise correspondences between point clouds, resulting in an ineffective capture of human perceptual features. To overcome the related limitations, we propose a novel assessment method called RBFIM, utilizing radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function for the distorted point cloud. By substituting the geometry coordinates of the original point cloud into the feature function, we obtain the bijective sets of point features. This enables an establishment of precise corresponding features between distorted and original point clouds and significantly improves the accuracy of quality assessments. Moreover, this method avoids the complexity caused by bidirectional searches. Extensive experiments on multiple subjective quality datasets of compressed point clouds demonstrate that our RBFIM excels in addressing human perception tasks, thereby providing robust support for PCC optimization efforts.
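
The interpolation step described in the abstract can be sketched with NumPy alone. This is a minimal illustration under assumptions, not the RBFIM metric itself: the Gaussian kernel, the shape parameter `epsilon`, the helper names, and the tiny hand-made point clouds are all hypothetical choices.

```python
import numpy as np


def gaussian_kernel(a, b, epsilon=1.0):
    """Pairwise Gaussian RBF values between point sets a (n, 3) and b (m, 3)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-epsilon * sq_dist)


def fit_rbf(points, features, epsilon=1.0):
    """Solve for weights so the interpolant reproduces features at points."""
    return np.linalg.solve(gaussian_kernel(points, points, epsilon), features)


def evaluate_rbf(points, weights, queries, epsilon=1.0):
    """Evaluate the continuous feature function at arbitrary query coordinates."""
    return gaussian_kernel(queries, points, epsilon) @ weights


# Hypothetical distorted cloud with one scalar feature per point.
distorted = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
distorted_feat = np.array([0.2, 0.5, 0.1, 0.9])

# Hypothetical original cloud whose coordinates are substituted in.
original = np.array([[0.1, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.1, 0.0]])

weights = fit_rbf(distorted, distorted_feat)
resampled = evaluate_rbf(distorted, weights, original)
print(resampled)
```

Each original point receives exactly one interpolated feature value, so the two feature sets can be compared element by element without a nearest-neighbor search in either direction.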

Title: Towards Harmless Multimodal Assistants with Blind Preference Optimization

Authors: Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14189
Pdf URL: https://arxiv.org/pdf/2503.14189
Copy Paste: [[2503.14189]] Towards Harmless Multimodal Assistants with Blind Preference Optimization(https://arxiv.org/abs/2503.14189)
Keywords: defense, robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at this https URL.

Title: RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images

Authors: Junjin Xiao, Qing Zhang, Yonewei Nie, Lei Zhu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14198
Pdf URL: https://arxiv.org/pdf/2503.14198
Copy Paste: [[2503.14198]] RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images(https://arxiv.org/abs/2503.14198)
Keywords: robust
Abstract: This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with little overlap and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between the SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at this https URL.

Title: AI-Driven Diabetic Retinopathy Diagnosis Enhancement through Image Processing and Salp Swarm Algorithm-Optimized Ensemble Network

Authors: Saif Ur Rehman Khan, Muhammad Nabeel Asim, Sebastian Vollmer, Andreas Dengel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14209
Pdf URL: https://arxiv.org/pdf/2503.14209
Copy Paste: [[2503.14209]] AI-Driven Diabetic Retinopathy Diagnosis Enhancement through Image Processing and Salp Swarm Algorithm-Optimized Ensemble Network(https://arxiv.org/abs/2503.14209)
Keywords: extraction
Abstract: Diabetic retinopathy is a leading cause of blindness in diabetic patients and early detection plays a crucial role in preventing vision loss. Traditional diagnostic methods are often time-consuming and prone to errors. The emergence of deep learning techniques has provided innovative solutions to improve diagnostic efficiency. However, single deep learning models frequently face issues related to extracting key features from complex retinal images. To handle this problem, we present an effective ensemble method for DR diagnosis comprising four main phases: image pre-processing, selection of backbone pre-trained models, feature enhancement, and optimization. Our methodology begins with the pre-processing phase, where we apply CLAHE to enhance image contrast, and Gamma correction is then used to adjust the brightness for better feature recognition. We then apply Discrete Wavelet Transform (DWT) for image fusion by combining multi-resolution details to create a richer dataset. Then, we selected the three best-performing pre-trained models, namely DenseNet169, MobileNetV1, and Xception, for diverse feature extraction. To further improve feature extraction, an improved residual block is integrated into each model. Finally, the predictions from these base models are aggregated using a weighted ensemble approach, with the weights optimized using the Salp Swarm Algorithm (SSA). SSA intelligently explores the weight space and finds the optimal configuration of base architectures to maximize the performance of the ensemble model. The proposed model is evaluated on the multiclass Kaggle APTOS 2019 dataset and obtained 88.52% accuracy.
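
The final aggregation step, a weighted ensemble over the base models' predictions, can be sketched as follows. This is an illustration under assumptions: the probabilities and weights are made-up numbers, the helper names are hypothetical, and the weights are fixed by hand here, whereas the paper optimizes them with SSA (not reproduced).

```python
def weighted_ensemble(prob_sets, weights):
    """Combine per-model class probabilities with normalized weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    n_classes = len(prob_sets[0])
    return [
        sum(w * probs[c] for w, probs in zip(normalized, prob_sets))
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical class probabilities from three base models for one image.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
# Hand-picked weights standing in for the optimized ones.
ensemble_weights = [0.5, 0.25, 0.25]

combined = weighted_ensemble(model_probs, ensemble_weights)
predicted_class = argmax(combined)
print(combined, predicted_class)
```

Normalizing the weights keeps the combined vector a valid probability distribution, so any optimizer searching the weight space only has to propose non-negative values.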

Title: Decision Tree Induction Through LLMs via Semantically-Aware Evolution

Authors: Tennison Liu, Nicolas Huynh, Mihaela van der Schaar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14217
Pdf URL: https://arxiv.org/pdf/2503.14217
Copy Paste: [[2503.14217]] Decision Tree Induction Through LLMs via Semantically-Aware Evolution(https://arxiv.org/abs/2503.14217)
Keywords: robust, interpretability, generative, large language model
Abstract: Decision trees are a crucial class of models offering robust predictive performance and inherent interpretability across various domains, including healthcare, finance, and logistics. However, current tree induction methods often face limitations such as suboptimal solutions from greedy methods or prohibitive computational costs and limited applicability of exact optimization approaches. To address these challenges, we propose an evolutionary optimization method for decision tree induction based on genetic programming (GP). Our key innovation is the integration of semantic priors and domain-specific knowledge about the search space into the optimization algorithm. To this end, we introduce $\texttt{LLEGO}$, a framework that incorporates semantic priors into genetic search operators through the use of Large Language Models (LLMs), thereby enhancing search efficiency and targeting regions of the search space that yield decision trees with superior generalization performance. This is operationalized through novel genetic operators that work with structured natural language prompts, effectively utilizing LLMs as conditional generative models and sources of semantic knowledge. Specifically, we introduce $\textit{fitness-guided}$ crossover to exploit high-performing regions, and $\textit{diversity-guided}$ mutation for efficient global exploration of the search space. These operators are controlled by corresponding hyperparameters that enable a more nuanced balance between exploration and exploitation across the search space. Empirically, we demonstrate across various benchmarks that $\texttt{LLEGO}$ evolves superior-performing trees compared to existing tree induction methods, and exhibits significantly more efficient search performance compared to conventional GP approaches.

Title: Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis

Authors: Yizhou Li, Yusuke Monno, Masatoshi Okutomi, Yuuichi Tanaka, Seiichi Kataoka, Teruaki Kosiba
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14219
Pdf URL: https://arxiv.org/pdf/2503.14219
Copy Paste: [[2503.14219]] Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis(https://arxiv.org/abs/2503.14219)
Keywords: segmentation
Abstract: Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.

Title: Panoramic Distortion-Aware Tokenization for Person Detection and Localization Using Transformers in Overhead Fisheye Images

Authors: Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14228
Pdf URL: https://arxiv.org/pdf/2503.14228
Copy Paste: [[2503.14228]] Panoramic Distortion-Aware Tokenization for Person Detection and Localization Using Transformers in Overhead Fisheye Images(https://arxiv.org/abs/2503.14228)
Keywords: transformer
Abstract: Person detection methods are used widely in applications including visual surveillance, pedestrian detection, and robotics. However, accurate detection of persons from overhead fisheye images remains an open challenge because of factors including person rotation and small-sized persons. To address the person rotation problem, we convert the fisheye images into panoramic images. For smaller people, we focused on the geometry of the panoramas. Conventional detection methods tend to focus on larger people because these larger people yield large significant areas for feature maps. In equirectangular panoramic images, we find that a person's height decreases linearly near the top of the images. Using this finding, we leverage the significance values and aggregate tokens that are sorted based on these values to balance the significant areas. In this leveraging process, we introduce panoramic distortion-aware tokenization. This tokenization procedure divides a panoramic image using self-similarity figures that enable determination of optimal divisions without gaps, and we leverage the maximum significant values in each tile of token groups to preserve the significant areas of smaller people. To achieve higher detection accuracy, we propose a person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods when applied to large-scale datasets.

Title: Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Authors: Ziyao Ling, Giovanni Delnevo, Paola Salomoni, Silvia Mirri
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14231
Pdf URL: https://arxiv.org/pdf/2503.14231
Copy Paste: [[2503.14231]] Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties(https://arxiv.org/abs/2503.14231)
Keywords: robust
Abstract: Chinese porcelain holds immense historical and cultural value, making its accurate classification essential for archaeological research and cultural heritage preservation. Traditional classification methods rely heavily on expert analysis, which is time-consuming, subjective, and difficult to scale. This paper explores the application of DL and transfer learning techniques to automate the classification of porcelain artifacts across four key attributes: dynasty, glaze, ware, and type. We evaluate four Convolutional Neural Networks (CNNs) - ResNet50, MobileNetV2, VGG16, and InceptionV3 - comparing their performance with and without pre-trained weights. Our results demonstrate that transfer learning significantly enhances classification accuracy, particularly for complex tasks like type classification, where models trained from scratch exhibit lower performance. MobileNetV2 and ResNet50 consistently achieve high accuracy and robustness across all tasks, while VGG16 struggles with more diverse classifications. We further discuss the impact of dataset limitations and propose future directions, including domain-specific pre-training, integration of attention mechanisms, explainable AI methods, and generalization to other cultural artifacts.

Title: CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models

Authors: Yuyang Xue, Edward Moroshko, Feng Chen, Steven McDonagh, Sotirios A. Tsaftaris
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14232
Pdf URL: https://arxiv.org/pdf/2503.14232
Copy Paste: [[2503.14232]] CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.14232)
Keywords: diffusion, large language model
Abstract: Text-to-Image diffusion models can produce undesirable content that necessitates concept erasure techniques. However, existing methods struggle with under-erasure, leaving residual traces of targeted concepts, or over-erasure, mistakenly eliminating unrelated but visually similar concepts. To address these limitations, we introduce CRCE, a novel concept erasure framework that leverages Large Language Models to identify both semantically related concepts that should be erased alongside the target and distinct concepts that should be preserved. By explicitly modeling coreferential and retained concepts semantically, CRCE enables more precise concept removal, without unintended erasure. Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks.

Title: Make Your Training Flexible: Towards Deployment-Efficient Video Models

Authors: Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14237
Pdf URL: https://arxiv.org/pdf/2503.14237
Copy Paste: [[2503.14237]] Make Your Training Flexible: Towards Deployment-Efficient Video Models(https://arxiv.org/abs/2503.14237)
Keywords: robust
Abstract: Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at this https URL.

Title: Predicting Cardiopulmonary Exercise Testing Outcomes in Congenital Heart Disease Through Multi-modal Data Integration and Geometric Learning

Authors: Muhammet Alkan, Gruschen Veldtman, Fani Deligianni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14239
Pdf URL: https://arxiv.org/pdf/2503.14239
Copy Paste: [[2503.14239]] Predicting Cardiopulmonary Exercise Testing Outcomes in Congenital Heart Disease Through Multi-modal Data Integration and Geometric Learning(https://arxiv.org/abs/2503.14239)
Keywords: robust
Abstract: Cardiopulmonary exercise testing (CPET) provides a comprehensive assessment of functional capacity by measuring key physiological variables including oxygen consumption ($VO_2$), carbon dioxide production ($VCO_2$), and pulmonary ventilation ($VE$) during exercise. Previous research has established that parameters such as peak $VO_2$ and $VE/VCO_2$ ratio serve as robust predictors of mortality risk in chronic heart failure patients. In this study, we leverage CPET variables as surrogate mortality endpoints for patients with Congenital Heart Disease (CHD). To our knowledge, this represents the first successful implementation of an advanced machine learning approach that predicts CPET outcomes by integrating electrocardiograms (ECGs) with information derived from clinical letters. Our methodology began with extracting unstructured patient information (including intervention history, diagnoses, and medication regimens) from clinical letters using natural language processing techniques, organizing this data into a structured database. We then digitized ECGs to obtain quantifiable waveforms and established comprehensive data linkages. The core innovation of our approach lies in exploiting the Riemannian geometric properties of covariance matrices derived from both 12-lead ECGs and clinical text data to develop robust regression and classification models. Through extensive ablation studies, we demonstrated that the integration of ECG signals with clinical documentation, enhanced by covariance augmentation techniques in Riemannian space, consistently produced superior predictive performance compared to conventional approaches.

Title: Deep Unsupervised Segmentation of Log Point Clouds

Authors: Fedor Zolotarev, Tuomas Eerola, Tomi Kauppi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14244
Pdf URL: https://arxiv.org/pdf/2503.14244
Copy Paste: [[2503.14244]] Deep Unsupervised Segmentation of Log Point Clouds(https://arxiv.org/abs/2503.14244)
Keywords: transformer, segmentation
Abstract: In sawmills, it is essential to accurately measure the raw material, i.e. wooden logs, to optimise the sawing process. Earlier studies have shown that accurate predictions of the inner structure of the logs can be obtained using just surface point clouds produced by a laser scanner. This provides a cost-efficient and fast alternative to the X-ray CT-based measurement devices. An essential step in analysing log point clouds is segmentation, as it forms the basis for finding the fine surface details that provide the cues about the inner structure of the log. We propose a novel Point Transformer-based point cloud segmentation technique that learns to find the points belonging to the log surface in an unsupervised manner. This is obtained using a loss function that utilises the geometrical properties of a cylinder while taking into account the shape variation common in timber logs. We demonstrate the accuracy of the method on wooden logs, but the approach could also be utilised on other cylindrical objects.

Title: Trading-off Accuracy and Communication Cost in Federated Learning

Authors: Mattia Jacopo Villani, Emanuele Natale, Frederik Mallmann-Trenn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14246
Pdf URL: https://arxiv.org/pdf/2503.14246
Copy Paste: [[2503.14246]] Trading-off Accuracy and Communication Cost in Federated Learning(https://arxiv.org/abs/2503.14246)
Keywords: federate
Abstract: Leveraging the training-by-pruning paradigm introduced by Zhou et al., Isik et al. introduced a federated learning protocol that achieves a 34-fold reduction in communication cost. We achieve compression improvements of orders of magnitude over the state of the art. The central idea of our framework is to encode the network weights $\vec w$ by a vector of trainable parameters $\vec p$, such that $\vec w = Q\cdot \vec p$ where $Q$ is a carefully generated sparse random matrix (that remains fixed throughout training). In this framework, the previous work of Zhou et al. [NeurIPS'19] is retrieved when $Q$ is diagonal and $\vec p$ has the same dimension as $\vec w$. We instead show that $\vec p$ can effectively be chosen much smaller than $\vec w$, while retaining the same accuracy at the price of a decrease of the sparsity of $Q$. Since server and clients only need to share $\vec p$, such a trade-off leads to a substantial improvement in communication cost. Moreover, we provide theoretical insight into our framework and establish a novel link between training-by-sampling and random convex geometry.
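The encoding $\vec w = Q\cdot \vec p$ described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' construction: the Gaussian entries, the fixed number of non-zeros per row, and the use of a shared seed to let every party regenerate the same $Q$ are all hypothetical choices, as are the function names.

```python
import random

def make_sparse_q(n_weights, n_params, nnz_per_row, seed):
    """Build a fixed sparse random matrix Q, stored row by row.

    Each row is a list of (column, value) pairs. The seed stands in for
    whatever mechanism lets server and clients agree on the same Q.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_weights):
        cols = rng.sample(range(n_params), nnz_per_row)
        rows.append([(c, rng.gauss(0.0, 1.0)) for c in cols])
    return rows

def expand(q_rows, p):
    """Compute w = Q . p for the sparse row representation above."""
    return [sum(value * p[col] for col, value in row) for row in q_rows]

n_weights, n_params = 1000, 50
q = make_sparse_q(n_weights, n_params, nnz_per_row=3, seed=0)
p = [0.01 * i for i in range(n_params)]   # the only vector that is communicated
w = expand(q, p)                          # full weights, rebuilt locally
compression = n_weights / n_params        # values sent per round shrink 20x here
```

Because $Q$ stays fixed, only the 50 entries of `p` would travel between server and clients in this toy setup, while the 1000 network weights are reconstructed on each side.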

Title: Quantization-Free Autoregressive Action Transformer

Authors: Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, Claire Vernade
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14259
Pdf URL: https://arxiv.org/pdf/2503.14259
Copy Paste: [[2503.14259]] Quantization-Free Autoregressive Action Transformer(https://arxiv.org/abs/2503.14259)
Keywords: transformer, generative
Abstract: Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.

Title: DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal

Authors: Vaibhav Aggarwal, Ojasv Kamal, Abhinav Japesh, Zhijing Jin, Bernhard Schölkopf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14269
Pdf URL: https://arxiv.org/pdf/2503.14269
Copy Paste: [[2503.14269]] DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal(https://arxiv.org/abs/2503.14269)
Keywords: large language model
Abstract: Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents, that is faster and more effective at recovering from sub-optimal decisions compared to baselines. While traditional agents either follow linear trajectories or rely on random sampling for scaling compute, our approach DARS works by branching out a trajectory at certain key decision points by taking an alternative action given the history of the trajectory and execution feedback of the previous attempt from that point. We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.

Title: CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution

Authors: Runyi Li, Bin Chen, Jian Zhang, Radu Timofte
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14272
Pdf URL: https://arxiv.org/pdf/2503.14272
Copy Paste: [[2503.14272]] CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution(https://arxiv.org/abs/2503.14272)
Keywords: diffusion
Abstract: Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminary experiments, we observe that a linear combination of multiple models outperforms individual models, motivating us to harness the strengths of different models for a more effective trade-off. Based on this insight, we propose a distillation-based approach that leverages the geometric decomposition of both fidelity and realness, alongside the performance advantages of multiple teacher models, to strike a more balanced trade-off. Furthermore, we explore the controllability of this trade-off, enabling a flexible and adjustable super-resolution process, which we call CTSR (Controllable Trade-off Super-Resolution). Experiments conducted on several real-world image super-resolution benchmarks demonstrate that our method surpasses existing state-of-the-art approaches, achieving superior performance across both fidelity and realness metrics.

Title: Manual Labelling Artificially Inflates Deep Learning-Based Segmentation Performance on Closed Canopy: Validation Using TLS

Authors: Matthew J. Allen, Harry J. F. Owen, Stuart W. D. Grieve, Emily R. Lines
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14273
Pdf URL: https://arxiv.org/pdf/2503.14273
Copy Paste: [[2503.14273]] Manual Labelling Artificially Inflates Deep Learning-Based Segmentation Performance on Closed Canopy: Validation Using TLS(https://arxiv.org/abs/2503.14273)
Keywords: segmentation
Abstract: Monitoring forest dynamics at an individual tree scale is essential for accurately assessing ecosystem responses to climate change, yet traditional methods relying on field-based forest inventories are labor-intensive and limited in spatial coverage. Advances in remote sensing using drone-acquired RGB imagery combined with deep learning models have promised precise individual tree crown (ITC) segmentation; however, existing methods are frequently validated against human-annotated images, lacking rigorous independent ground truth. In this study, we generate high-fidelity validation labels from co-located Terrestrial Laser Scanning (TLS) data for drone imagery of mixed unmanaged boreal and Mediterranean forests. We evaluate the performance of two widely used deep learning ITC segmentation models - DeepForest (RetinaNet) and Detectree2 (Mask R-CNN) - on these data, and compare to performance on further Mediterranean forest data labelled manually. When validated against TLS-derived ground truth from Mediterranean forests, model performance decreased significantly compared to assessment based on hand-labelled data from an ecologically similar site (AP50: 0.094 vs. 0.670). Restricting evaluation to only canopy trees shrank this gap considerably (Canopy AP50: 0.365), although performance was still far lower than on similar hand-labelled data. Models also performed poorly on boreal forest data (AP50: 0.142), although again increasing when evaluated on canopy trees only (Canopy AP50: 0.308). Both models showed very poor localisation accuracy at stricter IoU thresholds, even when restricted to canopy trees (Max AP75: 0.051). Similar results have been observed in studies using aerial LiDAR data, suggesting fundamental limitations in aerial-based segmentation approaches in closed canopy forests.

Title: Free-Lunch Color-Texture Disentanglement for Stylized Image Generation

Authors: Jiang Qin, Senmao Li, Alexandra Gomez-Villa, Shiqi Yang, Yaxing Wang, Kai Wang, Joost van de Weijer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14275
Pdf URL: https://arxiv.org/pdf/2503.14275
Copy Paste: [[2503.14275]] Free-Lunch Color-Texture Disentanglement for Stylized Image Generation(https://arxiv.org/abs/2503.14275)
Keywords: diffusion
Abstract: Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.

Title: Anti-Tamper Radio meets Reconfigurable Intelligent Surface for System-Level Tamper Detection

Authors: Maryam Shaygan Tabar, Johannes Kortz, Paul Staat, Harald Elders-Boll, Christof Paar, Christian Zenger
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.14279
Pdf URL: https://arxiv.org/pdf/2503.14279
Copy Paste: [[2503.14279]] Anti-Tamper Radio meets Reconfigurable Intelligent Surface for System-Level Tamper Detection(https://arxiv.org/abs/2503.14279)
Keywords: security, protect, attack, robust
Abstract: Many computing systems need to be protected against physical attacks using active tamper detection based on sensors. One technical solution is to employ an ATR (Anti-Tamper Radio) approach, analyzing the radio wave propagation effects within a protected device to detect unauthorized physical alterations. However, ATR systems face key challenges in terms of susceptibility to signal manipulation attacks, limited reliability due to environmental noise, and regulatory constraints from wide bandwidth usage. In this work, we propose and experimentally evaluate an ATR system complemented by an RIS to dynamically reconfigure the wireless propagation environment. We show that this approach can enhance resistance against signal manipulation attacks, reduce bandwidth requirements from several GHz down to as low as 20 MHz, and improve robustness to environmental disturbances such as internal fan movements. Our work demonstrates that RIS integration can strengthen the ATR performance to enhance security, sensitivity, and robustness, recognizing the potential of smart radio environments for ATR-based tamper detection.

Title: XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants

Authors: Adam Štorek, Mukur Gupta, Noopur Bhatt, Aditya Gupta, Janie Kim, Prashast Srivastava, Suman Jana
Subjects: cs.CR, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2503.14281
Pdf URL: https://arxiv.org/pdf/2503.14281
Copy Paste: [[2503.14281]] XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants(https://arxiv.org/abs/2503.14281)
Keywords: security, defense, attack, steal
Abstract: AI coding assistants are widely used for tasks like code generation, bug detection, and comprehension. These tools now require large and complex contexts, automatically sourced from various origins (across files, projects, and contributors), forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant's outputs, potentially generating vulnerable code, overlooking flaws, or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is particularly challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these correlations since the semantics of the code remain correct, making it appear legitimate. This allows attackers to manipulate code assistants into producing incorrect outputs, including vulnerabilities or backdoors, while shifting the blame to the victim developer or tester. We introduce a novel, task-agnostic black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving an 83.09% attack success rate on average across five tasks and eleven models, including GPT-4o and Claude 3.5 Sonnet v2 used by many popular AI coding assistants. Furthermore, existing defenses, including adversarial fine-tuning, are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.

Title: Entente: Cross-silo Intrusion Detection on Network Log Graphs with Federated Learning

Authors: Jiacen Xu, Chenang Li, Yu Zheng, Zhou Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.14284
Pdf URL: https://arxiv.org/pdf/2503.14284
Copy Paste: [[2503.14284]] Entente: Cross-silo Intrusion Detection on Network Log Graphs with Federated Learning(https://arxiv.org/abs/2503.14284)
Keywords: privacy, attack, federate
Abstract: Graph-based Network Intrusion Detection System (GNIDS) has gained significant momentum in detecting sophisticated cyber-attacks, like Advanced Persistent Threat (APT), in an organization or across organizations. Though achieving satisfying detection accuracy and adapting to ever-changing attacks and normal patterns, all prior GNIDSs assume the centralized data settings directly, but non-trivial data collection is not always practical under privacy regulations nowadays. We argue that training a GNIDS model has to consider privacy regulations, and propose to leverage federated learning (FL) to address this prominent challenge. Yet, directly applying FL to GNIDS is unlikely to succeed, due to issues like non-IID (independent and identically distributed) graph data over clients and the diverse design choices taken by different GNIDS. We address these issues with a set of novel techniques tailored to the graph datasets, including reference graph synthesis, graph sketching and adaptive contribution scaling, and develop a new system Entente. We evaluate Entente on the large-scale LANL, OpTC and Pivoting datasets. The result shows Entente outperforms the other baseline FL algorithms and sometimes even the non-FL GNIDS. We also evaluate Entente under FL poisoning attacks tailored to the GNIDS setting, and show Entente is able to bound the attack success rate to low values. Overall, our result suggests building cross-silo GNIDS is feasible and we hope to encourage more efforts in this direction.

Title: Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Authors: Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Samantha Work
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14286
Pdf URL: https://arxiv.org/pdf/2503.14286
Copy Paste: [[2503.14286]] Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs(https://arxiv.org/abs/2503.14286)
Keywords: generative, large language model
Abstract: We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
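One way to picture an "asymmetric, tapered" importance weight is sketched below. The abstract does not spell out the rule, so the specific choice here is an assumption made for illustration: positive-reward samples keep a weight of 1, while negative-reward samples use the policy-to-behaviour probability ratio clipped to at most 1. The function name and arguments are hypothetical.

```python
import math

def tapered_weight(logp_policy, logp_behaviour, reward):
    """Per-sample weight applied to reward * grad log pi (illustrative only).

    Positive examples are not reweighted; negative examples are weighted by
    the importance ratio pi / mu, truncated so it never exceeds 1.
    """
    if reward >= 0:
        return 1.0
    ratio = math.exp(logp_policy - logp_behaviour)
    return min(ratio, 1.0)

# Hypothetical log-probabilities of the same sampled answer under the
# current policy (pi) and the data-generating policy (mu).
w_positive = tapered_weight(math.log(0.02), math.log(0.10), reward=1.0)
w_negative_rare = tapered_weight(math.log(0.02), math.log(0.10), reward=-1.0)
w_negative_common = tapered_weight(math.log(0.30), math.log(0.10), reward=-1.0)
```

In this reading, a negative example the current policy already considers unlikely contributes a small push, which is one intuition for stable dynamics without KL regularization.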

Title: Improved Scalable Lipschitz Bounds for Deep Neural Networks

Authors: Usman Syed, Bin Hu
Subjects: cs.LG, eess.SY, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2503.14297
Pdf URL: https://arxiv.org/pdf/2503.14297
Copy Paste: [[2503.14297]] Improved Scalable Lipschitz Bounds for Deep Neural Networks(https://arxiv.org/abs/2503.14297)
Keywords: robust
Abstract: Computing tight Lipschitz bounds for deep neural networks is crucial for analyzing their robustness and stability, but existing approaches either produce relatively conservative estimates or rely on semidefinite programming (SDP) formulations (namely the LipSDP condition) that face scalability issues. Building upon ECLipsE-Fast, the state-of-the-art Lipschitz bound method that avoids SDP formulations, we derive a new family of improved scalable Lipschitz bounds that can be combined to outperform ECLipsE-Fast. Specifically, we leverage more general parameterizations of feasible points of LipSDP to derive various closed-form Lipschitz bounds, avoiding the use of SDP solvers. In addition, we show that our technique encompasses ECLipsE-Fast as a special case and leads to a much larger class of scalable Lipschitz bounds for deep neural networks. Our empirical study shows that our bounds improve ECLipsE-Fast, further advancing the scalability and precision of Lipschitz estimation for large neural networks.
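For context on the "relatively conservative estimates" mentioned in this abstract, the simplest scalable bound multiplies per-layer operator norms. The sketch below is that textbook baseline, not ECLipsE-Fast or the bounds proposed in the paper. It assumes 1-Lipschitz activations such as ReLU and uses the induced infinity norm so that no linear-algebra library is needed; the weight values are made up.

```python
def inf_norm(matrix):
    """Induced infinity norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in matrix)

def naive_lipschitz_bound(weights):
    """Product of per-layer norms; valid when every activation is 1-Lipschitz."""
    bound = 1.0
    for w in weights:
        bound *= inf_norm(w)
    return bound

# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    [[1.0, -2.0], [0.5, 0.5]],
    [[1.0, 1.0]],
]
bound = naive_lipschitz_bound(layers)
```

The product ignores how consecutive layers interact, which is why bounds of this kind tend to be loose for deep networks.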

Title: Unveiling the Role of Randomization in Multiclass Adversarial Classification: Insights from Graph Theory

Authors: Lucas Gnecco-Heredia, Matteo Sammut, Muni Sreenivas Pydi, Rafael Pinot, Benjamin Negrevergne, Yann Chevaleyre
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14299
Pdf URL: https://arxiv.org/pdf/2503.14299
Copy Paste: [[2503.14299]] Unveiling the Role of Randomization in Multiclass Adversarial Classification: Insights from Graph Theory(https://arxiv.org/abs/2503.14299)
Keywords: attack, robust
Abstract: Randomization as a means to improve the adversarial robustness of machine learning models has recently attracted significant attention. Unfortunately, much of the theoretical analysis so far has focused on binary classification, providing only limited insights into the more complex multiclass setting. In this paper, we take a step toward closing this gap by drawing inspiration from the field of graph theory. Our analysis focuses on discrete data distributions, allowing us to cast the adversarial risk minimization problems within the well-established framework of set packing problems. By doing so, we are able to identify three structural conditions on the support of the data distribution that are necessary for randomization to improve robustness. Furthermore, we are able to construct several data distributions where (contrary to binary classification) switching from a deterministic to a randomized solution significantly reduces the optimal adversarial risk. These findings highlight the crucial role randomization can play in enhancing robustness to adversarial attacks in multiclass classification.

Title: COPA: Comparing the Incomparable to Explore the Pareto Front

Authors: Adrián Javaloy, Antonio Vergari, Isabel Valera
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14321
Pdf URL: https://arxiv.org/pdf/2503.14321
Copy Paste: [[2503.14321]] COPA: Comparing the Incomparable to Explore the Pareto Front(https://arxiv.org/abs/2503.14321)
Keywords: large language model
Abstract: In machine learning (ML), it is common to account for multiple objectives when, e.g., selecting a model to deploy. However, it is often unclear how one should compare, aggregate and, ultimately, trade-off these objectives, as they might be measured in different units or scales. For example, when deploying large language models (LLMs), we might not only care about their performance, but also their CO2 consumption. In this work, we investigate how objectives can be sensibly compared and aggregated to navigate their Pareto front. To do so, we propose to make incomparable objectives comparable via their CDFs, approximated by their relative rankings. This allows us to aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front. We demonstrate the potential impact of our methodology in diverse areas such as LLM selection, domain generalization, and AutoML benchmarking, where classical ways to aggregate and normalize objectives fail.
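The idea of comparing objectives through rank-approximated CDFs can be sketched as follows. This is a minimal illustration, not the COPA procedure itself: the weighted-sum aggregation, the "lower is better" convention, and the numbers are all assumptions, and the function names are hypothetical.

```python
def rank_cdf(values):
    """Map each value to its relative rank in (0, 1], an empirical CDF estimate.

    Each entry becomes the fraction of observed values that are <= it, so
    objectives in different units end up on the same unitless scale.
    """
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]

def aggregate(objectives, weights):
    """Weighted sum of rank-normalised objectives (lower is better for all)."""
    normalised = [rank_cdf(column) for column in objectives]
    n_models = len(objectives[0])
    return [
        sum(w * column[i] for w, column in zip(weights, normalised))
        for i in range(n_models)
    ]

# Hypothetical candidates: error rate (unitless) and emissions (kg CO2).
error = [0.10, 0.20, 0.15, 0.30]
co2 = [900.0, 100.0, 300.0, 50.0]

accuracy_first = aggregate([error, co2], weights=[0.8, 0.2])
emissions_first = aggregate([error, co2], weights=[0.2, 0.8])
best_for_accuracy = min(range(4), key=accuracy_first.__getitem__)
best_for_emissions = min(range(4), key=emissions_first.__getitem__)
```

Changing only the preference weights moves the selected model along the Pareto front, without ever having to decide how many kilograms of CO2 one point of error is worth.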

Title: DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Authors: Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, Kaicheng Yu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14324
Pdf URL: https://arxiv.org/pdf/2503.14324
Copy Paste: [[2503.14324]] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies(https://arxiv.org/abs/2503.14324)
Keywords: large language model
Abstract: The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.

Title: LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

Authors: Yu Cheng, Fajie Yuan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14325
Pdf URL: https://arxiv.org/pdf/2503.14325
Copy Paste: [[2503.14325]] LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models(https://arxiv.org/abs/2503.14325)
Keywords: diffusion
Abstract: Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at this https URL.

Title: EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

Authors: Yufei Zhu, Yiming Zhong, Zemin Yang, Peishan Cong, Jingyi Yu, Xinge Zhu, Yuexin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14329
Pdf URL: https://arxiv.org/pdf/2503.14329
Copy Paste: [[2503.14329]] EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment(https://arxiv.org/abs/2503.14329)
Keywords: robust
Abstract: Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios.

Title: Revealing higher-order neural representations with generative artificial intelligence

Authors: Hojjat Azimi Asrari, Megan A. K. Peters
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.14333
Pdf URL: https://arxiv.org/pdf/2503.14333
Copy Paste: [[2503.14333]] Revealing higher-order neural representations with generative artificial intelligence(https://arxiv.org/abs/2503.14333)
Keywords: diffusion, generative
Abstract: Studies often aim to reveal how neural representations encode aspects of an observer's environment, such as its contents or structure. These are "first-order" representations (FORs), because they're "about" the external world. A less-common target is "higher-order" representations (HORs), which are "about" FORs -- their contents, stability, or uncertainty. HORs of uncertainty appear critically involved in adaptive behaviors including learning under uncertainty, influencing learning rates and internal model updating based on environmental feedback. However, HORs about uncertainty are unlikely to be direct "read-outs" of FOR characteristics, instead reflecting estimation processes which may be lossy, bias-prone, or distortive and which may also incorporate estimates of distributions of uncertainty the observer is likely to experience. While some research has targeted neural representations of "instantaneously" estimated uncertainty, how the brain represents distributions of expected uncertainty remains largely unexplored. Here, we propose a novel reinforcement learning (RL) based generative artificial intelligence (genAI) approach to explore neural representations of uncertainty distributions. We use existing functional magnetic resonance imaging data, where humans learned to 'de-noise' their brain states to achieve target neural patterns, to train denoising diffusion genAI models with RL algorithms to learn noise distributions similar to how humans might learn to do the same. We then explore these models' learned noise-distribution HORs compared to control models trained with traditional backpropagation. Results reveal model-dependent differences in noise distribution representations -- with the RL-based model offering much higher explanatory power for human behavior -- offering an exciting path towards using genAI to explore neural noise-distribution HORs.

Title: PENCIL: Long Thoughts with Short Memory

Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14337
Pdf URL: https://arxiv.org/pdf/2503.14337
Copy Paste: [[2503.14337]] PENCIL: Long Thoughts with Short Memory(https://arxiv.org/abs/2503.14337)
Keywords: transformer
Abstract: While recent works (e.g. o1, DeepSeek R1) have demonstrated great promise of using long Chain-of-Thought (CoT) to improve reasoning capabilities of language models, scaling it up during test-time is challenging due to inefficient memory usage -- intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We propose PENCIL, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts based on patterns learned from training. With this reduction mechanism, PENCIL significantly reduces the maximal context length required during generation, and thus can generate longer thoughts with limited memory, solving larger-scale problems given more thinking time. For example, we demonstrate PENCIL achieves 97% accuracy on the challenging Einstein's puzzle -- a task even large models like GPT-4 struggle with -- using only a small 25M-parameter transformer with 2048 context length. Theoretically, we prove PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and thus can solve arbitrary computational tasks that would otherwise be intractable given context window constraints.

Title: 3D Densification for Multi-Map Monocular VSLAM in Endoscopy

Authors: X. Anadón, Javier Rodríguez-Puigvert, J.M.M. Montiel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14346
Pdf URL: https://arxiv.org/pdf/2503.14346
Copy Paste: [[2503.14346]] 3D Densification for Multi-Map Monocular VSLAM in Endoscopy(https://arxiv.org/abs/2503.14346)
Keywords: robust
Abstract: Multi-map sparse monocular visual Simultaneous Localization and Mapping (VSLAM) applied to monocular endoscopic sequences has proven effective at robustly recovering tracking after the frequent losses in endoscopy caused by motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization; however, they represent the environment poorly: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state of the art in sparse multi-map endoscopy SLAM. Up-to-scale dense depth predictions from the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS (least median of squares), which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, on the C3VD phantom colon dataset. We report qualitative results on real colonoscopies from the Endomapper dataset.
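
The LMedS alignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the alignment reduces to a single scale factor between predicted depths and sparse map depths, which is a simplification; the function name `lmeds_scale` and the sample values are hypothetical and not taken from the paper. Every correspondence proposes a candidate scale, and the candidate with the smallest median squared residual wins, so a minority of outliers cannot dominate the fit.

```python
from statistics import median


def lmeds_scale(pred_depths, sparse_depths):
    """Least-median-of-squares estimate of s in sparse ~= s * pred.

    Hypothetical helper: each correspondence proposes a candidate scale
    (the minimal sample for a one-parameter model), and the candidate
    with the lowest median squared residual is returned.
    """
    pairs = [(p, z) for p, z in zip(pred_depths, sparse_depths) if p > 0]
    if not pairs:
        raise ValueError("need at least one correspondence with positive depth")
    best_scale, best_cost = None, float("inf")
    for p, z in pairs:
        s = z / p  # candidate scale from one correspondence
        cost = median((zj - s * pj) ** 2 for pj, zj in pairs)
        if cost < best_cost:
            best_scale, best_cost = s, cost
    return best_scale, best_cost


# Made-up data: four consistent points and one gross outlier.
pred = [1.0, 2.0, 3.0, 4.0, 5.0]
sparse = [2.0, 4.0, 6.0, 8.0, 50.0]
scale, cost = lmeds_scale(pred, sparse)
print(scale, cost)
```

Trying every correspondence is a deterministic variant that is practical for small inputs; LMedS implementations commonly draw random minimal samples instead. Points whose residual is large relative to the winning median can then be discarded as outliers.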

Title: VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Authors: Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14350
Pdf URL: https://arxiv.org/pdf/2503.14350
Copy Paste: [[2503.14350]] VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation(https://arxiv.org/abs/2503.14350)
Keywords: diffusion, segmentation
Abstract: Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

Title: MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts

Authors: Runqi Meng, Sifan Song, Pengfei Jin, Yujin Oh, Lin Teng, Yulin Wang, Yiqun Sun, Ling Chen, Xiang Li, Quanzheng Li, Ning Guo, Dinggang Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14355
Pdf URL: https://arxiv.org/pdf/2503.14355
Copy Paste: [[2503.14355]] MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts(https://arxiv.org/abs/2503.14355)
Keywords: segmentation
Abstract: Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.

Title: Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis

Authors: Alexander Partin (1), Priyanka Vasanthakumari (1), Oleksandr Narykov (1), Andreas Wilke (1), Natasha Koussa (2), Sara E. Jones (2), Yitan Zhu (1), Jamie C. Overbeek (1), Rajeev Jain (1), Gayara Demini Fernando (3), Cesar Sanchez-Villalobos (4), Cristina Garcia-Cardona (5), Jamaludin Mohd-Yusof (5), Nicholas Chia (1), Justin M. Wozniak (1), Souparno Ghosh (3), Ranadip Pal (4), Thomas S. Brettin (1), M. Ryan Weil (2), Rick L. Stevens (1 and 6) ((1) Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA, (2) Frederick National Laboratory for Cancer Research, Cancer Data Science Initiatives, Cancer Research Technology Program, Frederick, MD, USA, (3) Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA, (4) Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, USA, (5) Division of Computer, Computational and Statistical Sciences, Los Alamos National Laboratory, Los Alamos, NM, USA, (6) Department of Computer Science, The University of Chicago, Chicago, IL, USA)
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.14356
Pdf URL: https://arxiv.org/pdf/2503.14356
Copy Paste: [[2503.14356]] Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis(https://arxiv.org/abs/2503.14356)
Keywords: robust
Abstract: Deep learning (DL) and machine learning (ML) models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, six standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.
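
The distinction between absolute and relative cross-dataset performance can be sketched as follows. This is an illustrative reading of the abstract, not the paper's metric definitions: the function `generalization_summary`, the dataset labels, and all scores are hypothetical, and the relative drop is assumed to be normalized by the within-dataset score.

```python
def generalization_summary(scores, source):
    """Summarize cross-dataset generalization for one source dataset.

    scores[src][tgt] is a higher-is-better metric for a model trained on
    src and tested on tgt (hypothetical layout). Returns the within-dataset
    score, the mean cross-dataset score (absolute performance), and the
    per-target drop relative to the within-dataset score.
    """
    within = scores[source][source]
    cross = {t: v for t, v in scores[source].items() if t != source}
    if within <= 0 or not cross:
        raise ValueError("need a positive within-dataset score and at least one target")
    mean_cross = sum(cross.values()) / len(cross)
    relative_drop = {t: (within - v) / within for t, v in cross.items()}
    return {"within": within, "mean_cross": mean_cross, "relative_drop": relative_drop}


# Made-up scores for placeholder datasets A, B, C.
scores = {"A": {"A": 0.80, "B": 0.60, "C": 0.40}}
summary = generalization_summary(scores, "A")
print(summary)
```

Reporting both numbers matters because a model can have a small relative drop simply because its within-dataset score was low to begin with.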

Title: RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment

Authors: Chao Wang, Giulio Franzese, Alessandro Finamore, Pietro Michiardi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14358
Pdf URL: https://arxiv.org/pdf/2503.14358
Copy Paste: [[2503.14358]] RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment(https://arxiv.org/abs/2503.14358)
Keywords: diffusion
Abstract: Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt, i.e., images show wrong attribute binding, subject positioning, numeracy, etc. While the literature offers many methods to improve T2I alignment, they all consider only Diffusion Models, and require auxiliary datasets, scoring models, and linguistic analysis of the prompt. In this paper we aim to address these gaps. First, we introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation. Then, we investigate a self-supervised fine-tuning approach for T2I alignment based on RFMI that does not require auxiliary information other than the pre-trained model itself. Specifically, a fine-tuning set is constructed by selecting synthetic images generated from the pre-trained RF model and having high point-wise MI between images and prompts. Our experiments on MI estimation benchmarks demonstrate the validity of RFMI, and empirical fine-tuning on SD3.5-Medium confirms the effectiveness of RFMI for improving T2I alignment while maintaining image quality.

Title: Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels

Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14376
Pdf URL: https://arxiv.org/pdf/2503.14376
Copy Paste: [[2503.14376]] Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels(https://arxiv.org/abs/2503.14376)
Keywords: transformer
Abstract: Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels. Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs that enables arbitrarily large chunk sizes by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM. Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.

Title: Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation

Authors: Rikuto Tsuchida, Hibiki Yokoyama, Takehito Utsuro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14382
Pdf URL: https://arxiv.org/pdf/2503.14382
Copy Paste: [[2503.14382]] Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation(https://arxiv.org/abs/2503.14382)
Keywords: large language model
Abstract: The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil with respect to judging good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, the collected sentences are categorized based on their contents by ChatGPT, where ChatGPT assigns a category name to each of those categories. Those assigned category names are referred to as "aspects" of each celebrity. Then, by applying the framework of retrieval augmented generation (RAG), we show that the large language model is quite effective in the task of judging good/evil reputation of aspects and descriptions of each celebrity. Finally, to demonstrate the advantages of the proposed method over existing services incorporating RAG functions, we show that the proposed method of judging good/evil of aspects/descriptions of each celebrity significantly outperforms an existing service incorporating RAG functions.

Title: Vexed by VEX tools: Consistency evaluation of container vulnerability scanners

Authors: Yekatierina Churakova, Mathias Ekstedt
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.14388
Pdf URL: https://arxiv.org/pdf/2503.14388
Copy Paste: [[2503.14388]] Vexed by VEX tools: Consistency evaluation of container vulnerability scanners(https://arxiv.org/abs/2503.14388)
Keywords: secure, security
Abstract: This paper presents a study that analyzed state-of-the-art vulnerability scanning tools applied to containers. We have focused the work on tools following the Vulnerability Exploitability eXchange (VEX) format, which has been introduced to complement Software Bills of Material (SBOM) with security advisories of known vulnerabilities. Being able to get an accurate understanding of vulnerabilities found in the dependencies of third-party software is critical for secure software development and risk analysis. Accepting the overwhelming challenge of estimating the precise accuracy and precision of a vulnerability scanner, we have in this study instead set out to explore how consistently different tools perform. By doing this, we aim to assess the maturity of the VEX tool field as a whole (rather than any particular tool). We have used the Jaccard and Tversky indices to produce similarity scores of tool performance for several different datasets created from container images. Overall, our results show a low level of consistency among the tools, thus indicating a low level of maturity in the VEX tool space. We have performed a number of experiments to find an explanation for our results, but largely they are inconclusive and further research is needed to understand the underlying causalities of our findings.
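
The two similarity measures named in the abstract have standard set-based definitions, sketched below for comparing the findings of two scanners. The scanner outputs and identifiers are made up for illustration, and treating two empty result sets as identical is an assumption of this sketch rather than something stated in the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def tversky(a: set, b: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky index |A & B| / (|A & B| + alpha*|A - B| + beta*|B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    if denom == 0:
        return 1.0
    return common / denom


# Hypothetical findings reported by two scanners for the same image.
tool_a = {"vuln-1", "vuln-2", "vuln-3"}
tool_b = {"vuln-2", "vuln-3", "vuln-4", "vuln-5"}

print(jaccard(tool_a, tool_b))                      # symmetric overlap
print(tversky(tool_a, tool_b, alpha=1.0, beta=0.0)) # share of tool_a's findings also in tool_b
```

With alpha = beta = 1 the Tversky index reduces to the Jaccard index; unequal weights make the comparison asymmetric, which helps when one tool's output is treated as the reference.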

Title: How much do LLMs learn from negative examples?

Authors: Shadi Hamdan, Deniz Yuret
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14391
Pdf URL: https://arxiv.org/pdf/2503.14391
Copy Paste: [[2503.14391]] How much do LLMs learn from negative examples?(https://arxiv.org/abs/2503.14391)
Keywords: large language model
Abstract: Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.

Title: From "Hallucination" to "Suture": Insights from Language Philosophy to Enhance Large Language Models

Authors: Qiantong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14392
Pdf URL: https://arxiv.org/pdf/2503.14392
Copy Paste: [[2503.14392]] From "Hallucination" to "Suture": Insights from Language Philosophy to Enhance Large Language Models(https://arxiv.org/abs/2503.14392)
Keywords: robust, large language model
Abstract: This paper explores hallucination phenomena in large language models (LLMs) through the lens of language philosophy and psychoanalysis. By incorporating Lacan's concepts of the "chain of signifiers" and "suture points," we propose the Anchor-RAG framework as a novel approach to mitigate hallucinations. In contrast to the predominant reliance on trial-and-error experiments, constant adjustments of mathematical formulas, or resource-intensive methods that emphasize quantity over quality, our approach returns to the fundamental principles of linguistics to analyze the root causes of hallucinations in LLMs. Drawing from robust theoretical foundations, we derive algorithms and models that are not only effective in reducing hallucinations but also enhance LLM performance and improve output quality. This paper seeks to establish a comprehensive theoretical framework for understanding hallucinations in LLMs and aims to challenge the prevalent "guess-and-test" approach and rat race mentality in the field. We aspire to pave the way for a new era of interpretable LLMs, offering deeper insights into the inner workings of language-based AI systems.

Title: Technical Report: Aggregation on Learnable Manifolds for Asynchronous Federated Optimization

Authors: Archie Licudi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14396
Pdf URL: https://arxiv.org/pdf/2503.14396
Copy Paste: [[2503.14396]] Technical Report: Aggregation on Learnable Manifolds for Asynchronous Federated Optimization(https://arxiv.org/abs/2503.14396)
Keywords: federate
Abstract: In Federated Learning (FL), a primary challenge to the server-side aggregation of client models is device heterogeneity, in both loss landscape geometry and computational capacity. This issue can be particularly pronounced in clinical contexts where variations in data distribution (aggravated by class imbalance), infrastructure requirements, and sample sizes are common. We propose AsyncManifold, a novel asynchronous FL framework to address these issues by taking advantage of underlying solution space geometry, at each of the local training, delay-correction, and aggregation stages. Our proposal is accompanied by a convergence proof in a general form and, motivated by thorough exploratory studies of local behaviour, a proof-of-concept algorithm which performs aggregation along non-linear mode connections and hence avoids barriers to convergence that techniques based on linear interpolation will encounter.

Title: Diffusion-based Facial Aesthetics Enhancement with 3D Structure Guidance

Authors: Lisha Li, Jingwen Hou, Weide Liu, Yuming Fang, Jiebin Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14402
Pdf URL: https://arxiv.org/pdf/2503.14402
Copy Paste: [[2503.14402]] Diffusion-based Facial Aesthetics Enhancement with 3D Structure Guidance(https://arxiv.org/abs/2503.14402)
Keywords: diffusion
Abstract: Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopted deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieved promising results, they potentially produced excessively beautified results with lower identity consistency or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose the Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that the depth and contour guidance can be extracted from the 3D face model. Then the depth and contour clues can provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.

Title: DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers

Authors: Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, Yannis Kalantidis
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14405
Pdf URL: https://arxiv.org/pdf/2503.14405
Copy Paste: [[2503.14405]] DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers(https://arxiv.org/abs/2503.14405)
Keywords: segmentation
Abstract: Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.

Title: Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models

Authors: Siwei Zhang, Yun Xiong, Yateng Tang, Xi Chen, Zian Jia, Zehao Gu, Jiarong Xu, Jiawei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14411
Pdf URL: https://arxiv.org/pdf/2503.14411
Copy Paste: [[2503.14411]] Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models(https://arxiv.org/abs/2503.14411)
Keywords: robust, large language model
Abstract: Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that biasedly prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present Cross, a novel framework that seamlessly extends existing TGNNs for TTAG modeling. The key idea is to employ the advanced large language models (LLMs) to extract the dynamic semantics in text space and then generate expressive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the Cross framework, which empowers the LLM to offer the temporal semantic understanding of node's evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experimental results on four public datasets and one practical industrial dataset demonstrate Cross's significant effectiveness and robustness.

Title: ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Authors: Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.14421
Pdf URL: https://arxiv.org/pdf/2503.14421
Copy Paste: [[2503.14421]] ExDDV: A New Dataset for Explainable Deepfake Detection in Video(https://arxiv.org/abs/2503.14421)
Keywords: robust
Abstract: The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at this https URL.

Title: MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Authors: Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, Jie Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14428
Pdf URL: https://arxiv.org/pdf/2503.14428
Copy Paste: [[2503.14428]] MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation(https://arxiv.org/abs/2503.14428)
Keywords: diffusion
Abstract: Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the Conditioning Stage, we introduce Semantic Anchor Disambiguation to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) during the Denoising Stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: this https URL.

Title: Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition

Authors: Zefeng Qian, Chongyang Zhang, Yifei Huang, Gang Wang, Jiangyong Ying
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14430
Pdf URL: https://arxiv.org/pdf/2503.14430
Copy Paste: [[2503.14430]] Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition(https://arxiv.org/abs/2503.14430)
Keywords: segmentation
Abstract: Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods utilize mainly image-level features that incorporate background noise and focus insufficiently on real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) for Few-shot Action Recognition. The core concept of I2ST is to perceive the action-related instances and integrate them with image features via spatial-temporal attention. Specifically, I2ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial-temporal Attention is used to construct the feature dependency between instances and images...

Title: PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

Authors: Wei Fang, Yang Zhang, Kaizhi Qian, James Glass, Yada Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14432
Pdf URL: https://arxiv.org/pdf/2503.14432
Copy Paste: [[2503.14432]] PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play(https://arxiv.org/abs/2503.14432)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.

Title: LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Authors: Nikhil Abhyankar, Parshin Shojaee, Chandan K. Reddy
Subjects: cs.LG, cs.AI, cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2503.14434
Pdf URL: https://arxiv.org/pdf/2503.14434
Copy Paste: [[2503.14434]] LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers(https://arxiv.org/abs/2503.14434)
Keywords: large language model
Abstract: Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.

Title: EnvBench: A Benchmark for Automated Environment Setup

Authors: Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, Yaroslav Zharov
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2503.14443
Pdf URL: https://arxiv.org/pdf/2503.14443
Copy Paste: [[2503.14443]] EnvBench: A Benchmark for Automated Environment Setup(https://arxiv.org/abs/2503.14443)
Keywords: large language model
Abstract: Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce a comprehensive environment setup benchmark, EnvBench. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, that we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach manages to successfully configure 6.69% of repositories for Python and 29.47% of repositories for JVM, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at this https URL. The dataset and experiment trajectories are available at this https URL.
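Note: as a rough sanity check, the reported success rates can be converted back into approximate repository counts. The sketch below assumes the rates are computed over the full 329 Python and 665 JVM repositories, which the abstract does not state explicitly; the helper name `approx_count` is illustrative only.

```python
def approx_count(total: int, percent: float) -> int:
    """Round a percentage of `total` to the nearest whole repository."""
    return round(total * percent / 100)


# Percentages and totals are taken from the abstract; using the full
# benchmark sizes as denominators is an assumption.
python_solved = approx_count(329, 6.69)
jvm_solved = approx_count(665, 29.47)

print(f"Python: ~{python_solved} of 329 repositories configured")
print(f"JVM:    ~{jvm_solved} of 665 repositories configured")
```

Under that assumption, the best approach configures roughly two dozen Python repositories and just under two hundred JVM repositories.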

Title: Bolt3D: Generating 3D Scenes in Seconds

Authors: Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14445
Pdf URL: https://arxiv.org/pdf/2503.14445
Copy Paste: [[2503.14445]] Bolt3D: Generating 3D Scenes in Seconds(https://arxiv.org/abs/2503.14445)
Keywords: diffusion, generative
Abstract: We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

Title: RWKV-7 "Goose" with Expressive Dynamic State Evolution

Authors: Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Hou, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14456
Pdf URL: https://arxiv.org/pdf/2503.14456
Copy Paste: [[2503.14456]] RWKV-7 "Goose" with Expressive Dynamic State Evolution(https://arxiv.org/abs/2503.14456)
Keywords: transformer
Abstract: We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at this https URL, and our training and inference code at this https URL all under the Apache 2.0 License.

Title: SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model

Authors: Yucheng Mao, Boyang Wang, Nilesh Kulkarni, Jeong Joon Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14463
Pdf URL: https://arxiv.org/pdf/2503.14463
Copy Paste: [[2503.14463]] SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model(https://arxiv.org/abs/2503.14463)
Keywords: robust, diffusion
Abstract: The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.

Title: Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

Authors: Xinyu Fang, Zhijian Chen, Kai Lan, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14478
Pdf URL: https://arxiv.org/pdf/2503.14478
Copy Paste: [[2503.14478]] Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM(https://arxiv.org/abs/2503.14478)
Keywords: generative, large language model
Abstract: Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at this https URL.

Title: DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Authors: Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14487
Pdf URL: https://arxiv.org/pdf/2503.14487
Copy Paste: [[2503.14487]] DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers(https://arxiv.org/abs/2503.14487)
Keywords: diffusion, transformer
Abstract: Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project page: this https URL.

Title: Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Authors: Jensen (Jinghao) Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14489
Pdf URL: https://arxiv.org/pdf/2503.14489
Copy Paste: [[2503.14489]] Stable Virtual Camera: Generative View Synthesis with Diffusion Models(https://arxiv.org/abs/2503.14489)
Keywords: diffusion, generative
Abstract: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.

Title: Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Authors: NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14492
Pdf URL: https://arxiv.org/pdf/2503.14492
Copy Paste: [[2503.14492]] Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control(https://arxiv.org/abs/2503.14492)
Keywords: segmentation
Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
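Note: the idea of weighting different conditional inputs differently at different spatial locations can be illustrated with a toy per-pixel weighted average. This is only a sketch under stated assumptions, not the Cosmos-Transfer architecture: the function `fuse_controls`, the per-location normalization, and the use of scalar control values are all hypothetical choices made for illustration.

```python
def fuse_controls(controls, weights, eps=1e-8):
    """Blend several control maps with per-pixel weights.

    controls: dict mapping modality name -> H x W nested list of floats
    weights:  dict mapping modality name -> H x W nested list of
              non-negative floats (one weight map per modality)
    Weights are normalized at each location (an assumption of this sketch).
    """
    names = sorted(controls)
    height = len(controls[names[0]])
    width = len(controls[names[0]][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(weights[n][y][x] for n in names) + eps
            fused[y][x] = sum(
                weights[n][y][x] * controls[n][y][x] for n in names
            ) / total
    return fused


# Toy 1 x 2 example: trust "depth" on the left pixel and "edge" on the right.
controls = {"depth": [[1.0, 1.0]], "edge": [[3.0, 3.0]]}
weights = {"depth": [[1.0, 0.0]], "edge": [[0.0, 1.0]]}
fused = fuse_controls(controls, weights)
print(fused)
```

In this toy case the left pixel follows the depth signal and the right pixel follows the edge signal, which is the kind of location-dependent control the abstract describes.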

Title: State Space Model Meets Transformer: A New Paradigm for 3D Object Detection

Authors: Chuxin Wang, Wenfei Yang, Xiang Liu, Tianzhu Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14493
Pdf URL: https://arxiv.org/pdf/2503.14493
Copy Paste: [[2503.14493]] State Space Model Meets Transformer: A New Paradigm for 3D Object Detection(https://arxiv.org/abs/2503.14493)
Keywords: transformer
Abstract: DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSMs) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point clouds and SSMs: the serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM; the inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new SOTA on the ScanNet V2 and SUN RGB-D datasets.

Title: Deeply Supervised Flow-Based Generative Models

Authors: Inkyu Shin, Chenglin Yang, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14494
Pdf URL: https://arxiv.org/pdf/2503.14494
Copy Paste: [[2503.14494]] Deeply Supervised Flow-Based Generative Models(https://arxiv.org/abs/2503.14494)
Keywords: transformer, generative
Abstract: Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final-layer output underutilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8 times faster on ImageNet with equivalent performance and further reduces FID by 2.6 while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MSCOCO and zero-shot GenEval.
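Note: for readers unfamiliar with the phrase "velocity of a linear interpolant", one common convention from the flow-matching literature is written out below. The symbols ($x_0$ for a noise sample, $x_1$ for a data sample, $v_\theta$ for the network) and the direction of the interpolation are assumptions of this sketch; the paper's own notation may differ.

```latex
% Linear interpolant between a noise sample x_0 and a data sample x_1
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1]

% Its time derivative (the target velocity) is constant along the path
\frac{\mathrm{d} x_t}{\mathrm{d} t} = x_1 - x_0

% A network v_\theta is trained to regress this velocity
\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```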

Title: Advances in 4D Generation: A Survey

Authors: Qiaowei Miao, Kehan Li, Jinsheng Quan, Zhiyuan Min, Shaojie Ma, Yichao Xu, Yi Yang, Yawei Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14501
Pdf URL: https://arxiv.org/pdf/2503.14501
Copy Paste: [[2503.14501]] Advances in 4D Generation: A Survey(https://arxiv.org/abs/2503.14501)
Keywords: generative
Abstract: Generative artificial intelligence has witnessed remarkable advancements across multiple domains in recent years. Building on the successes of 2D and 3D content generation, 4D generation, which incorporates the temporal dimension into generative tasks, has emerged as a burgeoning yet rapidly evolving research area. This paper presents a comprehensive survey of this emerging field, systematically examining its theoretical foundations, key methodologies, and practical applications, with the aim of providing readers with a holistic understanding of the current state and future potential of 4D generation. We begin by introducing the core concepts of 4D data representations, encompassing both structured and unstructured formats, and their implications for generative tasks. Building upon this foundation, we delve into the enabling technologies that drive 4D generation, including advancements in spatiotemporal modeling, neural representations, and generative frameworks. We further review recent studies that employ diverse control mechanisms and representation strategies for generating 4D outputs, categorizing these approaches and summarizing their research trajectories. In addition, we explore the wide-ranging applications of 4D generation techniques, spanning dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Finally, we analyze the key challenges inherent to 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and propose promising directions for future research. Our code is publicly available at this https URL.

Title: The Power of Context: How Multimodality Improves Image Super-Resolution

Authors: Kangfu Mei, Hossein Talebi, Mojtaba Ardakani, Vishal M. Patel, Peyman Milanfar, Mauricio Delbracio
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14503
Pdf URL: https://arxiv.org/pdf/2503.14503
Copy Paste: [[2503.14503]] The Power of Context: How Multimodality Improves Image Super-Resolution(https://arxiv.org/abs/2503.14503)
Keywords: diffusion, generative, segmentation
Abstract: Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at this https URL.

Title: Aligning Multimodal LLM with Human Preference: A Survey

Authors: Tao Yu, Yi-Fan Zhang, Chaoyou Fu, Junkang Wu, Jinda Lu, Kun Wang, Xingyu Lu, Yunhang Shen, Guibin Zhang, Dingjie Song, Yibo Yan, Tianlong Xu, Qingsong Wen, Zhang Zhang, Yan Huang, Liang Wang, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14504
Pdf URL: https://arxiv.org/pdf/2503.14504
Copy Paste: [[2503.14504]] Aligning Multimodal LLM with Human Preference: A Survey(https://arxiv.org/abs/2503.14504)
Keywords: large language model
Abstract: Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at this https URL.

Title: MusicInfuser: Making Video Diffusion Listen and Dance

Authors: Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14505
Pdf URL: https://arxiv.org/pdf/2503.14505
Copy Paste: [[2503.14505]] MusicInfuser: Making Video Diffusion Listen and Dance(https://arxiv.org/abs/2503.14505)
Keywords: diffusion, generative
Abstract: We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at this https URL.