2025-04-07

Title: Optimizing Humor Generation in Large Language Models: Temperature Configurations and Architectural Trade-offs

Authors: Evgenii Evstafev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02858
Pdf URL: https://arxiv.org/pdf/2504.02858
Copy Paste: [[2504.02858]] Optimizing Humor Generation in Large Language Models: Temperature Configurations and Architectural Trade-offs(https://arxiv.org/abs/2504.02858)
Keywords: large language model
Abstract: Large language models (LLMs) demonstrate increasing capabilities in creative text generation, yet systematic evaluations of their humor production remain underexplored. This study presents a comprehensive analysis of 13 state-of-the-art LLMs across five architectural families, evaluating their performance in generating technically relevant humor for software developers. Through a full factorial design testing 715 unique configurations of temperature settings and prompt variations, we assess model outputs using five weighted criteria: humor quality, domain relevance, concept originality, tone precision, and delivery efficiency. Our methodology employs rigorous statistical analysis including ANOVA, correlation studies, and quadratic regression to identify optimal configurations and architectural influences. Results reveal significant performance variations across models, with certain architectures achieving 21.8% superiority over baseline systems. Temperature sensitivity analysis demonstrates that 73% of models achieve peak performance at lower stochasticity settings (<= 0.5), though optimal ranges vary substantially by architecture. We identify distinct model clusters: compact high-performers maintaining efficiency-quality balance versus verbose specialists requiring longer outputs for marginal gains. Statistical validation confirms model architecture explains 38.7% of performance variance, with significant correlations between humor quality and concept originality. The study establishes practical guidelines for model selection and configuration, demonstrating how temperature adjustments and architectural considerations impact humor generation effectiveness. These findings advance understanding of LLM capabilities in creative technical writing and provide empirically validated configuration strategies for developers implementing humor-generation systems.
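
The quadratic-regression step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single model whose mean score has been measured at a few temperatures; the data points, the function names `fit_quadratic` and `peak_temperature`, and the use of the normal equations are illustrative assumptions, not the paper's code or results.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c using the normal equations."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    d = det3(matrix)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)  # (a, b, c)


def peak_temperature(a, b):
    """Vertex of the parabola; only a maximum when the curve opens downward."""
    return -b / (2 * a) if a < 0 else None


# Hypothetical scores for one model (not data from the paper).
temps = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
scores = [0.8 - (t - 0.4) ** 2 for t in temps]

a, b, c = fit_quadratic(temps, scores)
best = peak_temperature(a, b)
print(f"estimated optimal temperature: {best:.2f}")
```

In practice the estimated vertex would be clamped to the range of temperatures actually tested, since a fitted parabola can peak outside the sampled interval.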

Title: The Material Contracts Corpus

Authors: Peter Adelson, Julian Nyarko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02864
Pdf URL: https://arxiv.org/pdf/2504.02864
Copy Paste: [[2504.02864]] The Material Contracts Corpus(https://arxiv.org/abs/2504.02864)
Keywords: security
Abstract: This paper introduces the Material Contracts Corpus (MCC), a publicly available dataset comprising over one million contracts filed by public companies with the U.S. Securities and Exchange Commission (SEC) between 2000 and 2023. The MCC facilitates empirical research on contract design and legal language, and supports the development of AI-based legal tools. Contracts in the corpus are categorized by agreement type and linked to specific parties using machine learning and natural language processing techniques, including a fine-tuned LLaMA-2 model for contract classification. The MCC further provides metadata such as filing form, document format, and amendment status. We document trends in contractual language, length, and complexity over time, and highlight the dominance of employment and security agreements in SEC filings. This resource is available for bulk download and online access at this https URL.

Title: The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances

Authors: Yining Wang, Yuquan Wang, Xi Li, Mi Zhang, Geng Hong, Min Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02865
Pdf URL: https://arxiv.org/pdf/2504.02865
Copy Paste: [[2504.02865]] The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances(https://arxiv.org/abs/2504.02865)
Keywords: attack, large language model
Abstract: As Large Language Models (LLMs) continue to advance, they are increasingly relied upon as real-time sources of information by non-expert users. To ensure the factuality of the information they provide, much research has focused on mitigating hallucinations in LLM responses, but only in the context of formal user queries, rather than maliciously crafted ones. In this study, we introduce The Illusionist's Prompt, a novel hallucination attack that incorporates linguistic nuances into adversarial queries, challenging the factual accuracy of LLMs against five types of fact-enhancing strategies. Our attack automatically generates highly transferrable illusory prompts to induce internal factual errors, all while preserving user intent and semantics. Extensive experiments confirm the effectiveness of our attack in compromising black-box LLMs, including commercial APIs like GPT-4o and Gemini-2.0, even with various defensive mechanisms.

Title: OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery

Authors: Xiucheng Liang, Jinheng Xie, Tianhong Zhao, Rudi Stouffs, Filip Biljecki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.02866
Pdf URL: https://arxiv.org/pdf/2504.02866
Copy Paste: [[2504.02866]] OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery(https://arxiv.org/abs/2504.02866)
Keywords: extraction, large language model
Abstract: Building properties, such as height, usage, and material composition, play a crucial role in spatial data infrastructures, supporting applications such as energy simulation, risk assessment, and environmental modeling. Despite their importance, comprehensive and high-quality building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction and tagging of objective building attributes using remote sensing and street-level imagery. However, establishing a method and pipeline that integrates diverse open datasets, acquires holistic building imagery at scale, and infers comprehensive building attributes remains a significant challenge. As one of the first efforts of its kind, this study bridges these gaps by introducing OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. Our methodology proceeds in three major steps. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, effectively identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert objects into holistic perspective views that approximate real-world observation. Third, we introduce an innovative approach that harnesses and systematically investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 30,180 labeled images from seven cities. Evaluation shows that fine-tuned VLMs excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o.

Title: Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

Authors: Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02867
Pdf URL: https://arxiv.org/pdf/2504.02867
Copy Paste: [[2504.02867]] Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications(https://arxiv.org/abs/2504.02867)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision-making processes for evaluation purposes, an approach known as the LLM-as-a-judge framework. However, these existing frameworks have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, thereby reducing their generalization performance. Second, the evaluation scores produced by these frameworks are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade-off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi-agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.

Title: AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening

Authors: Frank P.-W. Lo, Jianing Qiu, Zeyu Wang, Haibao Yu, Yeming Chen, Gao Zhang, Benny Lo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02870
Pdf URL: https://arxiv.org/pdf/2504.02870
Copy Paste: [[2504.02870]] AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening(https://arxiv.org/abs/2504.02870)
Keywords: fair, large language model
Abstract: Resume screening is a critical yet time-intensive process in talent acquisition, requiring recruiters to analyze vast volumes of job applications while remaining objective, accurate, and fair. With the advancements in Large Language Models (LLMs), their reasoning capabilities and extensive knowledge bases demonstrate new opportunities to streamline and automate recruitment workflows. In this work, we propose a multi-agent framework for resume screening using LLMs to systematically process and evaluate resumes. The framework consists of four core agents, including a resume extractor, an evaluator, a summarizer, and a score formatter. To enhance the contextual relevance of candidate assessments, we integrate Retrieval-Augmented Generation (RAG) within the resume evaluator, allowing incorporation of external knowledge sources, such as industry-specific expertise, professional certifications, university rankings, and company-specific hiring criteria. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition. We assess the effectiveness of our approach by comparing AI-generated scores with ratings provided by HR professionals on a dataset of anonymized online resumes. The findings highlight the potential of multi-agent RAG-LLM systems in automating resume screening, enabling more efficient and scalable hiring workflows.

Title: Synthesized Annotation Guidelines are Knowledge-Lite Boosters for Clinical Information Extraction

Authors: Enshuo Hsu, Martin Ugbala, Krishna Kumar Kookal, Zouaidi Kawtar, Nicholas L. Rider, Muhammad F. Walji, Kirk Roberts
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.02871
Pdf URL: https://arxiv.org/pdf/2504.02871
Copy Paste: [[2504.02871]] Synthesized Annotation Guidelines are Knowledge-Lite Boosters for Clinical Information Extraction(https://arxiv.org/abs/2504.02871)
Keywords: extraction, generative, large language model
Abstract: Generative information extraction using large language models, particularly through few-shot learning, has become a popular method. Recent studies indicate that providing a detailed, human-readable guideline, similar to the annotation guidelines traditionally used for training human annotators, can significantly improve performance. However, constructing these guidelines is both labor- and knowledge-intensive. Additionally, the definitions are often tailored to meet specific needs, making them highly task-specific and often non-reusable. Handling these subtle differences requires considerable effort and attention to detail. In this study, we propose a self-improving method that harvests the knowledge summarization and text generation capacity of LLMs to synthesize annotation guidelines while requiring virtually no human input. Our zero-shot experiments on the clinical named entity recognition benchmarks, 2012 i2b2 EVENT, 2012 i2b2 TIMEX, 2014 i2b2, and 2018 n2c2 showed 25.86%, 4.36%, 0.20%, and 7.75% improvements in strict F1 scores from the no-guideline baseline. The LLM-synthesized guidelines showed equivalent or better performance compared to human-written guidelines by 1.15% to 4.14% in most tasks. In conclusion, this study proposes a novel LLM self-improving method that requires minimal knowledge and human input and is applicable to multiple biomedical domains.
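
The abstract reports strict F1 scores. As a rough illustration, the sketch below assumes "strict" means that a predicted entity counts only when its start offset, end offset, and type all match a gold entity exactly; the benchmarks' official scorers may differ in detail, and the spans shown are invented for the example.

```python
def strict_f1(gold, predicted):
    """Entity-level F1 where only exact (start, end, type) matches count."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start offset, end offset, entity type).
gold = [(0, 9, "EVENT"), (15, 24, "TIMEX"), (30, 38, "EVENT")]
pred = [(0, 9, "EVENT"), (15, 22, "TIMEX"), (30, 38, "EVENT")]

# The second prediction has the right type but the wrong end offset,
# so under strict matching it is both a false positive and a miss.
score = strict_f1(gold, pred)
print(f"strict F1 = {score:.3f}")
```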

Title: Scraping the Shadows: Deep Learning Breakthroughs in Dark Web Intelligence

Authors: Ingmar Bakermans, Daniel De Pascale, Gonçalo Marcelino, Giuseppe Cascavilla, Zeno Geradts
Subjects: cs.CL, cs.AI, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2504.02872
Pdf URL: https://arxiv.org/pdf/2504.02872
Copy Paste: [[2504.02872]] Scraping the Shadows: Deep Learning Breakthroughs in Dark Web Intelligence(https://arxiv.org/abs/2504.02872)
Keywords: extraction
Abstract: Darknet markets (DNMs) facilitate the trade of illegal goods on a global scale. Gathering data on DNMs is critical to ensuring law enforcement agencies can effectively combat crime. Manually extracting data from DNMs is an error-prone and time-consuming task. Aiming to automate this process, we develop a framework for extracting data from DNMs and evaluate the application of three state-of-the-art Named Entity Recognition (NER) models, ELMo-BiLSTM (Shah et al., 2022), UniversalNER (Zhou et al., 2024), and GLiNER (Zaratiana et al., 2023), at the task of extracting complex entities from DNM product listing pages. We propose a new annotated dataset, which we use to train, fine-tune, and evaluate the models. Our findings show that state-of-the-art NER models perform well in information extraction from DNMs, achieving 91% Precision, 96% Recall, and an F1 score of 94%. In addition, fine-tuning enhances model performance, with UniversalNER achieving the best performance.

Title: Short-PHD: Detecting Short LLM-generated Text with Topological Data Analysis After Off-topic Content Insertion

Authors: Dongjun Wei, Minjia Mao, Xiao Fang, Michael Chau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02873
Pdf URL: https://arxiv.org/pdf/2504.02873
Copy Paste: [[2504.02873]] Short-PHD: Detecting Short LLM-generated Text with Topological Data Analysis After Off-topic Content Insertion(https://arxiv.org/abs/2504.02873)
Keywords: robust, large language model
Abstract: The malicious usage of large language models (LLMs) has motivated the detection of LLM-generated texts. Previous work in topological data analysis shows that the persistent homology dimension (PHD) of text embeddings can serve as a more robust and promising score than other zero-shot methods. However, effectively detecting short LLM-generated texts remains a challenge. This paper presents Short-PHD, a zero-shot LLM-generated text detection method tailored for short texts. Short-PHD stabilizes the estimation of the previous PHD method for short texts by inserting off-topic content before the given input text and identifies LLM-generated text based on an established detection threshold. Experimental results on both public and generated datasets demonstrate that Short-PHD outperforms existing zero-shot methods in short LLM-generated text detection. Implementation codes are available online.

Title: TheBlueScrubs-v1, a comprehensive curated medical dataset derived from the internet

Authors: Luis Felipe, Carlos Garcia, Issam El Naqa, Monique Shotande, Aakash Tripathi, Vivek Rudrapatna, Ghulam Rasool, Danielle Bitterman, Gilmer Valdes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02874
Pdf URL: https://arxiv.org/pdf/2504.02874
Copy Paste: [[2504.02874]] TheBlueScrubs-v1, a comprehensive curated medical dataset derived from the internet(https://arxiv.org/abs/2504.02874)
Keywords: robust, large language model
Abstract: The need for robust and diverse data sets to train clinical large language models (cLLMs) is critical given that currently available public repositories often prove too limited in size or scope for comprehensive medical use. While resources like PubMed provide foundational medical literature, they capture only a narrow range of formal publications and omit the broader medical discourse on the internet. To address these deficits, we introduce TheBlueScrubs-v1, a curated dataset of over 25 billion medical tokens - nearly three times larger than PubMed - drawn from a broad-scale internet corpus. Our two-stage filtering pipeline employs a Logistic Regression model for document screening (achieving an AUC of approximately 0.95 on external validation), followed by verification via a 70B-parameter Llama 3.1 instruct model. Each text is assigned three LLM-based quality scores encompassing medical relevance, precision and factual detail, and safety and ethical standards. Clinician reviews confirm high concordance with these automated evaluations, and a specialized cancer classifier further labels approximately 11 billion oncology tokens. Two demonstration tasks highlight the dataset's practical value: first, we distill the safety evaluations to a smaller BERT-style model that reaches an AUC near 0.96 on unseen data; second, we fine-tune a compact LLM on a filtered subset, showing measurable improvements over standard baselines in medical benchmarks as well as private ones. This Data Descriptor details the dataset's creation and validation, underscoring its potential utility for medical AI research.
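
The two-stage filtering idea (a cheap screen followed by a more expensive verification) can be sketched as below. Both stages are stand-ins: `keyword_score` is a hypothetical substitute for the logistic regression screen, `toy_verifier` is a hypothetical substitute for the LLM check, and the threshold and term list are arbitrary assumptions rather than values from the paper.

```python
from typing import Callable, Iterable, List

# Hypothetical vocabulary used by the stand-in first-stage scorer.
MEDICAL_TERMS = {"patient", "chemotherapy", "dose", "diagnosis"}


def keyword_score(doc: str) -> float:
    """Stand-in for a trained screening classifier: fraction of medical words."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(w in MEDICAL_TERMS for w in words) / len(words)


def toy_verifier(doc: str) -> bool:
    """Stand-in for the costly second-stage check (an LLM in the abstract)."""
    return "miracle" not in doc.lower()


def two_stage_filter(
    docs: Iterable[str],
    cheap_score: Callable[[str], float],
    expensive_check: Callable[[str], bool],
    threshold: float = 0.5,
) -> List[str]:
    """Run the expensive check only on documents that pass the cheap screen."""
    kept = []
    for doc in docs:
        if cheap_score(doc) < threshold:
            continue  # rejected by stage one; stage two is never called
        if expensive_check(doc):
            kept.append(doc)
    return kept


docs = [
    "patient received chemotherapy dose",
    "football match ended late",
    "patient dose miracle cure",
]
kept = two_stage_filter(docs, keyword_score, toy_verifier)
print(kept)
```

The ordering matters for cost: the expensive verifier runs only on the subset that survives the cheap screen.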

Title: Multimodal Reference Visual Grounding

Authors: Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02876
Pdf URL: https://arxiv.org/pdf/2504.02876
Copy Paste: [[2504.02876]] Multimodal Reference Visual Grounding(https://arxiv.org/abs/2504.02876)
Keywords: large language model
Abstract: Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding. Project page with our code and dataset: this https URL

Title: Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

Authors: DongHyun Choi, Lucas Spangher, Chris Hidey, Peter Grabowski, Ramy Eskander
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02877
Pdf URL: https://arxiv.org/pdf/2504.02877
Copy Paste: [[2504.02877]] Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations(https://arxiv.org/abs/2504.02877)
Keywords: transformer, large language model
Abstract: Transformer-based Large Language Models, which suffer from high computational costs, advance so quickly that techniques proposed to streamline earlier iterations are not guaranteed to benefit more modern models. Building upon the Funnel Transformer proposed by Dai and Le (2020), which progressively compresses intermediate representations, we investigate the impact of funneling in contemporary Gemma2 Transformer architectures. We systematically evaluate various funnel configurations and recovery methods, comparing: (1) standard pretraining to funnel-aware pretraining strategies, (2) the impact of funnel-aware fine-tuning, and (3) the type of sequence recovery operation. Our results demonstrate that funneling creates information bottlenecks that propagate through deeper network layers, particularly in larger models (e.g., Gemma 7B), leading to at times unmanageable performance loss. However, carefully selecting the funneling layer and employing effective recovery strategies can substantially mitigate performance losses, achieving up to a 44% reduction in latency. Our findings highlight key trade-offs between computational efficiency and model accuracy, providing practical guidance for deploying funnel-based approaches in large-scale natural language applications.
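
To make "funneling" and "sequence recovery" concrete, here is a minimal sketch on plain Python lists. It assumes mean pooling with stride 2 as the compression step and simple repetition as the recovery step; these are common choices used for illustration only, and the paper compares several recovery operations that are not reproduced here.

```python
def mean_pool(seq, stride=2):
    """Compress a sequence of vectors by averaging each window of `stride` items."""
    pooled = []
    for i in range(0, len(seq), stride):
        window = seq[i:i + stride]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def repeat_upsample(pooled, target_len, stride=2):
    """Recover the original length by repeating each pooled vector `stride` times."""
    expanded = [list(v) for v in pooled for _ in range(stride)]
    return expanded[:target_len]


# Five token positions with two-dimensional hidden states (made-up values).
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]

pooled = mean_pool(hidden)                        # 5 positions -> 3 positions
restored = repeat_upsample(pooled, len(hidden))   # 3 positions -> 5 positions
print(len(hidden), len(pooled), len(restored))
```

The restored sequence has the original length but not the original content: positions that shared a pooling window now carry identical vectors, which is one way to picture the information bottleneck the abstract describes.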

Title: Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

Authors: Lilin Xu, Kaiyuan Hou, Xiaofan Jiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02878
Pdf URL: https://arxiv.org/pdf/2504.02878
Copy Paste: [[2504.02878]] Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding(https://arxiv.org/abs/2504.02878)
Keywords: robust, large language model
Abstract: Human activity recognition (HAR) using inertial measurement units (IMUs) increasingly leverages large language models (LLMs), yet existing approaches focus on coarse activities like walking or running. Our preliminary study indicates that pretrained LLMs fail catastrophically on fine-grained HAR tasks such as air-written letter recognition, achieving only near-random guessing accuracy. In this work, we first bridge this gap for flat-surface writing scenarios: by fine-tuning LLMs with a self-collected dataset and few-shot learning, we achieved up to a 129x improvement on 2D data. To extend this to 3D scenarios, we designed an encoder-based pipeline that maps 3D data into 2D equivalents, preserving the spatiotemporal information for robust letter prediction. Our end-to-end pipeline achieves 78% accuracy on word recognition with up to 5 letters in mid-air writing scenarios, establishing LLMs as viable tools for fine-grained HAR.

Title: Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers

Authors: Nick Whitehouse, Nicole Lincoln, Stephanie Yiu, Lizzie Catterson, Rivindu Perera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02881
Pdf URL: https://arxiv.org/pdf/2504.02881
Copy Paste: [[2504.02881]] Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers(https://arxiv.org/abs/2504.02881)
Keywords: large language model
Abstract: Legal invoice review is a costly, inconsistent, and time-consuming process, traditionally performed by Legal Operations, Lawyers or Billing Specialists who scrutinise billing compliance line by line. This study presents the first empirical comparison of Large Language Models (LLMs) against human invoice reviewers - Early-Career Lawyers, Experienced Lawyers, and Legal Operations Professionals - assessing their accuracy, speed, and cost-effectiveness. Benchmarking state-of-the-art LLMs against a ground truth set by expert legal professionals, our empirically substantiated findings reveal that LLMs decisively outperform humans across every metric. In invoice approval decisions, LLMs achieve up to 92% accuracy, surpassing the 72% ceiling set by experienced lawyers. On a granular level, LLMs dominate line-item classification, with top models reaching F-scores of 81%, compared to just 43% for the best-performing human group. Speed comparisons are even more striking - while lawyers take 194 to 316 seconds per invoice, LLMs are capable of completing reviews in as fast as 3.6 seconds. And cost? AI slashes review expenses by 99.97%, reducing invoice processing costs from an average of $4.27 per invoice for human invoice reviewers to mere cents. These results highlight the evolving role of AI in legal spend management. As law firms and corporate legal departments struggle with inefficiencies, this study signals a seismic shift: The era of LLM-powered legal spend management is not on the horizon, it has arrived. The challenge ahead is not whether AI can perform as well as human reviewers, but how legal teams will strategically incorporate it, balancing automation with human discretion.

Title: DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Authors: Sunghee Jung, Donghun Lee, Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Junrae Cho, Kihyun Kim, Eunggyun Kim, Myeongcheol Shin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02882
Pdf URL: https://arxiv.org/pdf/2504.02882
Copy Paste: [[2504.02882]] DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models(https://arxiv.org/abs/2504.02882)
Keywords: large language model
Abstract: Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM's dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
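
For orientation, the sketch below computes the standard single-pair Direct Preference Optimization loss, not the specialized dialogue-control objective the abstract introduces. It assumes each argument is the summed log-probability of a whole trajectory (correct or incorrect dialogue flow) under the trained policy or the frozen reference model; the numbers and the `beta` value are arbitrary.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)

# Policy now prefers the correct dialogue flow more than the reference does.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)

print(f"neutral={neutral:.4f} improved={improved:.4f}")
```

The loss falls as the policy raises the likelihood of the chosen trajectory relative to the rejected one, measured against the reference model.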

Title: SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models

Authors: Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, Rahul Gupta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02883
Pdf URL: https://arxiv.org/pdf/2504.02883
Copy Paste: [[2504.02883]] SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models(https://arxiv.org/abs/2504.02883)
Keywords: large language model
Abstract: We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long form synthetic creative documents spanning different genres; (2) unlearn short form synthetic biographies containing personally identifiable information (PII), including fake names, phone numbers, SSNs, email and home addresses; and (3) unlearn real documents sampled from the target model's training dataset. We received over 100 submissions from over 30 institutions and we summarize the key techniques and lessons in this paper.

Title: Enhancing Traffic Sign Recognition On The Performance Based On Yolov8

Authors: Baba Ibrahim, Zhou Kui (Hubei University of Automotive Technology)
Subjects: cs.CV, cs.PF
Abstract URL: https://arxiv.org/abs/2504.02884
Pdf URL: https://arxiv.org/pdf/2504.02884
Copy Paste: [[2504.02884]] Enhancing Traffic Sign Recognition On The Performance Based On Yolov8(https://arxiv.org/abs/2504.02884)
Keywords: robust
Abstract: Traffic sign recognition plays a crucial role in the development of autonomous vehicles and advanced driver-assistance systems (ADAS). Despite significant advances in deep learning and object detection, accurately detecting and classifying traffic signs remains challenging due to their small sizes, variable environmental conditions, occlusion, and class imbalance. This thesis presents an enhanced YOLOv8-based detection system that integrates advanced data augmentation techniques, novel architectural enhancements including Coordinate Attention (CA), Bidirectional Feature Pyramid Network (BiFPN), and dynamic modules such as ODConv and LSKA, along with refined loss functions (EIoU and WIoU combined with Focal Loss). Extensive experiments conducted on datasets including GTSRB, TT100K, and GTSDB demonstrate marked improvements in detection accuracy, robustness under adverse conditions, and real-time inference on edge devices. The findings contribute actionable insights for deploying reliable traffic sign recognition systems in real-world autonomous driving scenarios.

Title: Processes Matter: How ML/GAI Approaches Could Support Open Qualitative Coding of Online Discourse Datasets

Authors: John Chen, Alexandros Lotsos, Grace Wang, Lexie Zhao, Bruce Sherin, Uri Wilensky, Michael Horn
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02887
Pdf URL: https://arxiv.org/pdf/2504.02887
Copy Paste: [[2504.02887]] Processes Matter: How ML/GAI Approaches Could Support Open Qualitative Coding of Online Discourse Datasets(https://arxiv.org/abs/2504.02887)
Keywords: generative
Abstract: Open coding, a key inductive step in qualitative research, discovers and constructs concepts from human datasets. However, capturing extensive and nuanced aspects or "coding moments" can be challenging, especially with large discourse datasets. While some studies explore machine learning (ML)/Generative AI (GAI)'s potential for open coding, few evaluation studies exist. We compare open coding results by five recently published ML/GAI approaches and four human coders, using a dataset of online chat messages around a mobile learning software. Our systematic analysis reveals ML/GAI approaches' strengths and weaknesses, uncovering the complementary potential between humans and AI. Line-by-line AI approaches effectively identify content-based codes, while humans excel in interpreting conversational dynamics. We discuss how embedded analytical processes could shape the results of ML/GAI approaches. Instead of replacing humans in open coding, researchers should integrate AI with and according to their analytical processes, e.g., as parallel co-coders.

Title: A Status Quo Investigation of Large Language Models towards Cost-Effective CFD Automation with OpenFOAMGPT: ChatGPT vs. Qwen vs. Deepseek

Authors: Wenkang Wang, Ran Xu, Jingsen Feng, Qingfu Zhang, Xu Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02888
Pdf URL: https://arxiv.org/pdf/2504.02888
Copy Paste: [[2504.02888]] A Status Quo Investigation of Large Language Models towards Cost-Effective CFD Automation with OpenFOAMGPT: ChatGPT vs. Qwen vs. Deepseek(https://arxiv.org/abs/2504.02888)
Keywords: large language model
Abstract: We evaluated the performance of OpenFOAMGPT incorporating multiple large-language models. Some of the present models efficiently manage different CFD tasks such as adjusting boundary conditions, turbulence models, and solver configurations, although their token cost and stability vary. Locally deployed smaller models like QwQ-32B struggled with generating valid solver files for complex processes. Zero-shot prompting commonly failed in simulations with intricate settings, even for large models. Challenges with boundary conditions and solver keywords stress the requirement for expert supervision, indicating that further development is needed to fully automate specialized CFD simulations.

Title: Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs

Authors: Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02890
Pdf URL: https://arxiv.org/pdf/2504.02890
Copy Paste: [[2504.02890]] Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs(https://arxiv.org/abs/2504.02890)
Keywords: large language model
Abstract: Recent advances in test-time compute scaling have enabled Large Language Models (LLMs) to tackle deep reasoning tasks by generating a chain-of-thought (CoT) that includes trial and error, backtracking, and intermediate reasoning steps before producing the final answer. However, these techniques have been applied predominantly to popular languages, such as English, leaving reasoning in low-resource languages underexplored and misaligned. In this work, we investigate the multilingual mechanism by which LLMs internally operate in a latent space biased toward their inherently dominant language. To leverage this phenomenon for low-resource languages, we train models to generate the CoT in English while outputting the final response in the target language, given input in the low-resource language. Our experiments demonstrate that this approach, named English-Pivoted CoT Training, outperforms other baselines, including training to generate both the CoT and the final response solely in the target language, with up to 28.33% improvement. Further analysis provides novel insights into the relationships between reasoning and multilinguality of LLMs, prompting for better approaches in developing multilingual large reasoning models.

Title: Automated Survey Collection with LLM-based Conversational Agents

Authors: Kurmanbek Kaiyrbekov, Nicholas J Dobbins, Sean D Mooney
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02891
Pdf URL: https://arxiv.org/pdf/2504.02891
Copy Paste: [[2504.02891]] Automated Survey Collection with LLM-based Conversational Agents(https://arxiv.org/abs/2504.02891)
Keywords: large language model
Abstract: Objective: Traditional phone-based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data; however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs). Materials and Methods: Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT-4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants consisting of 5 native and 3 non-native English speakers and administered 40 surveys. We evaluated the correctness of LLM-generated conversation transcripts, accuracy of survey responses inferred by GPT-4o and overall participant experience. Results: Survey responses were successfully extracted by GPT-4o from conversation transcripts with an average accuracy of 98% despite transcripts exhibiting an average per-line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction. Conclusions: Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real-world, end-to-end AI-powered phone survey collection systems.
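
For readers unfamiliar with the transcript metric mentioned above, word error rate (WER) is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below illustrates that standard definition and a per-line average; it is not the authors' evaluation code. The function names and the whitespace tokenization are assumptions, and the abstract does not say how text was normalized.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def mean_per_line_wer(line_pairs):
    """Average WER over (reference_line, hypothesis_line) pairs."""
    scores = [word_error_rate(ref, hyp) for ref, hyp in line_pairs]
    return sum(scores) / len(scores)
```

Averaging per line weights every line equally regardless of length, so it can differ from a WER computed over the whole transcript at once.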

Title: OnRL-RAG: Real-Time Personalized Mental Health Dialogue System

Authors: Ahsan Bilal, Beiyu Lin, Mehdi Zaeifi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02894
Pdf URL: https://arxiv.org/pdf/2504.02894
Copy Paste: [[2504.02894]] OnRL-RAG: Real-Time Personalized Mental Health Dialogue System(https://arxiv.org/abs/2504.02894)
Keywords: large language model
Abstract: Large language models (LLMs) have been widely used for various tasks and applications. However, LLMs and fine-tuning are limited to the pre-trained data. For example, ChatGPT's world knowledge until 2021 can be outdated or inaccurate. To enhance the capabilities of LLMs, Retrieval-Augmented Generation (RAG) has been proposed to augment LLMs with additional, up-to-date details and information. While RAG offers the correct information, it may not present it in the best way, especially when personalizing for different population groups. Reinforcement Learning from Human Feedback (RLHF) adapts to user needs by aligning model responses with human preference through feedback loops. In real-life applications, such as mental health problems, a dynamic and feedback-based model would continuously adapt to new information and offer personalized assistance due to complex factors fluctuating in a daily environment. Thus, we propose an Online Reinforcement Learning-based Retrieval-Augmented Generation (OnRL-RAG) system to detect and personalize the responding systems to mental health problems, such as stress, anxiety, and depression. We use an open-source dataset collected from 2028 college students with 28 survey questions for each student to compare the performance of our proposed system with existing systems. Our system achieves superior performance compared to standard RAG and simple LLM via GPT-4o, GPT-4o-mini, Gemini-1.5, and GPT-3.5. This work would open up the possibilities of real-life applications of LLMs for personalized services in the everyday environment. The results will also help researchers in the fields of sociology, psychology, and neuroscience to align their theories more closely with the actual human daily environment.

Title: UAC: Uncertainty-Aware Calibration of Neural Networks for Gesture Detection

Authors: Farida Al Haddad, Yuxin Wang, Malcolm Mielle
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02895
Pdf URL: https://arxiv.org/pdf/2504.02895
Copy Paste: [[2504.02895]] UAC: Uncertainty-Aware Calibration of Neural Networks for Gesture Detection(https://arxiv.org/abs/2504.02895)
Keywords: privacy, robust
Abstract: Artificial intelligence has the potential to impact safety and efficiency in safety-critical domains such as construction, manufacturing, and healthcare. For example, using sensor data from wearable devices, such as inertial measurement units (IMUs), human gestures can be detected while maintaining privacy, thereby ensuring that safety protocols are followed. However, strict safety requirements in these domains have limited the adoption of AI, since accurate calibration of predicted probabilities and robustness against out-of-distribution (OOD) data is necessary. This paper proposes UAC (Uncertainty-Aware Calibration), a novel two-step method to address these challenges in IMU-based gesture recognition. First, we present an uncertainty-aware gesture network architecture that predicts both gesture probabilities and their associated uncertainties from IMU data. This uncertainty is then used to calibrate the probabilities of each potential gesture. Second, an entropy-weighted expectation of predictions over multiple IMU data windows is used to improve accuracy while maintaining correct calibration. Our method is evaluated using three publicly available IMU datasets for gesture detection and is compared to three state-of-the-art calibration methods for neural networks: temperature scaling, entropy maximization, and Laplace approximation. UAC outperforms existing methods, achieving improved accuracy and calibration in both OOD and in-distribution scenarios. Moreover, we find that, unlike our method, none of the state-of-the-art methods significantly improve the calibration of IMU-based gesture recognition models. In conclusion, our work highlights the advantages of uncertainty-aware calibration of neural networks, demonstrating improvements in both calibration and accuracy for gesture detection using IMU data.
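
The second step above, an entropy-weighted expectation over several data windows, can be illustrated with a small sketch. The abstract does not give the weighting formula, so the choice below is purely an assumption for illustration: each window's probability vector is weighted by how far its entropy falls below the maximum possible entropy, so confident windows count more. The function names are hypothetical and this is not the UAC implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_average(window_probs):
    """Combine per-window class probabilities, favoring low-entropy windows.

    Assumed weighting (not from the paper): weight = log(K) - H(p),
    where K is the number of classes and H(p) is the window's entropy.
    """
    n_classes = len(window_probs[0])
    max_entropy = math.log(n_classes)
    weights = [max(max_entropy - entropy(p), 0.0) for p in window_probs]
    total = sum(weights)
    if total == 0.0:
        # Every window is maximally uncertain: fall back to a plain mean.
        weights = [1.0] * len(window_probs)
        total = float(len(window_probs))
    return [
        sum(w * p[k] for w, p in zip(weights, window_probs)) / total
        for k in range(n_classes)
    ]
```

Because the weights are normalized, the combined vector still sums to one whenever every input vector does.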

Title: A Practical Synthesis of Detecting AI-Generated Textual, Visual, and Audio Content

Authors: Lele Cao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02898
Pdf URL: https://arxiv.org/pdf/2504.02898
Copy Paste: [[2504.02898]] A Practical Synthesis of Detecting AI-Generated Textual, Visual, and Audio Content(https://arxiv.org/abs/2504.02898)
Keywords: security, robust, watermark, diffusion, generative, large language model
Abstract: Advances in AI-generated content have led to wide adoption of large language models, diffusion-based visual generators, and synthetic audio tools. However, these developments raise critical concerns about misinformation, copyright infringement, security threats, and the erosion of public trust. In this paper, we explore an extensive range of methods designed to detect and mitigate AI-generated textual, visual, and audio content. We begin by discussing motivations and potential impacts associated with AI-based content generation, including real-world risks and ethical dilemmas. We then outline detection techniques spanning observation-based strategies, linguistic and statistical analysis, model-based pipelines, watermarking and fingerprinting, as well as emergent ensemble approaches. We also present new perspectives on robustness, adaptation to rapidly improving generative architectures, and the critical role of human-in-the-loop verification. By surveying state-of-the-art research and highlighting case studies in academic, journalistic, legal, and industrial contexts, this paper aims to inform robust solutions and policymaking. We conclude by discussing open challenges, including adversarial transformations, domain generalization, and ethical concerns, thereby offering a holistic guide for researchers, practitioners, and regulators to preserve content authenticity in the face of increasingly sophisticated AI-generated media.

Title: Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives

Authors: Matheus Martins Batista
Subjects: cs.CV, cs.LG, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2504.02900
Pdf URL: https://arxiv.org/pdf/2504.02900
Copy Paste: [[2504.02900]] Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives(https://arxiv.org/abs/2504.02900)
Keywords: robust, transformer, generative
Abstract: The growing threat posed by deepfake videos, capable of manipulating realities and disseminating misinformation, drives the urgent need for effective detection methods. This work investigates and compares different approaches for identifying deepfakes, focusing on the GenConViT model and its performance relative to other architectures present in the DeepfakeBenchmark. To contextualize the research, the social and legal impacts of deepfakes are addressed, as well as the technical fundamentals of their creation and detection, including digital image processing, machine learning, and artificial neural networks, with emphasis on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. The performance evaluation of the models was conducted using relevant metrics and new datasets established in the literature, such as WildDeepfake and DeepSpeak, aiming to identify the most effective tools in the battle against misinformation and media manipulation. The obtained results indicated that GenConViT, after fine-tuning, exhibited superior performance in terms of accuracy (93.82%) and generalization capacity, surpassing other architectures in the DeepfakeBenchmark on the DeepSpeak dataset. This study contributes to the advancement of deepfake detection techniques, offering contributions to the development of more robust and effective solutions against the dissemination of false information.

Title: Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLM-Powered Assistance

Authors: Bo Yuan, Yulin Chen, Yin Zhang, Wei Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02901
Pdf URL: https://arxiv.org/pdf/2504.02901
Copy Paste: [[2504.02901]] Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLM-Powered Assistance(https://arxiv.org/abs/2504.02901)
Keywords: robust, large language model
Abstract: Learning from noisy labels (LNL) is a challenge that arises in many real-world scenarios where collected training data can contain incorrect or corrupted labels. Most existing solutions identify noisy labels and adopt active learning to query human experts on them for denoising. In the era of large language models (LLMs), although we can reduce the human effort to improve these methods, their performances are still subject to accurately separating the clean and noisy samples from noisy data. In this paper, we propose an innovative collaborative learning framework NoiseAL based on active learning to combine LLMs and small models (SMs) for learning from noisy labels. During collaborative training, we first adopt two SMs to form a co-prediction network and propose a dynamic-enhanced threshold strategy to divide the noisy data into different subsets, then select the clean and noisy samples from these subsets to feed the active annotator LLMs to rectify noisy samples. Finally, we employ different optimization objectives to conquer subsets with different degrees of label noises. Extensive experiments on synthetic and real-world noise datasets further demonstrate the superiority of our framework over state-of-the-art baselines.

Title: Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

Authors: Liangjie Huang, Dawei Li, Huan Liu, Lu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02902
Pdf URL: https://arxiv.org/pdf/2504.02902
Copy Paste: [[2504.02902]] Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models(https://arxiv.org/abs/2504.02902)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task performance, recent studies suggest that it may also introduce undesirable biases, most notably self-bias, or the tendency of LLMs to favor their own prior outputs. In this work, we extend this line of inquiry by investigating the impact on confidence estimation. We evaluate three representative self-improvement paradigms: basic prompting, Chain-of-Thought (CoT) prompting, and tuning-based methods. We find that iterative self-improvement can lead to systematic overconfidence, as evidenced by a steadily increasing Expected Calibration Error (ECE) and lower accuracy with high confidence. We then further explore the integration of confidence calibration techniques with self-improvement. Specifically, we compare three strategies: (1) applying calibration after multiple rounds of self-improvement, (2) calibrating before self-improvement, and (3) applying calibration iteratively at each self-improvement step. Our results show that iterative calibration is most effective in reducing ECE, yielding improved calibration. Our work pioneers the study of self-improving LLMs from a calibration perspective, offering valuable insights into balancing model performance and reliability.
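
As background for the metric named above, Expected Calibration Error is commonly computed by grouping predictions into confidence bins and taking the bin-size-weighted average of the gap between accuracy and mean confidence in each bin. The sketch below illustrates that common definition only; it is not the authors' evaluation code. The function name, the equal-width bins, and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    # Equal-width bins [k/n, (k+1)/n); a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if is_correct else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Under this definition, a model that is 90% confident on four answers but gets only three right has an ECE of about 0.15, which is the kind of overconfidence gap the abstract describes.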

Title: How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Authors: Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02904
Pdf URL: https://arxiv.org/pdf/2504.02904
Copy Paste: [[2504.02904]] How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence(https://arxiv.org/abs/2504.02904)
Keywords: interpretability, large language model
Abstract: Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training.

Title: Enhancing Chart-to-Code Generation in Multimodal Large Language Models via Iterative Dual Preference Learning

Authors: Zhihan Zhang, Yixin Cao, Lizi Liao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02906
Pdf URL: https://arxiv.org/pdf/2504.02906
Copy Paste: [[2504.02906]] Enhancing Chart-to-Code Generation in Multimodal Large Language Models via Iterative Dual Preference Learning(https://arxiv.org/abs/2504.02906)
Keywords: large language model
Abstract: Chart-to-code generation, the process of converting chart images into executable plotting scripts, provides a lossless representation of chart information, requiring models to accurately capture and summarize all visual and structural elements. However, this remains a significant challenge for multimodal large language models (MLLMs), which are not inherently well-aligned with code generation tasks. To bridge this gap, we introduce Chart2Code, a novel iterative dual preference learning framework designed to enhance MLLMs' chart-to-code generation capabilities through structured code variant generation and fine-grained dual reward signals. We validate Chart2Code across three MLLMs and find that iterative preference learning consistently improves out-of-distribution chart-to-code generation quality. Throughout this process, our dual scoring method, which evaluates both the textual code structure and its visual representation, leads to greater performance improvements, even with a reduced preference dataset size. Further analysis explores the key components of our framework and highlights the interplay between chart-to-code generation and broader chart reasoning, paving the way for future advancements in chart comprehension.

Title: Noiser: Bounded Input Perturbations for Attributing Large Language Models

Authors: Mohammad Reza Ghasemi Madani, Aryo Pradipta Gema, Gabriele Sarti, Yu Zhao, Pasquale Minervini, Andrea Passerini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02911
Pdf URL: https://arxiv.org/pdf/2504.02911
Copy Paste: [[2504.02911]] Noiser: Bounded Input Perturbations for Attributing Large Language Models(https://arxiv.org/abs/2504.02911)
Keywords: robust, large language model
Abstract: Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Accordingly, generating faithful attributions that reflect the actual inner behavior of the model is crucial. In this paper, we introduce Noiser, a perturbation-based FA method that imposes bounded noise on each input embedding and measures the robustness of the model against partially noised input to obtain the input attributions. Additionally, we propose an answerability metric that employs an instructed judge model to assess the extent to which highly scored tokens suffice to recover the predicted output. Through a comprehensive evaluation across six LLMs and three tasks, we demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability, making it a robust and effective approach for explaining language model predictions.

Title: Haphazard Inputs as Images in Online Learning

Authors: Rohit Agarwal, Aryan Dessai, Arif Ahmed Sekh, Krishna Agarwal, Alexander Horsch, Dilip K. Prasad
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02912
Pdf URL: https://arxiv.org/pdf/2504.02912
Copy Paste: [[2504.02912]] Haphazard Inputs as Images in Online Learning(https://arxiv.org/abs/2504.02912)
Keywords: robust
Abstract: The field of varying feature space in online learning settings, also known as haphazard inputs, is very prominent nowadays due to its applicability in various fields. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based models to be applicable for haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at this https URL.

Title: Bias in Large Language Models Across Clinical Applications: A Systematic Review

Authors: Thanathip Suenghataiphorn, Narisara Tribuddharat, Pojsakorn Danpanichkul, Narathorn Kulthamrongsri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02917
Pdf URL: https://arxiv.org/pdf/2504.02917
Copy Paste: [[2504.02917]] Bias in Large Language Models Across Clinical Applications: A Systematic Review(https://arxiv.org/abs/2504.02917)
Keywords: large language model
Abstract: Background: Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks. However, concerns exist regarding their potential for bias, which could compromise patient care and exacerbate health inequities. This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs. Methods: We conducted a systematic search of PubMed, OVID, and EMBASE from database inception through 2025, for studies evaluating bias in LLMs applied to clinical tasks. We extracted data on LLM type, bias source, bias manifestation, affected attributes, clinical task, evaluation methods, and outcomes. Risk of bias was assessed using a modified ROBINS-I tool. Results: Thirty-eight studies met inclusion criteria, revealing pervasive bias across various LLMs and clinical applications. Both data-related bias (from biased training data) and model-related bias (from model training) were significant contributors. Biases manifested as: allocative harm (e.g., differential treatment recommendations); representational harm (e.g., stereotypical associations, biased image generation); and performance disparities (e.g., variable output quality). These biases affected multiple attributes, most frequently race/ethnicity and gender, but also age, disability, and language. Conclusions: Bias in clinical LLMs is a pervasive and systemic issue, with a potential to lead to misdiagnosis and inappropriate treatment, particularly for marginalized patient populations. Rigorous evaluation of the model is crucial. Furthermore, the development and implementation of effective mitigation strategies, coupled with continuous monitoring in real-world clinical settings, are essential to ensure the safe, equitable, and trustworthy deployment of LLMs in healthcare.

Title: Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

Authors: Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck W. E. Prinzhorn, Mark Bodracska, Nicu Sebe, Efstratios Gavves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.02918
Pdf URL: https://arxiv.org/pdf/2504.02918
Copy Paste: [[2504.02918]] Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments(https://arxiv.org/abs/2504.02918)
Keywords: generative
Abstract: Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics-informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open-sourced at our project page.

Title: HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse

Authors: Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02921
Pdf URL: https://arxiv.org/pdf/2504.02921
Copy Paste: [[2504.02921]] HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse(https://arxiv.org/abs/2504.02921)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2-3x throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with traditional RAG service.

Title: Robustly identifying concepts introduced during chat fine-tuning using crosscoders

Authors: Julian Minder, Clement Dumas, Caden Juang, Bilal Chugtai, Neel Nanda
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.02922
Pdf URL: https://arxiv.org/pdf/2504.02922
Copy Paste: [[2504.02922]] Robustly identifying concepts introduced during chat fine-tuning using crosscoders(https://arxiv.org/abs/2504.02922)
Keywords: robust
Abstract: Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoder's L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of genuinely chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat tuning modifies language model behavior.

Title: Graph Attention for Heterogeneous Graphs with Positional Encoding

Authors: Nikhil Shivakumar Nayak
Subjects: cs.LG, cs.AI, cs.DM, math.DG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.02938
Pdf URL: https://arxiv.org/pdf/2504.02938
Copy Paste: [[2504.02938]] Graph Attention for Heterogeneous Graphs with Positional Encoding(https://arxiv.org/abs/2504.02938)
Keywords: transformer
Abstract: Graph Neural Networks (GNNs) have emerged as the de facto standard for modeling graph data, with attention mechanisms and transformers significantly enhancing their performance on graph-based tasks. Despite these advancements, the performance of GNNs on heterogeneous graphs often remains complex, with networks generally underperforming compared to their homogeneous counterparts. This work benchmarks various GNN architectures to identify the most effective methods for heterogeneous graphs, with a particular focus on node classification and link prediction. Our findings reveal that graph attention networks excel in these tasks. As a main contribution, we explore enhancements to these attention networks by integrating positional encodings for node embeddings. This involves utilizing the full Laplacian spectrum to accurately capture both the relative and absolute positions of each node within the graph, further enhancing performance on downstream tasks such as node classification and link prediction.

Title: VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Dongchao Yang, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02949
Pdf URL: https://arxiv.org/pdf/2504.02949
Copy Paste: [[2504.02949]] VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning(https://arxiv.org/abs/2504.02949)
Keywords: generative, large language model
Abstract: In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available.

Title: Cultural Learning-Based Culture Adaptation of Language Models

Authors: Chen Cecilia Liu, Anna Korhonen, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02953
Pdf URL: https://arxiv.org/pdf/2504.02953
Copy Paste: [[2504.02953]] Cultural Learning-Based Culture Adaptation of Language Models(https://arxiv.org/abs/2504.02953)
Keywords: large language model
Abstract: Adapting large language models (LLMs) to diverse cultural values is a challenging task, as existing LLMs often reflect the values of specific groups by default, potentially causing harm to others. In this paper, we present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning. The framework leverages simulated social interactions to generate conversations in which LLMs engage in role-playing within culturally adapted social scenarios, capturing implicit cultural norms for model fine-tuning. CLCA improves cultural value alignment across various model architectures measured using World Value Survey data, demonstrating the effectiveness of our proposed approach. Our results provide early evidence that understanding intent and social interactions can enhance cultural value adaptation in LLMs, highlighting the promise of training approaches based on cultural learning.

Title: Digital Forensics in the Age of Large Language Models

Authors: Zhipeng Yin, Zichong Wang, Weifeng Xu, Jun Zhuang, Pallab Mozumder, Antoinette Smith, Wenbin Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02963
Pdf URL: https://arxiv.org/pdf/2504.02963
Copy Paste: [[2504.02963]] Digital Forensics in the Age of Large Language Models(https://arxiv.org/abs/2504.02963)
Keywords: robust, interpretability, large language model
Abstract: Digital forensics plays a pivotal role in modern investigative processes, utilizing specialized methods to systematically collect, analyze, and interpret digital evidence for judicial proceedings. However, traditional digital forensic techniques are primarily based on manual, labor-intensive processes, which become increasingly insufficient with the rapid growth and complexity of digital data. To this end, Large Language Models (LLMs) have emerged as powerful tools capable of automating and enhancing various digital forensic tasks, significantly transforming the field. Despite the strides made, general practitioners and forensic experts often lack a comprehensive understanding of the capabilities, principles, and limitations of LLMs, which limits the full potential of LLMs in forensic applications. To fill this gap, this paper aims to provide an accessible and systematic overview of how LLMs have revolutionized the digital forensics approach. Specifically, it takes a look at the basic concepts of digital forensics, as well as the evolution of LLMs, and emphasizes the superior capabilities of LLMs. To connect theory and practice, relevant examples and real-world scenarios are discussed. We also critically analyze the current limitations of applying LLMs to digital forensics, including issues related to illusion, interpretability, bias, and ethical considerations. In addition, this paper outlines the prospects for future research, highlighting the need for effective use of LLMs for transparency, accountability, and robust standardization in the forensic process.

Title: QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding

Authors: Binh M. Le, Shaoyuan Xu, Jinmiao Fu, Zhishen Huang, Moyan Li, Yanhui Guo, Hongdong Li, Sameera Ramasinghe, Bryan Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.02971
Pdf URL: https://arxiv.org/pdf/2504.02971
Copy Paste: [[2504.02971]] QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding(https://arxiv.org/abs/2504.02971)
Keywords: robust
Abstract: In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) with new datasets often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that directly inject queries into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, as well as a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements using our method, especially in handling text-rich documents in data-scarce environments.

Title: Localized Definitions and Distributed Reasoning: A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching

Authors: Nooshin Bahador
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02976
Pdf URL: https://arxiv.org/pdf/2504.02976
Copy Paste: [[2504.02976]] Localized Definitions and Distributed Reasoning: A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching(https://arxiv.org/abs/2504.02976)
Keywords: interpretability
Abstract: This study investigates the localization of knowledge representation in fine-tuned GPT-2 models using Causal Layer Attribution via Activation Patching (CLAP), a method that identifies critical neural layers responsible for correct answer generation. The model was fine-tuned on 9,958 PubMed abstracts (epilepsy: 20,595 mentions, EEG: 11,674 mentions, seizure: 13,921 mentions) using two configurations with validation loss monitoring for early stopping. CLAP involved (1) caching clean (correct answer) and corrupted (incorrect answer) activations, (2) computing logit difference to quantify model preference, and (3) patching corrupted activations with clean ones to assess recovery. Results revealed three findings: First, patching the first feedforward layer recovered 56% of correct preference, demonstrating that associative knowledge is distributed across multiple layers. Second, patching the final output layer completely restored accuracy (100% recovery), indicating that definitional knowledge is localized. The stronger clean logit difference for definitional questions further supports this localized representation. Third, minimal recovery from convolutional layer patching (13.6%) suggests low-level features contribute marginally to high-level reasoning. Statistical analysis confirmed significant layer-specific effects (p<0.01). These findings demonstrate that factual knowledge is more localized and associative knowledge depends on distributed representations. We also showed that editing efficacy depends on task type. Our findings not only reconcile conflicting observations about localization in model editing but also emphasize using task-adaptive techniques for reliable, interpretable updates.
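
Steps (2) and (3) of the procedure above can be sketched with two small helpers. The normalization used here, the share of the clean-versus-corrupted logit-difference gap that patching restores, is a common convention in activation-patching work and is an assumption; the abstract does not state the exact formula. The function names and the two-token logits are hypothetical.

```python
def logit_difference(logits, correct_id, incorrect_id):
    """Preference for the correct token over the incorrect one."""
    return logits[correct_id] - logits[incorrect_id]


def recovery_fraction(clean_ld, corrupted_ld, patched_ld):
    """Share of the clean-vs-corrupted gap restored by patching (assumed metric)."""
    gap = clean_ld - corrupted_ld
    if gap == 0:
        raise ValueError("clean and corrupted runs have the same logit difference")
    return (patched_ld - corrupted_ld) / gap


# Hypothetical two-token logits: index 0 is the correct answer, index 1 the incorrect one.
clean_logits = [2.0, 0.0]
corrupted_logits = [0.0, 1.0]
patched_logits = [1.0, 0.5]  # corrupted run with one layer's activations replaced

clean_ld = logit_difference(clean_logits, 0, 1)          # 2.0
corrupted_ld = logit_difference(corrupted_logits, 0, 1)  # -1.0
patched_ld = logit_difference(patched_logits, 0, 1)      # 0.5
recovery = recovery_fraction(clean_ld, corrupted_ld, patched_ld)
print(f"recovery: {recovery:.0%}")  # prints "recovery: 50%"
```

Under this convention a value of 1.0 means patching a layer fully restores the clean preference, and a value near 0 means the layer contributes little.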

Title: Multi-Screaming-Channel Attacks: Frequency Diversity for Enhanced Attacks

Authors: Jeremy Guillaume, Maxime Pelcat, Amor Nafkha, Rubén Salvador
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.02979
Pdf URL: https://arxiv.org/pdf/2504.02979
Copy Paste: [[2504.02979]] Multi-Screaming-Channel Attacks: Frequency Diversity for Enhanced Attacks(https://arxiv.org/abs/2504.02979)
Keywords: attack
Abstract: Side-channel attacks consist of retrieving internal data from a victim system by analyzing its leakage, which usually requires proximity to the victim in the range of a few millimeters. Screaming channels are EM side channels transmitted at a distance of a few meters. They appear on mixed-signal devices integrating an RF module on the same silicon die as the digital part. Consequently, the side channels are modulated by legitimate RF signal carriers and appear at the harmonics of the digital clock frequency. While initial works have only considered collecting leakage at these harmonics, later work has demonstrated that the leakage is also present at frequencies other than these harmonics. This result significantly increases the number of available frequencies to perform a screaming-channel attack, which can be convenient in an environment where multiple harmonics are polluted. This work studies how this diversity of frequencies carrying leakage can be used to improve attack performance. We first study how to combine multiple frequencies. Second, we demonstrate that frequency combination can improve attack performance and evaluate this improvement according to the performance of the combined frequencies. Finally, we demonstrate the value of frequency combination in attacks at 15 meters and, for the first time to the best of our knowledge, at 30 meters. One last important observation is that this frequency combination halves the number of traces needed to reach a given attack performance.

Title: Hummus: A Dataset of Humorous Multimodal Metaphor Use

Authors: Xiaoyu Tong, Zhi Zhang, Martha Lewis, Ekaterina Shutova
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2504.02983
Pdf URL: https://arxiv.org/pdf/2504.02983
Copy Paste: [[2504.02983]] Hummus: A Dataset of Humorous Multimodal Metaphor Use(https://arxiv.org/abs/2504.02983)
Keywords: large language model
Abstract: Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus to develop a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code.

Title: Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization

Authors: Siqi Wang, Aoming Liu, Bryan A. Plummer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2504.02996
Pdf URL: https://arxiv.org/pdf/2504.02996
Copy Paste: [[2504.02996]] Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization(https://arxiv.org/abs/2504.02996)
Keywords: robust
Abstract: Multi-source Domain Generalization (DG) aims to improve model robustness to new distributions. However, DG methods often overlook the effect of label noise, which can confuse a model during training, reducing performance. Limited prior work has analyzed the noise robustness of DG methods, typically focusing on existing methods rather than new solutions. In this paper, we investigate this underexplored space, where models are evaluated under both distribution shifts and label noise, which we refer to as Noise-Aware Generalization (NAG). A natural solution to address label noise would be to combine a Learning with Noisy Labels (LNL) method with those from DG. Many LNL methods aim to detect distribution shifts in a class's samples, i.e., they assume that distribution shifts often correspond to label noise. However, in NAG, distribution shifts can be due to label noise or domain shifts, breaking the assumptions used by LNL methods. A naive solution is to make an assumption similar to that made by many DG methods, where we presume to have domain labels during training, enabling us to isolate the two types of shifts. However, this ignores valuable cross-domain information. Specifically, our proposed DL4ND approach improves noise detection by taking advantage of the observation that noisy samples that may appear indistinguishable within a single domain often show greater variation when compared across domains. Experiments show that DL4ND significantly improves performance across four diverse datasets, offering a promising direction for tackling NAG.

Title: Improving Efficiency in Federated Learning with Optimized Homomorphic Encryption

Authors: Feiran Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.03002
Pdf URL: https://arxiv.org/pdf/2504.03002
Copy Paste: [[2504.03002]] Improving Efficiency in Federated Learning with Optimized Homomorphic Encryption(https://arxiv.org/abs/2504.03002)
Keywords: privacy, robust, federate
Abstract: Federated learning (FL) is a method used in machine learning to allow multiple devices to work together on a model without sharing their private data. Each participant keeps their private data on their own system, trains a local model, and sends only updates to a central server, which combines these updates to improve the overall model. A key enabler of privacy in FL is homomorphic encryption (HE), which allows computations to be performed directly on encrypted data. While HE offers strong privacy guarantees, it is computationally intensive, leading to significant latency and scalability issues, particularly for large-scale models like BERT. My research introduces a novel algorithm that addresses this inefficiency while maintaining robust privacy guarantees. I integrated several mathematical techniques, such as selective parameter encryption, sensitivity maps, and differential privacy noise, within the algorithm to improve its efficiency. I have also conducted rigorous mathematical proofs to validate the correctness and robustness of the approach. I implemented this algorithm in C++, simulated a federated learning environment on large-scale models, and verified that my algorithm is $3$ times as efficient as the state-of-the-art method. This research has significant implications for machine learning: by improving efficiency while preserving privacy, it offers a practical solution to one of the key challenges in federated learning, the inefficiency of homomorphic encryption. It would enable federated learning to be deployed efficiently in various resource-constrained environments, since the new algorithm enhances the scalability and resource efficiency of FL while maintaining robust privacy guarantees.

Title: DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery

Authors: Jing Gao, Ce Zheng, Laszlo A. Jeni, Zackory Erickson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03006
Pdf URL: https://arxiv.org/pdf/2504.03006
Copy Paste: [[2504.03006]] DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery(https://arxiv.org/abs/2504.03006)
Keywords: privacy, robust, diffusion
Abstract: In-bed human mesh recovery can be crucial and enabling for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic data and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.

Title: Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization

Authors: Junying Wang, Jingyuan Liu, Xin Sun, Krishna Kumar Singh, Zhixin Shu, He Zhang, Jimei Yang, Nanxuan Zhao, Tuanfeng Y. Wang, Simon S. Chen, Ulrich Neumann, Jae Shin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03011
Pdf URL: https://arxiv.org/pdf/2504.03011
Copy Paste: [[2504.03011]] Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization(https://arxiv.org/abs/2504.03011)
Keywords: diffusion
Abstract: This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. At inference time, our temporal lighting module is combined with the diffusion models through spatio-temporal feature blending algorithms without extra training, and we apply a new guided refinement as a post-processing step to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows strong generalizability and temporal lighting coherence, outperforming existing image-based human relighting and harmonization methods.

Title: Deep Reinforcement Learning via Object-Centric Attention

Authors: Jannis Blüml, Cedric Derstroff, Bjarne Gregori, Elisabeth Dillies, Quentin Delfosse, Kristian Kersting
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03024
Pdf URL: https://arxiv.org/pdf/2504.03024
Copy Paste: [[2504.03024]] Deep Reinforcement Learning via Object-Centric Attention(https://arxiv.org/abs/2504.03024)
Keywords: robust, extraction
Abstract: Deep reinforcement learning agents, trained on raw pixel inputs, often fail to generalize beyond their training environments, relying on spurious correlations and irrelevant background details. To address this issue, object-centric agents have recently emerged. However, they require different representations tailored to the task specifications. Contrary to deep agents, no single object-centric architecture can be applied to any environment. Inspired by principles of cognitive science and Occam's Razor, we introduce Object-Centric Attention via Masking (OCCAM), which selectively preserves task-relevant entities while filtering out irrelevant visual information. Specifically, OCCAM takes advantage of the object-centric inductive bias. Empirical evaluations on Atari benchmarks demonstrate that OCCAM significantly improves robustness to novel perturbations and reduces sample complexity while showing similar or improved performance compared to conventional pixel-based RL. These results suggest that structured abstraction can enhance generalization without requiring explicit symbolic representations or domain-specific object extraction pipelines.

Title: VIP: Video Inpainting Pipeline for Real World Human Removal

Authors: Huiming Sun, Yikang Li, Kangning Yang, Ruineng Li, Daitao Xing, Yangbo Xie, Lan Fu, Kaiyu Zhang, Ming Chen, Jiaming Ding, Jiang Geng, Jie Cai, Zibo Meng, Chiuman Ho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03041
Pdf URL: https://arxiv.org/pdf/2504.03041
Copy Paste: [[2504.03041]] VIP: Video Inpainting Pipeline for Real World Human Removal(https://arxiv.org/abs/2504.03041)
Keywords: segmentation
Abstract: Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation for precise mask generation. Extensive experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.

Title: Sliced Wasserstein Discrepancy in Disentangling Representation and Adaptation Networks for Unsupervised Domain Adaptation

Authors: Joel Sol, Shadi Alijani, Homayoun Najjaran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03043
Pdf URL: https://arxiv.org/pdf/2504.03043
Copy Paste: [[2504.03043]] Sliced Wasserstein Discrepancy in Disentangling Representation and Adaptation Networks for Unsupervised Domain Adaptation(https://arxiv.org/abs/2504.03043)
Keywords: robust, segmentation
Abstract: This paper introduces DRANet-SWD, an extension of existing work that disentangles content and style representations of images for unsupervised domain adaptation (UDA). The approach builds upon DRANet by incorporating the sliced Wasserstein discrepancy (SWD) as a style loss instead of the traditional Gram matrix loss. The potential advantages of SWD over the Gram matrix loss for capturing style variations in domain adaptation are investigated. Experiments using digit classification datasets and driving scenario segmentation validate the method, demonstrating that DRANet-SWD enhances performance. Results indicate that SWD provides a more robust statistical comparison of feature distributions, leading to better style adaptation. These findings highlight the effectiveness of SWD in refining feature alignment and improving domain adaptation tasks across these benchmarks. Our code is publicly available.
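
To make the idea of a sliced Wasserstein comparison concrete, here is a minimal NumPy sketch of a generic Monte Carlo estimator: both feature sets are projected onto random unit directions, and the sorted one-dimensional projections are compared. This is not DRANet-SWD's style loss; the function name, the squared 2-Wasserstein variant, the equal-sample-count restriction, and the synthetic features are all assumptions made for illustration.

```python
import numpy as np


def sliced_wasserstein_sq(x, y, n_projections=128, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n_samples, n_features) with equal n_samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized sample sets")
    rng = np.random.default_rng(seed)
    # Random unit directions, one per column.
    directions = rng.normal(size=(x.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # In 1-D, optimal transport pairs sorted samples with each other.
    proj_x = np.sort(x @ directions, axis=0)
    proj_y = np.sort(y @ directions, axis=0)
    return float(np.mean((proj_x - proj_y) ** 2))


# Synthetic features (assumed data): a set compared with itself and with a shifted copy.
rng = np.random.default_rng(1)
features_a = rng.normal(size=(256, 8))
features_b = features_a + 3.0
same = sliced_wasserstein_sq(features_a, features_a)
shifted = sliced_wasserstein_sq(features_a, features_b)
print(same, shifted)
```

Identical sets give zero, and the shifted copy gives a positive value. Unlike a Gram matrix comparison, which matches second-order feature statistics, this estimator compares the full projected distributions.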

Title: Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing

Authors: Antonio Castaldo, Sheila Castilho, Joss Moorkens, Johanna Monti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03045
Pdf URL: https://arxiv.org/pdf/2504.03045
Copy Paste: [[2504.03045]] Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing(https://arxiv.org/abs/2504.03045)
Keywords: large language model
Abstract: Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing LLM-generated translations significantly reduces editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators working with high-resource languages.

Title: Attention-Aware Multi-View Pedestrian Tracking

Authors: Reef Alturki, Adrian Hilton, Jean-Yves Guillemaut
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03047
Pdf URL: https://arxiv.org/pdf/2504.03047
Copy Paste: [[2504.03047]] Attention-Aware Multi-View Pedestrian Tracking(https://arxiv.org/abs/2504.03047)
Keywords: robust
Abstract: In spite of the recent advancements in multi-object tracking, occlusion poses a significant challenge. Multi-camera setups have been used to address this challenge by providing comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, projecting feature maps of all views to a common ground plane or the Bird's Eye View (BEV), and then performing detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation results in significant distortion on the ground plane, affecting the robustness of the appearance features of the pedestrians. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model utilizes an early-fusion strategy for detection, and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of $96.1\%$ on the Wildtrack dataset and $85.7\%$ on the MultiviewX dataset.

Title: Task as Context Prompting for Accurate Medical Symptom Coding Using Large Language Models

Authors: Chengyang He, Wenlong Zhang, Violet Xinying Chen, Yue Ning, Ping Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03051
Pdf URL: https://arxiv.org/pdf/2504.03051
Copy Paste: [[2504.03051]] Task as Context Prompting for Accurate Medical Symptom Coding Using Large Language Models(https://arxiv.org/abs/2504.03051)
Keywords: extraction, large language model
Abstract: Accurate medical symptom coding from unstructured clinical text, such as vaccine safety reports, is a critical task with applications in pharmacovigilance and safety monitoring. Symptom coding, as tailored in this study, involves identifying and linking nuanced symptom mentions to standardized vocabularies like MedDRA, differentiating it from broader medical coding tasks. Traditional approaches to this task, which treat symptom extraction and linking as independent workflows, often fail to handle the variability and complexity of clinical narratives, especially for rare cases. Recent advancements in Large Language Models (LLMs) offer new opportunities but face challenges in achieving consistent performance. To address these issues, we propose Task as Context (TACO) Prompting, a novel framework that unifies extraction and linking tasks by embedding task-specific context into LLM prompts. Our study also introduces SYMPCODER, a human-annotated dataset derived from Vaccine Adverse Event Reporting System (VAERS) reports, and a two-stage evaluation framework to comprehensively assess both symptom linking and mention fidelity. Our comprehensive evaluation of multiple LLMs, including Llama2-chat, Jackalope-7b, GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o, demonstrates TACO's effectiveness in improving flexibility and accuracy for tailored tasks like symptom coding, paving the way for more specific coding tasks and advancing clinical text processing methodologies.

Title: AD-GPT: Large Language Models in Alzheimer's Disease

Authors: Ziyu Liu, Lintao Tang, Zeliang Sun, Zhengliang Liu, Yanjun Lyu, Wei Ruan, Yangshuang Xu, Liang Shan, Jiyoon Shin, Xiaohe Chen, Dajiang Zhu, Tianming Liu, Rongjie Liu, Chao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03071
Pdf URL: https://arxiv.org/pdf/2504.03071
Copy Paste: [[2504.03071]] AD-GPT: Large Language Models in Alzheimer's Disease(https://arxiv.org/abs/2504.03071)
Keywords: robust, transformer, generative, large language model
Abstract: Large language models (LLMs) have emerged as powerful tools for medical information retrieval, yet their accuracy and depth remain limited in specialized domains such as Alzheimer's disease (AD), a growing global health challenge. To address this gap, we introduce AD-GPT, a domain-specific generative pre-trained transformer designed to enhance the retrieval and analysis of AD-related genetic and neurobiological information. AD-GPT integrates diverse biomedical data sources, including potential AD-associated genes, molecular genetic information, and key gene variants linked to brain regions. We develop a stacked LLM architecture combining Llama3 and BERT, optimized for four critical tasks in AD research: (1) genetic information retrieval, (2) gene-brain region relationship assessment, (3) gene-AD relationship analysis, and (4) brain region-AD relationship mapping. Comparative evaluations against state-of-the-art LLMs demonstrate AD-GPT's superior precision and reliability across these tasks, underscoring its potential as a robust and specialized AI tool for advancing AD research and biomarker discovery.

Title: How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models

Authors: Pascal Chang, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03072
Pdf URL: https://arxiv.org/pdf/2504.03072
Copy Paste: [[2504.03072]] How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models(https://arxiv.org/abs/2504.03072)
Keywords: diffusion
Abstract: Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed $\int$-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses $\int$-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed $\int$-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation. See this https URL for video results.

Title: Integrating Identity-Based Identification against Adaptive Adversaries in Federated Learning

Authors: Jakub Kacper Szelag, Ji-Jian Chin, Lauren Ansell, Sook-Chin Yip
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03077
Pdf URL: https://arxiv.org/pdf/2504.03077
Copy Paste: [[2504.03077]] Integrating Identity-Based Identification against Adaptive Adversaries in Federated Learning(https://arxiv.org/abs/2504.03077)
Keywords: secure, security, privacy, attack, robust, federate
Abstract: Federated Learning (FL) has recently emerged as a promising paradigm for privacy-preserving, distributed machine learning. However, FL systems face significant security threats, particularly from adaptive adversaries capable of modifying their attack strategies to evade detection. One such threat is the presence of Reconnecting Malicious Clients (RMCs), which exploit FL's open connectivity by reconnecting to the system with modified attack strategies. To address this vulnerability, we propose integration of Identity-Based Identification (IBI) as a security measure within FL environments. By leveraging IBI, we enable FL systems to authenticate clients based on cryptographic identity schemes, effectively preventing previously disconnected malicious clients from re-entering the system. Our approach is implemented using the TNC-IBI (Tan-Ng-Chin) scheme over elliptic curves to ensure computational efficiency, particularly in resource-constrained environments like Internet of Things (IoT). Experimental results demonstrate that integrating IBI with secure aggregation algorithms, such as Krum and Trimmed Mean, significantly improves FL robustness by mitigating the impact of RMCs. We further discuss the broader implications of IBI in FL security, highlighting research directions for adaptive adversary detection, reputation-based mechanisms, and the applicability of identity-based cryptographic frameworks in decentralized FL architectures. Our findings advocate for a holistic approach to FL security, emphasizing the necessity of proactive defence strategies against evolving adaptive adversarial threats.
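
The abstract above pairs client authentication with robust aggregation rules such as Trimmed Mean. As a rough illustration only, the sketch below shows a coordinate-wise trimmed mean preceded by an admission filter. The names `trimmed_mean`, `aggregate_round`, and `banned_ids` are hypothetical, and the plain set lookup is a stand-in for the cryptographic TNC-IBI check, which is not reproduced here.

```python
from typing import Dict, List, Sequence, Set


def trimmed_mean(updates: Sequence[Sequence[float]], trim: int) -> List[float]:
    """Coordinate-wise trimmed mean of client updates.

    For every coordinate, the `trim` smallest and `trim` largest values
    are discarded and the remaining values are averaged.
    """
    n = len(updates)
    if n == 0:
        raise ValueError("need at least one update")
    if trim < 0 or 2 * trim >= n:
        raise ValueError("trim must satisfy 0 <= 2 * trim < number of updates")
    dim = len(updates[0])
    if any(len(u) != dim for u in updates):
        raise ValueError("all updates must have the same length")

    aggregated = []
    for j in range(dim):
        column = sorted(u[j] for u in updates)
        kept = column[trim:n - trim]
        aggregated.append(sum(kept) / len(kept))
    return aggregated


def aggregate_round(
    updates_by_client: Dict[str, Sequence[float]],
    banned_ids: Set[str],
    trim: int,
) -> List[float]:
    """Drop updates from banned identities, then aggregate the rest.

    Assumption: `banned_ids` is a plain set of identity strings. A real
    identity-based identification scheme would verify a cryptographic
    proof of identity instead of trusting a client-supplied string.
    """
    admitted = [
        update
        for client_id, update in sorted(updates_by_client.items())
        if client_id not in banned_ids
    ]
    return trimmed_mean(admitted, trim)
```

With four clients and `trim=1`, a single extreme value per coordinate is discarded before averaging, so one outlier cannot drag the aggregate arbitrarily far.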

Title: SLACK: Attacking LiDAR-based SLAM with Adversarial Point Injections

Authors: Prashant Kumar, Dheeraj Vattikonda, Kshitij Madhav Bhat, Kunal Dargan, Prem Kalra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03089
Pdf URL: https://arxiv.org/pdf/2504.03089
Copy Paste: [[2504.03089]] SLACK: Attacking LiDAR-based SLAM with Adversarial Point Injections(https://arxiv.org/abs/2504.03089)
Keywords: security, attack, generative, segmentation
Abstract: The widespread adoption of learning-based methods for LiDAR makes autonomous vehicles vulnerable to adversarial attacks through adversarial point injections (PiJ). This poses serious security challenges for navigation and map generation. Despite its critical nature, no major work exists that studies learning-based attacks on LiDAR-based SLAM. Our work proposes SLACK, an end-to-end deep generative adversarial model to attack LiDAR scans with several point injections without deteriorating LiDAR quality. To facilitate SLACK, we design a novel yet simple autoencoder that augments contrastive learning with segmentation-based attention for precise reconstructions. SLACK demonstrates superior performance on the task of point injections (PiJ) compared to the best baselines on the KITTI and CARLA-64 datasets while maintaining accurate scan quality. We qualitatively and quantitatively demonstrate PiJ attacks using a fraction of LiDAR points. These attacks severely degrade navigation and map quality without deteriorating the LiDAR scan quality.

Title: Machine Learning-Based Detection and Analysis of Suspicious Activities in Bitcoin Wallet Transactions in the USA

Authors: Md Zahidul Islam, Md Shahidul Islam, Biswajit Chandra das, Syed Ali Reza, Proshanta Kumar Bhowmik, Kanchon Kumar Bishnu, Md Shafiqur Rahman, Redoyan Chowdhury, Laxmi Pant
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03092
Pdf URL: https://arxiv.org/pdf/2504.03092
Copy Paste: [[2504.03092]] Machine Learning-Based Detection and Analysis of Suspicious Activities in Bitcoin Wallet Transactions in the USA(https://arxiv.org/abs/2504.03092)
Keywords: secure
Abstract: The dramatic adoption of Bitcoin and other cryptocurrencies in the USA has revolutionized the financial landscape and provided unprecedented investment and transaction efficiency opportunities. The prime objective of this research project is to develop machine learning algorithms capable of effectively identifying and tracking suspicious activity in Bitcoin wallet transactions. With high-tech analysis, the study aims to create a model with a feature for identifying trends and outliers that can expose illicit activity. The current study specifically focuses on Bitcoin transaction information in America, with a strong emphasis on understanding the immediate environment through which such transactions pass. The dataset is composed of in-depth Bitcoin wallet transactional information, including important factors such as transaction values, timestamps, network flows, and addresses for wallets. All entries in the dataset expose information about financial transactions between wallets, including received and sent transactions, and such information is significant for analysis and trends that can represent suspicious activity. This study deployed three accredited algorithms, namely Logistic Regression, Random Forest, and Support Vector Machines. In retrospect, Random Forest emerged as the best model with the highest F1 Score, showcasing its ability to handle non-linear relationships in the data. Insights revealed significant patterns in wallet activity, such as the correlation between unredeemed transactions and final balances. The application of machine learning algorithms in tracking cryptocurrencies is a tool for creating transparent and secure U.S. markets.

Title: Post-processing for Fair Regression via Explainable SVD

Authors: Zhiqun Zuo, Ding Zhu, Mohammad Mahdi Khalili
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03093
Pdf URL: https://arxiv.org/pdf/2504.03093
Copy Paste: [[2504.03093]] Post-processing for Fair Regression via Explainable SVD(https://arxiv.org/abs/2504.03093)
Keywords: fair
Abstract: This paper presents a post-processing algorithm for training fair neural network regression models that satisfy statistical parity, utilizing an explainable singular value decomposition (SVD) of the weight matrix. We propose a linear transformation of the weight matrix, whereby the singular values derived from the SVD of the transformed matrix directly correspond to the differences in the first and second moments of the output distributions across two groups. Consequently, we can convert the fairness constraints into constraints on the singular values. We analytically solve the problem of finding the optimal weights under these constraints. Experimental validation on various datasets demonstrates that our method achieves a similar or superior fairness-accuracy trade-off compared to the baselines without using the sensitive attribute at the inference time.
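
The abstract above ties fairness to differences in the first and second moments of the output distributions across two groups. As a hedged illustration of what those quantities look like when measured on predictions, the sketch below computes both gaps directly. The function name `moment_gaps` is hypothetical, the second moment is taken as the raw moment E[y^2] (an assumption, since the abstract does not say whether raw or central moments are meant), and this is a diagnostic only, not the paper's SVD-based post-processing.

```python
from typing import Sequence, Tuple


def moment_gaps(
    preds: Sequence[float], groups: Sequence[int]
) -> Tuple[float, float]:
    """Absolute gaps in the first and raw second moments between two groups.

    `groups` holds 0 or 1 for each prediction. Equal moments are necessary
    for statistical parity but, in general, not sufficient for it.
    """
    if len(preds) != len(groups):
        raise ValueError("preds and groups must have the same length")
    g0 = [p for p, g in zip(preds, groups) if g == 0]
    g1 = [p for p, g in zip(preds, groups) if g == 1]
    if not g0 or not g1:
        raise ValueError("both groups need at least one prediction")

    mean0 = sum(g0) / len(g0)
    mean1 = sum(g1) / len(g1)
    second0 = sum(p * p for p in g0) / len(g0)
    second1 = sum(p * p for p in g1) / len(g1)
    return abs(mean0 - mean1), abs(second0 - second1)
```

A pair of gaps close to zero suggests the two output distributions agree up to second order; larger values indicate a disparity that a post-processing step would need to remove.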

Title: Scaling Open-Vocabulary Action Detection

Authors: Zhen Hao Sia, Yogesh Singh Rawat
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03096
Pdf URL: https://arxiv.org/pdf/2504.03096
Copy Paste: [[2504.03096]] Scaling Open-Vocabulary Action Detection(https://arxiv.org/abs/2504.03096)
Keywords: robust
Abstract: In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work.

Title: Single-Pass Document Scanning for Question Answering

Authors: Weili Cao, Jianyou Wang, Youze Zheng, Longtian Bao, Qirui Zheng, Taylor Berg-Kirkpatrick, Ramamohan Paturi, Leon Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03101
Pdf URL: https://arxiv.org/pdf/2504.03101
Copy Paste: [[2504.03101]] Single-Pass Document Scanning for Question Answering(https://arxiv.org/abs/2504.03101)
Keywords: transformer, large language model
Abstract: Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at this https URL

Title: Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation

Authors: Xuanyu Liu, Huiyun Yao, Jinggui Gao, Zhongyi Guo, Xue Zhang, Yulin Dong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03108
Pdf URL: https://arxiv.org/pdf/2504.03108
Copy Paste: [[2504.03108]] Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation(https://arxiv.org/abs/2504.03108)
Keywords: extraction, transformer, segmentation
Abstract: Background: Convolutional Neural Networks (CNN) and Vision Transformers (ViT) are the main techniques used in medical image segmentation. However, CNN is limited to local contextual information, and ViT's quadratic complexity results in significant computational costs. At the same time, equipping the model to distinguish lesion boundaries with varying degrees of severity is also a challenge encountered in skin lesion segmentation. Purpose: This research aims to optimize the balance between computational costs and long-range dependency modelling and achieve excellent generalization across lesions with different degrees of severity. Methods: We propose a lightweight U-shape network that utilizes Vision Fastformer with Fusion Mechanism (VFFM-UNet). We inherit the advantages of Fastformer's additive attention mechanism, combining element-wise product and matrix product for comprehensive feature extraction and channel reduction to save computational costs. In order to accurately identify the lesion boundaries with varying degrees of severity, we design a Fusion Mechanism including Multi-Granularity Fusion and Channel Fusion, which can process the feature maps at the granularity and channel levels to obtain different contextual information. Results: Comprehensive experiments on the ISIC2017, ISIC2018 and PH2 datasets demonstrate that VFFM-UNet outperforms existing state-of-the-art models regarding parameter numbers, computational complexity and segmentation performance. In short, compared to MISSFormer, our model achieves superior segmentation performance while reducing parameter and computation costs by 101x and 15x, respectively. Conclusions: Both quantitative and qualitative analyses show that VFFM-UNet sets a new benchmark by reaching an ideal balance between parameter numbers, computational complexity, and segmentation performance compared to existing state-of-the-art models.

Title: Les Dissonances: Cross-Tool Harvesting and Polluting in Multi-Tool Empowered LLM Agents

Authors: Zichuan Li, Jian Cui, Xiaojing Liao, Luyi Xing
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.03111
Pdf URL: https://arxiv.org/pdf/2504.03111
Copy Paste: [[2504.03111]] Les Dissonances: Cross-Tool Harvesting and Polluting in Multi-Tool Empowered LLM Agents(https://arxiv.org/abs/2504.03111)
Keywords: secure, security, protect, attack, large language model
Abstract: Large Language Model (LLM) agents are autonomous systems powered by LLMs, capable of reasoning and planning to solve problems by leveraging a set of tools. However, the integration of multi-tool capabilities in LLM agents introduces challenges in securely managing tools, ensuring their compatibility, handling dependency relationships, and protecting control flows within LLM agent workflows. In this paper, we present the first systematic security analysis of task control flows in multi-tool-enabled LLM agents. We identify a novel threat, Cross-Tool Harvesting and Polluting (XTHP), which includes multiple attack vectors to first hijack the normal control flows of agent tasks, and then collect and pollute confidential or private information within LLM agent systems. To understand the impact of this threat, we developed Chord, a dynamic scanning tool designed to automatically detect real-world agent tools susceptible to XTHP attacks. Our evaluation of 73 real-world tools from the repositories of two major LLM agent development frameworks, LangChain and LlamaIndex, revealed a significant security concern: 80% of the tools are vulnerable to hijacking attacks, 78% to XTH attacks, and 41% to XTP attacks, highlighting the prevalence of this threat.

Title: NuWa: Deriving Lightweight Task-Specific Vision Transformers for Edge Devices

Authors: Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Yun Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03118
Pdf URL: https://arxiv.org/pdf/2504.03118
Copy Paste: [[2504.03118]] NuWa: Deriving Lightweight Task-Specific Vision Transformers for Edge Devices(https://arxiv.org/abs/2504.03118)
Keywords: transformer
Abstract: Vision Transformers (ViTs) excel in computer vision tasks but lack flexibility for edge devices' diverse needs. A vital issue is that ViTs pre-trained to cover a broad range of tasks are over-qualified for edge devices that usually demand only part of a ViT's knowledge for specific tasks. Their task-specific accuracy on these edge devices is suboptimal. We discovered that small ViTs that focus on device-specific tasks can improve model accuracy and in the meantime, accelerate model inference. This paper presents NuWa, an approach that derives small ViTs from the base ViT for edge devices with specific task requirements. NuWa can transfer task-specific knowledge extracted from the base ViT into small ViTs that fully leverage constrained resources on edge devices to maximize model accuracy with inference latency assurance. Experiments with three base ViTs on three public datasets demonstrate that compared with state-of-the-art solutions, NuWa improves model accuracy by up to 11.83% and accelerates model inference by 1.29× to 2.79×. Code for reproduction is available at this https URL.

Title: FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge

Authors: Kahim Wong, Jicheng Zhou, Kemou Li, Yain-Whar Si, Xiaowei Wu, Jiantao Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03128
Pdf URL: https://arxiv.org/pdf/2504.03128
Copy Paste: [[2504.03128]] FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge(https://arxiv.org/abs/2504.03128)
Keywords: security, protect, robust, watermark, segmentation
Abstract: The proliferation of AI-generated content brings significant concerns about forensic and security issues such as source tracing, copyright protection, etc., highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at this https URL.

Title: Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

Authors: Junkai Zhang, Bin Li, Shoujun Zhou, Yue Du
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03135
Pdf URL: https://arxiv.org/pdf/2504.03135
Copy Paste: [[2504.03135]] Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion(https://arxiv.org/abs/2504.03135)
Keywords: transformer
Abstract: Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis. Designing Med-VQA systems holds profound importance in assisting clinical diagnosis and enhancing diagnostic accuracy. Building upon this foundation, Hierarchical Medical VQA extends Medical VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, causing semantic fragmentation across hierarchies; (2) excessive reliance on implicit learning in Transformer-based cross-modal self-attention fusion methods obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes the HiCA-VQA method, which includes two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model in focusing on specific image regions according to question types, while the hierarchical decoder performs separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module where images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that the HiCA-VQA framework outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
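
The abstract above describes a cross-attention fusion in which image features act as queries and text features as keys and values. A minimal sketch of that wiring, under stated assumptions, is given below: a single head, standard scaled dot-product attention, and no learned projection matrices (a real module would normally project queries, keys, and values first). The names `softmax` and `cross_attention` are illustrative and not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Single-head cross-attention: image rows are queries, text rows are
    both keys and values. Returns one fused vector per image feature.
    """
    if not image_feats or not text_feats:
        raise ValueError("both inputs must be non-empty")
    dim = len(text_feats[0])
    if any(len(row) != dim for row in list(image_feats) + list(text_feats)):
        raise ValueError("all feature vectors must share one dimension")

    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in image_feats:
        # Similarity of this image feature to every text token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) for key in text_feats
        ]
        weights = softmax(scores)
        # Weighted sum of text values, one output per image query.
        fused.append(
            [sum(w * value[j] for w, value in zip(weights, text_feats))
             for j in range(dim)]
        )
    return fused
```

Because the queries come from the image, the output keeps one row per image region, with each row re-expressed as a mixture of text token vectors.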

Title: Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable

Authors: Xin Jin, Simon Niklaus, Zhoutong Zhang, Zhihao Xia, Chunle Guo, Yuting Yang, Jiawen Chen, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03136
Pdf URL: https://arxiv.org/pdf/2504.03136
Copy Paste: [[2504.03136]] Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable(https://arxiv.org/abs/2504.03136)
Keywords: robust
Abstract: Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.

Title: Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

Authors: Xuran Ma, Yexin Liu, Yaofu Liu, Xianfeng Wu, Mingzhe Zheng, Zihao Wang, Ser-Nam Lim, Harry Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03140
Pdf URL: https://arxiv.org/pdf/2504.03140
Copy Paste: [[2504.03140]] Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models(https://arxiv.org/abs/2504.03140)
Keywords: diffusion
Abstract: Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, the computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. To address this gap, we introduce ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal two key observations: (1) most layers exhibit a consistent preference for either foreground or background regions; (2) predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. These findings inspire us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Our approach substantially reduces computational overhead while preserving visual fidelity. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., a 2.01× speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.

Title: Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Authors: Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03151
Pdf URL: https://arxiv.org/pdf/2504.03151
Copy Paste: [[2504.03151]] Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)(https://arxiv.org/abs/2504.03151)
Keywords: robust, large language model
Abstract: Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.

Title: MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories

Authors: Natalie Tirabassi, Sathish A. P. Kumar, Sumit Jha, Arvind Ramanathan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03153
Pdf URL: https://arxiv.org/pdf/2504.03153
Copy Paste: [[2504.03153]] MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories(https://arxiv.org/abs/2504.03153)
Keywords: transformer
Abstract: We propose MORAL (a multimodal reinforcement learning framework for decision making in autonomous laboratories) that enhances sequential decision-making in autonomous robotic laboratories through the integration of visual and textual inputs. Using the BridgeData V2 dataset, we generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy. The fused representations are processed using Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) agents. Experimental results demonstrate that multimodal agents achieve a 20% improvement in task completion rates and significantly outperform visual-only and textual-only baselines after sufficient training. Compared to transformer-based and recurrent multimodal RL models, our approach achieves superior performance in cumulative reward and caption quality metrics (BLEU, METEOR, ROUGE-L). These results highlight the impact of semantically aligned language cues in enhancing agent learning efficiency and generalization. The proposed framework contributes to the advancement of multimodal reinforcement learning and embodied AI systems in dynamic, real-world environments.

Title: TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference

Authors: Junshan Hu, Jialiang Mao, Zhikang Liu, Zhongpu Xia, Peng Jia, Xianpeng Lang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03154
Pdf URL: https://arxiv.org/pdf/2504.03154
Copy Paste: [[2504.03154]] TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference(https://arxiv.org/abs/2504.03154)
Keywords: large language model
Abstract: Conventional Vision-Language Models (VLMs) typically utilize a fixed number of vision tokens, regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using excessive tokens leads to unnecessary computational overhead in simpler tasks, whereas insufficient tokens compromise fine-grained visual comprehension in more complex contexts. To overcome these limitations, we present TokenFLEX, an innovative and adaptable vision-language framework that encodes images into a variable number of tokens for efficient integration with a Large Language Model (LLM). Our approach is underpinned by two pivotal innovations. Firstly, we present a novel training paradigm that enhances performance across varying numbers of vision tokens by stochastically modulating token counts during training. Secondly, we design a lightweight vision token projector incorporating an adaptive pooling layer and SwiGLU, allowing for flexible downsampling of vision tokens and adaptive selection of features tailored to specific token counts. Comprehensive experiments reveal that TokenFLEX consistently outperforms its fixed-token counterparts, achieving performance gains of 1.6%, 1.0%, and 0.4% with 64, 144, and 256 tokens, respectively, averaged over eight vision-language benchmarks. These results underscore TokenFLEX's remarkable flexibility while maintaining high-performance vision-language understanding.
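
The abstract above mentions an adaptive pooling layer that downsamples vision tokens to a chosen count. The sketch below illustrates one common form of that idea, adaptive average pooling over a square grid of token vectors, using the usual window rule (start = floor(i * in / out), end = ceil((i + 1) * in / out)). It rests on assumptions not stated in the abstract: that tokens lie on a 2D grid, that target counts such as 64, 144, and 256 correspond to 8x8, 12x12, and 16x16 grids, and that the input grid size used in the usage note is purely hypothetical. The SwiGLU projector is not modelled.

```python
from typing import List, Sequence

TokenGrid = Sequence[Sequence[Sequence[float]]]  # [row][col][feature]


def adaptive_avg_pool_tokens(grid: TokenGrid, out_size: int) -> List[List[float]]:
    """Average-pool an H x W grid of token vectors to out_size x out_size
    tokens, returned as a flat list in row-major order.
    """
    height = len(grid)
    width = len(grid[0])
    dim = len(grid[0][0])
    if not 1 <= out_size <= min(height, width):
        raise ValueError("out_size must be between 1 and the grid size")

    pooled = []
    for i in range(out_size):
        # Integer floor and ceiling give the row window for output row i.
        r0 = (i * height) // out_size
        r1 = ((i + 1) * height + out_size - 1) // out_size
        for j in range(out_size):
            c0 = (j * width) // out_size
            c1 = ((j + 1) * width + out_size - 1) // out_size
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append(
                [sum(cell[k] for cell in cells) / len(cells) for k in range(dim)]
            )
    return pooled
```

For instance, pooling a hypothetical 24x24 token grid with `out_size=8` yields 64 tokens, and `out_size=12` or `out_size=16` yields 144 or 256, so one projector can serve several token budgets.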

Title: Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction

Authors: Junlang Qian, Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Zepeng Zhai, Kezhi Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03159
Pdf URL: https://arxiv.org/pdf/2504.03159
Copy Paste: [[2504.03159]] Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction(https://arxiv.org/abs/2504.03159)
Keywords: robust, large language model
Abstract: Zero-shot text classification typically relies on prompt engineering, but the inherent prompt brittleness of large language models undermines its reliability. Minor changes in the prompt can cause significant discrepancies in model performance. We attribute this prompt brittleness largely to the narrow focus on next-token probabilities in existing methods. To address this, we propose Placeholding Parallel Prediction (P3), a novel approach that predicts token probabilities across multiple positions and simulates comprehensive sampling of generation paths in a single run of a language model. Experiments show improved accuracy and up to 98% reduction in the standard deviation across prompts, boosting robustness. Even without a prompt, P3 maintains comparable performance, reducing the need for prompt engineering.

Title: Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking

Authors: Zihan Gu, Ruoyu Chen, Hua Zhang, Yue Hu, Xiaochun Cao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03162
Pdf URL: https://arxiv.org/pdf/2504.03162
Copy Paste: [[2504.03162]] Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking(https://arxiv.org/abs/2504.03162)
Keywords: transformer
Abstract: Grokking, referring to the abrupt improvement in test accuracy after extended overfitting, offers valuable insights into the mechanisms of model generalization. Existing research based on progress measures implies that grokking relies on understanding the optimization dynamics when the loss function is dominated solely by the weight decay term. However, we find that this optimization merely leads to token uniformity, which is not a sufficient condition for grokking. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations. Based on theoretical analysis and experimental validation, we present the following insights: (i) The weight decay term encourages uniformity across all tokens in the embedding space when it is minimized. (ii) The occurrence of grokking is jointly determined by the uniformity of the embedding space and the distribution of the training dataset. Building on these insights, we provide a unified perspective for understanding various previously proposed progress measures and introduce a novel, concise, and effective progress measure that could trace the changes in test loss more accurately. Finally, to demonstrate the versatility of our theoretical framework, we design a dedicated dataset to validate our theory on ResNet-18, successfully showcasing the occurrence of grokking.

Title: Enhanced Penalty-based Bidirectional Reinforcement Learning Algorithms

Authors: Sai Gana Sandeep Pula, Sathish A. P. Kumar, Sumit Jha, Arvind Ramanathan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03163
Pdf URL: https://arxiv.org/pdf/2504.03163
Copy Paste: [[2504.03163]] Enhanced Penalty-based Bidirectional Reinforcement Learning Algorithms(https://arxiv.org/abs/2504.03163)
Keywords: robust
Abstract: This research focuses on enhancing reinforcement learning (RL) algorithms by integrating penalty functions to guide agents in avoiding unwanted actions while optimizing rewards. The goal is to improve the learning process by ensuring that agents learn not only suitable actions but also which actions to avoid. Additionally, we reintroduce a bidirectional learning approach that enables agents to learn from both initial and terminal states, thereby improving speed and robustness in complex environments. Our proposed Penalty-Based Bidirectional methodology is tested on ManiSkill benchmark environments, demonstrating an improvement in success rate of approximately 4% compared to existing RL implementations. The findings indicate that this integrated strategy enhances policy learning, adaptability, and overall performance in challenging scenarios.

Title: Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation

Authors: Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03165
Pdf URL: https://arxiv.org/pdf/2504.03165
Copy Paste: [[2504.03165]] Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation(https://arxiv.org/abs/2504.03165)
Keywords: robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge integration during large language model (LLM) inference in recent years. However, current RAG implementations face challenges in effectively addressing noise, repetition and redundancy in retrieved content, primarily due to their limited ability to exploit fine-grained inter-document relationships. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC²-RAG) that effectively utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5, on widely used knowledge-QA and hallucination-detection datasets. The results show that this method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets can be found at this https URL.
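
To make the notion of clustering retrieved documents to remove redundancy more concrete, here is a minimal sketch that groups documents greedily by word-overlap (Jaccard) similarity and keeps one representative per group. This is a deliberately simple stand-in, not the EDC²-RAG algorithm: the similarity measure, the 0.6 threshold, and all function names are assumptions chosen for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster_documents(docs, threshold=0.6):
    """Greedy clustering: each document joins the first cluster whose
    representative (first member) is at least `threshold` similar to it,
    otherwise it starts a new cluster. Returns lists of document indices."""
    clusters = []
    for idx, doc in enumerate(docs):
        for cluster in clusters:
            if jaccard(docs[cluster[0]], doc) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def compress(docs, threshold=0.6):
    """Keep only the representative of each cluster, in retrieval order."""
    return [docs[c[0]] for c in cluster_documents(docs, threshold)]


docs = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Berlin is the capital of Germany",
]
clusters = cluster_documents(docs)
kept = compress(docs)
```

In this toy run the first two documents share the same word set and collapse into one cluster, while the third overlaps on only half its words and is kept separately.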

Title: RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

Authors: Hanbo Bi, Yingchao Feng, Boyuan Tong, Mengyu Wang, Haichen Yu, Yongqiang Mao, Hao Chang, Wenhui Diao, Peijin Wang, Yue Yu, Hanyang Peng, Yehong Zhang, Kun Fu, Xian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03166
Pdf URL: https://arxiv.org/pdf/2504.03166
Copy Paste: [[2504.03166]] RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation(https://arxiv.org/abs/2504.03166)
Keywords: segmentation
Abstract: The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.

Title: REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval

Authors: Shabnam Choudhury, Yash Salunkhe, Sarthak Mehrotra, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03169
Pdf URL: https://arxiv.org/pdf/2504.03169
Copy Paste: [[2504.03169]] REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval(https://arxiv.org/abs/2504.03169)
Keywords: generative
Abstract: The rapid expansion of remote sensing image archives demands the development of strong and efficient techniques for content-based image retrieval (RS-CBIR). This paper presents REJEPA (Retrieval with Joint-Embedding Predictive Architecture), an innovative self-supervised framework designed for unimodal RS-CBIR. REJEPA utilises spatially distributed context token encoding to forecast abstract representations of target tokens, effectively capturing high-level semantic features and eliminating unnecessary pixel-level details. In contrast to generative methods that focus on pixel reconstruction or contrastive techniques that depend on negative pairs, REJEPA functions within feature space, achieving a reduction in computational complexity of 40-60% when compared to pixel-reconstruction baselines like Masked Autoencoders (MAE). To guarantee strong and varied representations, REJEPA incorporates Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder collapse by promoting feature diversity and reducing redundancy. The method demonstrates an estimated enhancement in retrieval accuracy of 5.1% on BEN-14K (S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel compared to prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE, ScaleMAE, and SatMAE++, on extensive RS benchmarks BEN-14K (multispectral and SAR data), FMoW-RGB and FMoW-Sentinel. Through effective generalisation across sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for efficient, scalable, and precise RS-CBIR, addressing challenges like varying resolutions, high object density, and complex backgrounds with computational efficiency.

Title: PPFPL: Cross-silo Privacy-preserving Federated Prototype Learning Against Data Poisoning Attacks on Non-IID Data

Authors: Hongliang Zhang, Jiguo Yu, Fenghua Xu, Chunqiang Hu, Yongzhao Zhang, Xiaofen Wang, Zhongyuan Yu, Xiaosong Zhang
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2504.03173
Pdf URL: https://arxiv.org/pdf/2504.03173
Copy Paste: [[2504.03173]] PPFPL: Cross-silo Privacy-preserving Federated Prototype Learning Against Data Poisoning Attacks on Non-IID Data(https://arxiv.org/abs/2504.03173)
Keywords: secure, privacy, attack, robust, federate
Abstract: Privacy-Preserving Federated Learning (PPFL) allows multiple clients to collaboratively train a deep learning model by submitting hidden model updates. Nonetheless, PPFL is vulnerable to data poisoning attacks due to the distributed training nature of clients. Existing solutions have struggled to improve the performance of cross-silo PPFL in poisoned Non-IID data. To address these issues, this paper proposes a privacy-preserving federated prototype learning framework, named PPFPL, which enhances the cross-silo FL performance in poisoned Non-IID data while effectively resisting data poisoning attacks. Specifically, we adopt prototypes as client-submitted model updates to eliminate the impact of tampered data distribution on federated learning. Moreover, we utilize two servers to achieve Byzantine-robust aggregation via a secure aggregation protocol, which greatly reduces the impact of malicious clients. Theoretical analyses confirm the convergence of PPFPL, and experimental results on publicly available datasets show that PPFPL is effective for resisting data poisoning attacks under Non-IID conditions.

Title: Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

Authors: Jaymari Chua, Chen Wang, Lina Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03185
Pdf URL: https://arxiv.org/pdf/2504.03185
Copy Paste: [[2504.03185]] Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents(https://arxiv.org/abs/2504.03185)
Keywords: robust, large language model
Abstract: Generalizable alignment is a core challenge for deploying Large Language Models (LLMs) safely in real-world NLP applications. Current alignment methods, including Reinforcement Learning from Human Feedback (RLHF), often fail to guarantee constraint satisfaction outside their training distribution due to their reliance on implicit, post-hoc preferences. Inspired by a paradigm shift to first curate data before tuning, we introduce a new framework for safe language alignment that learns natural language constraints from positive and negative demonstrations as a primary step. By inferring both a task-specific reward function and latent constraint functions, our approach fosters adaptation to novel safety requirements and robust generalization under domain shifts and adversarial inputs. We formalize the framework within a Constrained Markov Decision Process (CMDP) and validate it via a text-based navigation environment, demonstrating safe adaptation to changing danger zones. Our experiments show fewer violations upon domain shift when following a safe navigation path, and we achieve zero violations by applying learned constraints to a distilled BERT model as a fine-tuning technique. This work offers a promising path toward building safety-critical and more generalizable LLMs for practical NLP settings.
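
For readers unfamiliar with the CMDP setting mentioned above, the textbook form of the problem is given below. The notation (policy \pi, reward r, constraint costs c_i, budgets d_i, discount \gamma) is the standard one and is an assumption here; the paper may use different symbols or additional terms.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t) \right] \le d_i ,
\qquad i = 1, \dots, m .
```

In the framing of the abstract, both the reward r and the constraint costs c_i would be inferred from demonstrations rather than specified by hand.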

Title: On the Connection Between Diffusion Models and Molecular Dynamics

Authors: Liam Harcombe, Timothy T. Duignan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03187
Pdf URL: https://arxiv.org/pdf/2504.03187
Copy Paste: [[2504.03187]] On the Connection Between Diffusion Models and Molecular Dynamics(https://arxiv.org/abs/2504.03187)
Keywords: diffusion
Abstract: Neural Network Potentials (NNPs) have emerged as a powerful tool for modelling atomic interactions with high accuracy and computational efficiency. Recently, denoising diffusion models have shown promise in NNPs by training networks to remove noise added to stable configurations, eliminating the need for force data during training. In this work, we explore the connection between noise and forces by providing a new, simplified mathematical derivation of their relationship. We also demonstrate how a denoising model can be implemented using a conventional MD software package interfaced with a standard NNP architecture. We demonstrate the approach by training a diffusion-based NNP to simulate a coarse-grained lithium chloride solution and employ data duplication to enhance model performance.
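
The link between noise and forces mentioned above can be illustrated with the standard score-matching argument, sketched here under the assumption that training configurations are samples from a Boltzmann distribution at temperature T. This is the commonly cited relationship, not necessarily the derivation given in the paper, and the symbols U, F, \sigma, and \epsilon are introduced only for this sketch.

```latex
% Boltzmann density and its score
p(x) \propto \exp\!\left(-\frac{U(x)}{k_B T}\right)
\quad\Longrightarrow\quad
\nabla_x \log p(x) = -\frac{\nabla_x U(x)}{k_B T} = \frac{F(x)}{k_B T}

% Gaussian noising and the denoising identity
\tilde{x} = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = -\frac{\mathbb{E}[\epsilon \mid \tilde{x}]}{\sigma}

% For small \sigma, p_\sigma \approx p, so a network \epsilon_\theta trained to predict the noise gives
F(\tilde{x}) \approx -\frac{k_B T}{\sigma}\, \epsilon_\theta(\tilde{x})
```

Under these assumptions a noise-prediction network acts as a force field up to the factor k_B T / \sigma, which suggests how it could be driven by a conventional MD integrator.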

Title: Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

Authors: Xin Zhang, Robby T. Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03193
Pdf URL: https://arxiv.org/pdf/2504.03193
Copy Paste: [[2504.03193]] Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation(https://arxiv.org/abs/2504.03193)
Keywords: robust, segmentation
Abstract: Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at this https URL.

Title: Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Authors: Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03197
Pdf URL: https://arxiv.org/pdf/2504.03197
Copy Paste: [[2504.03197]] Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation(https://arxiv.org/abs/2504.03197)
Keywords: large language model
Abstract: With the rapid advancement of mathematical reasoning capabilities in large language models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: visual explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce a novel task of visual solution explanation, which requires not only solving problems but also generating explanations that incorporate newly introduced visual elements essential for understanding (e.g., auxiliary lines, annotations, or geometric constructions). To evaluate model performance on this task, we propose MathExplain, a multimodal benchmark consisting of 997 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that while some closed-source models demonstrate promising capabilities on visual solution-explaining, current open-source general-purpose models perform inconsistently, particularly in identifying relevant visual components and producing coherent keypoint-based explanations. We expect that visual solution-explaining and the MathExplain dataset will catalyze further research on multimodal LLMs in education and advance their deployment as effective, explanation-oriented AI tutors. Code and data will be released publicly.

Title: PIONM: A Generalized Approach to Solving Density-Constrained Mean-Field Games Equilibrium under Modified Boundary Conditions

Authors: Jinwei Liu, Wang Yao, Xiao Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03209
Pdf URL: https://arxiv.org/pdf/2504.03209
Copy Paste: [[2504.03209]] PIONM: A Generalized Approach to Solving Density-Constrained Mean-Field Games Equilibrium under Modified Boundary Conditions(https://arxiv.org/abs/2504.03209)
Keywords: diffusion
Abstract: Neural network-based methods are effective for solving equilibria in Mean-Field Games (MFGs), particularly in high-dimensional settings. However, solving the coupled partial differential equations (PDEs) in MFGs limits their applicability since solving coupled PDEs is computationally expensive. Additionally, modifying boundary conditions, such as the initial state distribution or terminal value function, necessitates extensive retraining, reducing scalability. To address these challenges, we propose a generalized framework, PIONM (Physics-Informed Neural Operator NF-MKV Net), which leverages physics-informed neural operators to solve MFGs equations. PIONM utilizes neural operators to compute MFGs equilibria for arbitrary boundary conditions. The method encodes boundary conditions as input features and trains the model to align them with density evolution, modeled using discrete-time normalizing flows. Once trained, the algorithm efficiently computes the density distribution at any time step for modified boundary conditions, ensuring efficient adaptation to different boundary conditions in MFGs equilibria. Unlike traditional MFGs methods constrained by fixed coefficients, PIONM efficiently computes equilibria under varying boundary conditions, including obstacles, diffusion coefficients, initial densities, and terminal functions. PIONM can adapt to modified conditions while preserving density distribution constraints, demonstrating superior scalability and generalization capabilities compared to existing methods.

Title: Structured Knowledge Accumulation: The Principle of Entropic Least Action in Forward-Only Neural Learning

Authors: Bouarfa Mahi Quantiota
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03214
Pdf URL: https://arxiv.org/pdf/2504.03214
Copy Paste: [[2504.03214]] Structured Knowledge Accumulation: The Principle of Entropic Least Action in Forward-Only Neural Learning(https://arxiv.org/abs/2504.03214)
Keywords: robust
Abstract: This paper aims to extend the Structured Knowledge Accumulation (SKA) framework recently proposed by Mahi (2025). We introduce two core concepts: the Tensor Net function and the characteristic time property of neural learning. First, we reinterpret the learning rate as a time step in a continuous system. This transforms neural learning from discrete optimization into continuous-time evolution. We show that learning dynamics remain consistent when the product of learning rate and iteration steps stays constant. This reveals a time-invariant behavior and identifies an intrinsic timescale of the network. Second, we define the Tensor Net function as a measure that captures the relationship between decision probabilities, entropy gradients, and knowledge change. Additionally, we define its zero-crossing as the equilibrium state between decision probabilities and entropy gradients. We show that the convergence of entropy and knowledge flow provides a natural stopping condition, replacing arbitrary thresholds with an information-theoretic criterion. We also establish that SKA dynamics satisfy a variational principle based on the Euler-Lagrange equation. These findings extend SKA into a continuous and self-organizing learning model. The framework links computational learning with physical systems that evolve by natural laws. By understanding learning as a time-based process, we open new directions for building efficient, robust, and biologically-inspired AI systems.
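
The claim that dynamics stay consistent when the product of learning rate and iteration count is held constant can be illustrated on a toy problem. The sketch below uses plain gradient descent on a one-dimensional quadratic, which is an assumption made for illustration and not the SKA update rule: two runs with the same product (learning rate x steps = 1) land close to each other and to the continuous-time solution exp(-t) at t = 1.

```python
import math


def gradient_descent(x0, curvature, learning_rate, steps):
    """Minimise f(x) = 0.5 * curvature * x**2 with plain gradient descent."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * curvature * x  # gradient of f is curvature * x
    return x


# Same "elapsed time" learning_rate * steps = 1.0, different step sizes.
x_coarse = gradient_descent(x0=1.0, curvature=1.0, learning_rate=0.01, steps=100)
x_fine = gradient_descent(x0=1.0, curvature=1.0, learning_rate=0.001, steps=1000)

# Continuous-time gradient flow dx/dt = -curvature * x, evaluated at t = 1.
x_flow = math.exp(-1.0)

print(x_coarse, x_fine, x_flow)
```

The agreement improves as the learning rate shrinks, which is what one expects if the learning rate plays the role of a time step in an underlying continuous system.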

Title: Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics

Authors: Jungpil Shin, Abu Saleh Musa Miah, Sota Konnai, Shu Hoshitaka, Pankoo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03221
Pdf URL: https://arxiv.org/pdf/2504.03221
Copy Paste: [[2504.03221]] Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics(https://arxiv.org/abs/2504.03221)
Keywords: extraction
Abstract: Hand gesture recognition using multichannel surface electromyography (sEMG) is challenging due to unstable predictions and inefficient time-varying feature enhancement. To address the lack of signal-based time-varying features, we propose a lightweight, squeeze-excitation deep learning-based, multi-stream approach for extracting time-varying spatial-temporal dynamics features, and use it to build an effective sEMG-based hand gesture recognition system. Each branch of the proposed model was designed to extract hierarchical features, capturing both global and detailed spatial-temporal relationships to ensure feature effectiveness. The first branch, utilizing a Bidirectional-TCN (Bi-TCN), focuses on capturing long-term temporal dependencies by modelling past and future temporal contexts, providing a holistic view of gesture dynamics. The second branch, incorporating a 1D Convolutional layer, separable CNN, and Squeeze-and-Excitation (SE) block, efficiently extracts spatial-temporal features while emphasizing critical feature channels, enhancing feature relevance. The third branch, combining a Temporal Convolutional Network (TCN) and Bidirectional LSTM (BiLSTM), captures bidirectional temporal relationships and time-varying patterns. Outputs from all branches are fused using concatenation to capture subtle variations in the data and then refined with a channel attention module, selectively focusing on the most informative features while improving computational efficiency. The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively. These results demonstrate the capability of the system to handle complex sEMG dynamics, offering advancements in prosthetic limb control and human-machine interface technologies with significant implications for assistive technologies.

Title: Unlocking Neural Transparency: Jacobian Maps for Explainable AI in Alzheimer's Detection

Authors: Yasmine Mustafa, Mohamed Elmahallawy, Tie Luo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03230
Pdf URL: https://arxiv.org/pdf/2504.03230
Copy Paste: [[2504.03230]] Unlocking Neural Transparency: Jacobian Maps for Explainable AI in Alzheimer's Detection(https://arxiv.org/abs/2504.03230)
Keywords: interpretability, explainability
Abstract: Alzheimer's disease (AD) leads to progressive cognitive decline, making early detection crucial for effective intervention. While deep learning models have shown high accuracy in AD diagnosis, their lack of interpretability limits clinical trust and adoption. This paper introduces a novel pre-model approach leveraging Jacobian Maps (JMs) within a multi-modal framework to enhance explainability and trustworthiness in AD detection. By capturing localized brain volume changes, JMs establish meaningful correlations between model predictions and well-known neuroanatomical biomarkers of AD. We validate JMs through experiments comparing a 3D CNN trained on JMs versus on traditional preprocessed data, which demonstrates superior accuracy. We also employ 3D Grad-CAM analysis to provide both visual and quantitative insights, further showcasing improved interpretability and diagnostic reliability.

Title: Crash Time Matters: HybridMamba for Fine-Grained Temporal Localization in Traffic Surveillance Footage

Authors: Ibne Farabi Shihab, Anuj Sharma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03235
Pdf URL: https://arxiv.org/pdf/2504.03235
Copy Paste: [[2504.03235]] Crash Time Matters: HybridMamba for Fine-Grained Temporal Localization in Traffic Surveillance Footage(https://arxiv.org/abs/2504.03235)
Keywords: robust, transformer
Abstract: Traffic crash detection in long-form surveillance videos is critical for emergency response and infrastructure planning but remains difficult due to the brief and rare nature of crash events. We introduce HybridMamba, a novel architecture that combines visual transformers with state-space temporal modeling to achieve accurate crash time localization. Our method uses multi-level token compression and hierarchical temporal processing to remain computationally efficient without sacrificing temporal resolution. Evaluated on a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of 1.50 seconds, with 65.2 percent of predictions within one second of the ground truth. It outperforms recent video-language models such as TimeChat and VideoLLaMA2 by up to 2.8 seconds, while using significantly fewer parameters. Our results demonstrate strong generalization across videos ranging from 2 to 40 minutes in diverse conditions. HybridMamba offers a robust and efficient solution for fine-grained temporal localization in traffic surveillance. The code will be released upon publication.
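
The two headline metrics quoted above (mean absolute error in seconds, and the share of predictions within one second of the ground truth) can be computed as in the sketch below. The function name, the inclusive tolerance, and the sample timestamps are assumptions for illustration; they are not data from the paper.

```python
def localization_metrics(predicted, actual, tolerance=1.0):
    """Return (mean absolute error, fraction of errors <= tolerance).

    `predicted` and `actual` are crash times in seconds, one per video.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    mae = sum(errors) / len(errors)
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mae, within


# Made-up timestamps (seconds from the start of each video).
predicted = [12.0, 45.5, 300.0, 77.0]
actual = [12.5, 44.0, 303.0, 77.0]

mae, within_one_second = localization_metrics(predicted, actual)
print(mae, within_one_second)
```

On this toy data the absolute errors are 0.5, 1.5, 3.0, and 0.0 seconds, so the two metrics summarize quite different aspects of the same error distribution.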

Title: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs

Authors: Akis Nousias, Efklidis Katsaros, Evangelos Syrmos, Panagiotis Radoglou-Grammatikis, Thomas Lagkas, Vasileios Argyriou, Ioannis Moscholios, Evangelos Markakis, Sotirios Goudos, Panagiotis Sarigiannidis
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.03238
Pdf URL: https://arxiv.org/pdf/2504.03238
Copy Paste: [[2504.03238]] Malware Detection in Docker Containers: An Image is Worth a Thousand Logs(https://arxiv.org/abs/2504.03238)
Keywords: security, attack
Abstract: Malware detection is increasingly challenged by evolving techniques like obfuscation and polymorphism, limiting the effectiveness of traditional methods. Meanwhile, the widespread adoption of software containers has introduced new security challenges, including the growing threat of malicious software injection, where a container, once compromised, can serve as an entry point for further cyberattacks. In this work, we address these security issues by introducing a method to identify compromised containers through machine learning analysis of their file systems. We cast entire software containers into large RGB images via their tarball representations, and propose to use established Convolutional Neural Network architectures in a streaming, patch-based manner. To support our experiments, we release the COSOCO dataset, the first of its kind, containing 3364 large-scale RGB images of benign and compromised software containers at this https URL. Our method detects more malware and achieves higher F1 and Recall scores than all individual VirusTotal engines and their ensembles, demonstrating its effectiveness and setting a new standard for identifying malware-compromised software containers.
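
One plausible way to turn a byte stream such as a tarball into an RGB image and walk over it patch by patch is sketched below. The abstract does not specify the actual mapping, so the three-bytes-per-pixel layout, zero padding, image width, and non-overlapping tiles are all assumptions for illustration, and the function names are hypothetical.

```python
def bytes_to_rgb_rows(data, width):
    """Pack bytes into rows of (R, G, B) pixels, zero-padding the last row."""
    row_bytes = width * 3
    padded = data + bytes((-len(data)) % row_bytes)  # bytes(n) is n zero bytes
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    return [pixels[r:r + width] for r in range(0, len(pixels), width)]


def iter_patches(rows, patch):
    """Yield non-overlapping patch x patch tiles, row-major.

    Partial tiles at the right and bottom edges are dropped in this sketch.
    """
    height, width = len(rows), len(rows[0])
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            yield [row[left:left + patch] for row in rows[top:top + patch]]


# Stand-in for tarball contents: 48 bytes -> 16 pixels -> a 4 x 4 image.
data = bytes(range(48))
rows = bytes_to_rgb_rows(data, width=4)
patches = list(iter_patches(rows, patch=2))
```

Because `iter_patches` is a generator, a classifier could score tiles one at a time without holding the whole image in memory, which is the general appeal of a streaming, patch-based setup.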

Title: Rotation Invariance in Floor Plan Digitization using Zernike Moments

Authors: Marius Graumann (1), Jan Marius Stürmer (1), Tobias Koch (1) ((1) German Aerospace Center (DLR), Institute for the Protection of Terrestrial Infrastructures)
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03241
Pdf URL: https://arxiv.org/pdf/2504.03241
Copy Paste: [[2504.03241]] Rotation Invariance in Floor Plan Digitization using Zernike Moments(https://arxiv.org/abs/2504.03241)
Keywords: extraction
Abstract: Nowadays, a lot of old floor plans exist in printed form or are stored as scanned raster images. Slight rotations or shifts may occur during scanning. Converting such floor plans into a machine-readable form to enable further use still poses a problem. Therefore, we propose an end-to-end pipeline that pre-processes the image and leverages a novel approach to create a region adjacency graph (RAG) from the pre-processed image and predict its nodes. By incorporating normalization steps into the RAG feature extraction, we significantly improve the rotation invariance of the RAG feature calculation. Moreover, applying our method leads to an improved F1 score and IoU on rotated data. Furthermore, we propose a wall splitting algorithm for partitioning walls into segments associated with the corresponding rooms.

Title: FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement

Authors: Gia-Nghia Tran, Quang-Huy Che, Trong-Tai Dam Vu, Bich-Nga Pham, Vinh-Tiep Nguyen, Trung-Nghia Le, Minh-Triet Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03292
Pdf URL: https://arxiv.org/pdf/2504.03292
Copy Paste: [[2504.03292]] FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement(https://arxiv.org/abs/2504.03292)
Keywords: diffusion
Abstract: Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: a Concept Fusion technique and a Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, the Localized Refinement loss function is introduced to preserve subject representative attributes by aligning each concept's attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.

Title: Stance-Driven Multimodal Controlled Statement Generation: New Dataset and Task

Authors: Bingqian Wang, Quan Fang, Jiachen Sun, Xiaoxiao Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03295
Pdf URL: https://arxiv.org/pdf/2504.03295
Copy Paste: [[2504.03295]] Stance-Driven Multimodal Controlled Statement Generation: New Dataset and Task(https://arxiv.org/abs/2504.03295)
Keywords: large language model
Abstract: Formulating statements that support diverse or controversial stances on specific topics is vital for platforms that enable user expression, reshape political discourse, and drive social critique and information dissemination. With the rise of Large Language Models (LLMs), controllable text generation towards specific stances has become a promising research area with applications in shaping public opinion and commercial marketing. However, current datasets often focus solely on pure texts, lacking multimodal content and effective context, particularly in the context of stance detection. In this paper, we formally define and study the new problem of stance-driven controllable content generation for tweets with text and images, where given a multimodal post (text and image/video), a model generates a stance-controlled response. To this end, we create the Multimodal Stance Generation Dataset (StanceGen2024), the first resource explicitly designed for multimodal stance-controllable text generation in political discourse. It includes posts and user comments from the 2024 U.S. presidential election, featuring text, images, videos, and stance annotations to explore how multimodal political content shapes stance expression. Furthermore, we propose a Stance-Driven Multimodal Generation (SDMG) framework that integrates weighted fusion of multimodal features and stance guidance to improve semantic consistency and stance control. We release the dataset and code (this https URL) for public use and further research.

Title: Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models

Authors: Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03302
Pdf URL: https://arxiv.org/pdf/2504.03302
Copy Paste: [[2504.03302]] Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models(https://arxiv.org/abs/2504.03302)
Keywords: robust, large language model
Abstract: Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations. To address this challenge, we introduce Noise-Augmented Fine-Tuning (NoiseFiT), a novel framework that leverages adaptive noise injection based on the signal-to-noise ratio (SNR) to enhance model robustness. In particular, NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using dynamically scaled Gaussian noise. We further propose a hybrid loss that combines standard cross-entropy, soft cross-entropy, and consistency regularization to ensure stable and accurate outputs under noisy training conditions. Our theoretical analysis shows that adaptive noise injection is both unbiased and variance-preserving, providing strong guarantees for convergence in expectation. Empirical results on multiple test and benchmark datasets demonstrate that NoiseFiT significantly reduces hallucination rates, often improving or matching baseline performance in key tasks. These findings highlight the promise of noise-driven strategies for achieving robust, trustworthy language modeling without incurring prohibitive computational overhead. Given the comprehensive and detailed nature of our experiments, we have publicly released the fine-tuning logs, benchmark evaluation artifacts, and source code online at W&B, Hugging Face, and GitHub, respectively, to foster further research, accessibility, and reproducibility.

Title: Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices

Authors: Luís Couto Seller, Íñigo Sanz Torres, Adrián Vogel-Fernández, Carlos González Carballo, Pedro Miguel Sánchez Sánchez, Adrián Carruana Martín, Enrique de Miguel Ambite
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03312
Pdf URL: https://arxiv.org/pdf/2504.03312
Copy Paste: [[2504.03312]] Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices(https://arxiv.org/abs/2504.03312)
Keywords: robust, large language model
Abstract: Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning. However, their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices. This challenge is especially pronounced for under-resourced languages like those spoken in the Iberian Peninsula, where relatively limited linguistic resources and benchmarks hinder effective evaluation. This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages. The results reveal that while some models consistently excel in certain tasks, significant performance gaps remain, particularly for languages such as Basque. These findings highlight the need for further research on balancing model compactness with robust multilingual performance.

Title: Steerable Anatomical Shape Synthesis with Implicit Neural Representations

Authors: Bram de Wilde, Max T. Rietberg, Guillaume Lajoinie, Jelmer M. Wolterink
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03313
Pdf URL: https://arxiv.org/pdf/2504.03313
Copy Paste: [[2504.03313]] Steerable Anatomical Shape Synthesis with Implicit Neural Representations(https://arxiv.org/abs/2504.03313)
Keywords: generative
Abstract: Generative modeling of anatomical structures plays a crucial role in virtual imaging trials, which allow researchers to perform studies without the costs and constraints inherent to in vivo and phantom studies. For clinical relevance, generative models should allow targeted control to simulate specific patient populations rather than relying on purely random sampling. In this work, we propose a steerable generative model based on implicit neural representations. Implicit neural representations naturally support topology changes, making them well-suited for anatomical structures with varying topology, such as the thyroid. Our model learns a disentangled latent representation, enabling fine-grained control over shape variations. Evaluation includes reconstruction accuracy and anatomical plausibility. Our results demonstrate that the proposed model achieves high-quality shape generation while enabling targeted anatomical modifications.

Title: Data Augmentation of Time-Series Data in Human Movement Biomechanics: A Scoping Review

Authors: Christina Halmich, Lucas Höschler, Christoph Schranz, Christian Borgelt
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2504.03334
Pdf URL: https://arxiv.org/pdf/2504.03334
Copy Paste: [[2504.03334]] Data Augmentation of Time-Series Data in Human Movement Biomechanics: A Scoping Review(https://arxiv.org/abs/2504.03334)
Keywords: robust
Abstract: The integration of machine learning and deep learning has transformed data analytics in biomechanics, enabled by extensive wearable sensor data. However, the field faces challenges such as limited large-scale datasets and high data acquisition costs, which hinder the development of robust algorithms. Data augmentation techniques show promise in addressing these issues, but their application to biomechanical time-series data requires comprehensive evaluation. This scoping review investigates data augmentation methods for time-series data in the biomechanics domain. It analyzes current approaches for augmenting and generating time-series datasets, evaluates their effectiveness, and offers recommendations for applying these techniques in biomechanics. Four databases, PubMed, IEEE Xplore, Scopus, and Web of Science, were searched for studies published between 2013 and 2024. Following PRISMA-ScR guidelines, a two-stage screening identified 21 relevant publications. Results show that there is no universally preferred method for augmenting biomechanical time-series data; instead, methods vary based on study objectives. A major issue identified is the absence of soft tissue artifacts in synthetic data, leading to discrepancies referred to as the synthetic gap. Moreover, many studies lack proper evaluation of augmentation methods, making it difficult to assess their effects on model performance and data quality. This review highlights the critical role of data augmentation in addressing limited dataset availability and improving model generalization in biomechanics. Tailoring augmentation strategies to the characteristics of biomechanical data is essential for advancing predictive modeling. A better understanding of how different augmentation methods impact data quality and downstream tasks will be key to developing more effective and realistic techniques.

Title: QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning

Authors: Quanxing Xu, Ling Zhou, Xian Zhong, Feifei Zhang, Rubing Huang, Chia-Wen Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03337
Pdf URL: https://arxiv.org/pdf/2504.03337
Copy Paste: [[2504.03337]] QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning(https://arxiv.org/abs/2504.03337)
Keywords: robust
Abstract: Existing debiasing approaches in Visual Question Answering (VQA) primarily focus on enhancing visual learning, integrating auxiliary models, or employing data augmentation strategies. However, these methods exhibit two major drawbacks. First, current debiasing techniques fail to capture the superior relation between images and texts because prevalent learning frameworks do not enable models to extract deeper correlations from highly contrasting samples. Second, they do not assess the relevance between the input question and image during inference, as no prior work has examined the degree of input relevance in debiasing studies. Motivated by these limitations, we propose a novel framework, Optimized Question-Image Relation Learning (QIRL), which employs a generation-based self-supervised learning strategy. Specifically, two modules are introduced to address the aforementioned issues. The Negative Image Generation (NIG) module automatically produces highly irrelevant question-image pairs during training to enhance correlation learning, while the Irrelevant Sample Identification (ISI) module improves model robustness by detecting and filtering irrelevant inputs, thereby reducing prediction errors. Furthermore, to validate our concept of reducing output errors through filtering unrelated question-image inputs, we propose a specialized metric to evaluate the performance of the ISI module. Notably, our approach is model-agnostic and can be integrated with various VQA models. Extensive experiments on VQA-CPv2 and VQA-v2 demonstrate the effectiveness and generalization ability of our method. Among data augmentation strategies, our approach achieves state-of-the-art results.

Title: BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Authors: Zébulon Goriely
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03338
Pdf URL: https://arxiv.org/pdf/2504.03338
Copy Paste: [[2504.03338]] BabyLM's First Words: Word Segmentation as a Phonological Probing Task(https://arxiv.org/abs/2504.03338)
Keywords: large language model, segmentation
Abstract: Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction-error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.
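
The boundary-extraction idea in the abstract above (prediction error peaks at the start of words) can be illustrated with a minimal sketch. This is not the paper's method: the function name `segment_by_error_peaks`, the local-peak rule, and the toy error values are all assumptions made for illustration, and the per-phoneme errors would in practice come from a trained phoneme-level language model.

```python
from typing import List


def segment_by_error_peaks(phonemes: List[str], errors: List[float]) -> List[List[str]]:
    """Split a phoneme sequence before every local peak of prediction error.

    `errors[i]` is assumed to be the model's prediction error (for example,
    surprisal) for `phonemes[i]`. The peak rule below is a hypothetical
    heuristic, not the procedure used in the paper.
    """
    if len(phonemes) != len(errors):
        raise ValueError("phonemes and errors must have the same length")

    words: List[List[str]] = []
    current: List[str] = []
    for i, phoneme in enumerate(phonemes):
        # Neighbours outside the sequence are treated so that the first
        # position is never a peak (an utterance always starts a word anyway).
        prev_err = errors[i - 1] if i > 0 else float("inf")
        next_err = errors[i + 1] if i + 1 < len(errors) else float("-inf")
        is_peak = errors[i] > prev_err and errors[i] >= next_err
        if is_peak and current:
            # A peak marks the start of a new word: close the current one.
            words.append(current)
            current = []
        current.append(phoneme)
    if current:
        words.append(current)
    return words


# Toy example with made-up error values.
toy_phonemes = ["dh", "ax", "d", "ao", "g"]
toy_errors = [3.0, 0.5, 2.5, 0.4, 0.3]
toy_words = segment_by_error_peaks(toy_phonemes, toy_errors)
```

With these toy values the error rises only at the third phoneme, so the sequence is split into two candidate words at that position.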

Title: Optimizing Password Cracking for Digital Investigations

Authors: Mohamad Hachem, Adam Lanfranchi, Nathan Clarke, Joakim Kavrestad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.03347
Pdf URL: https://arxiv.org/pdf/2504.03347
Copy Paste: [[2504.03347]] Optimizing Password Cracking for Digital Investigations(https://arxiv.org/abs/2504.03347)
Keywords: security, protect, attack
Abstract: Efficient password cracking is a critical aspect of digital forensics, enabling investigators to decrypt protected content during criminal investigations. Traditional password cracking methods, including brute-force, dictionary, and rule-based attacks, face challenges in balancing efficiency with increasing computational complexity. This study explores rule-based optimisation strategies to enhance the effectiveness of password cracking while minimising resource consumption. By analysing publicly available password datasets, we propose an optimised rule set that reduces computational iterations by approximately 40%, significantly improving the speed of password recovery. Additionally, we examine the impact of national password recommendations, specifically the UK National Cyber Security Centre's three-word password guideline, on password security and forensic recovery. Through user-generated password surveys, we evaluate the crackability of three-word passwords using dictionaries of varying common-word proportions. Results indicate that while three-word passwords provide improved memorability and usability, they remain vulnerable when common word combinations are used, with up to 77.5% of passwords cracked using a 30% common-word dictionary subset. The study underscores the importance of dynamic password cracking strategies that account for evolving user behaviours and policy-driven password structures. The findings contribute to both forensic efficiency and cyber security awareness, highlighting the dual impact of password policies on security and investigative capabilities. Future work will focus on refining rule-based cracking techniques and expanding research on password composition trends.

Title: Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition

Authors: Denis Coquenet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03349
Pdf URL: https://arxiv.org/pdf/2504.03349
Copy Paste: [[2504.03349]] Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition(https://arxiv.org/abs/2504.03349)
Keywords: transformer, segmentation
Abstract: Recent advances in text recognition have led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling better context modeling. It relies on two main components: windowed queries, which process several transformer queries together and extend the context modeling to the near future; and multi-token predictions, which predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at this https URL.

Title: SoK: Attacks on Modern Card Payments

Authors: Xenia Hofmeier, David Basin, Ralf Sasse, Jorge Toro-Pozo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.03363
Pdf URL: https://arxiv.org/pdf/2504.03363
Copy Paste: [[2504.03363]] SoK: Attacks on Modern Card Payments(https://arxiv.org/abs/2504.03363)
Keywords: security, attack
Abstract: EMV is the global standard for smart card payments and its NFC-based version, EMV contactless, is widely used, also for mobile payments. In this systematization of knowledge, we examine attacks on the EMV contactless protocol. We provide a comprehensive framework encompassing its desired security properties and adversary models. We also identify and categorize a comprehensive collection of protocol flaws and show how different subsets thereof can be combined into attacks. In addition to this systematization, we examine the underlying reasons for the many attacks against EMV and point to a better way forward.

Title: FLAIRBrainSeg: Fine-grained brain segmentation using FLAIR MRI only

Authors: Edern Le Bot, Rémi Giraud, Boris Mansencal, Thomas Tourdias, Josè V. Manjon, Pierrick Coupé
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03376
Pdf URL: https://arxiv.org/pdf/2504.03376
Copy Paste: [[2504.03376]] FLAIRBrainSeg: Fine-grained brain segmentation using FLAIR MRI only(https://arxiv.org/abs/2504.03376)
Keywords: robust, segmentation
Abstract: This paper introduces a novel method for brain segmentation using only FLAIR MRIs, specifically targeting cases where access to other imaging modalities is limited. By leveraging existing automatic segmentation methods, we train a network to approximate segmentations, typically obtained from T1-weighted MRIs. Our method, called FLAIRBrainSeg, produces segmentations of 132 structures and is robust to multiple sclerosis lesions. Experiments on both in-domain and out-of-domain datasets demonstrate that our method outperforms modality-agnostic approaches based on image synthesis, the only currently available alternative for performing brain parcellation using FLAIR MRI alone. This technique holds promise for scenarios where T1-weighted MRIs are unavailable and offers a valuable alternative for clinicians and researchers in need of reliable anatomical segmentation.

Title: Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

Authors: Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, Donghyun Kwak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03380
Pdf URL: https://arxiv.org/pdf/2504.03380
Copy Paste: [[2504.03380]] Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning(https://arxiv.org/abs/2504.03380)
Keywords: large language model
Abstract: Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, due to the sparsity of rewards in RORL, effective training is highly dependent on the selection of problems of appropriate difficulty. Although curriculum learning attempts to address this by adjusting difficulty, it often relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we theoretically and empirically show that curating the batch on the fly with problems on which the training model achieves intermediate accuracy, an approach we call balanced online difficulty filtering, can maximize the effectiveness of RORL training. We first derive that the lower bound of the KL divergence between the initial and the optimal policy can be expressed with the variance of the sampled accuracy. Building on those insights, we show that balanced filtering can maximize the lower bound, leading to better performance. Experimental results across five challenging math reasoning benchmarks show that balanced online filtering yields an additional 10% improvement on AIME and a 4% improvement on average over plain GRPO. Moreover, further analysis shows gains in sample efficiency and training time efficiency, exceeding the maximum reward of plain GRPO within 60% of the training time and of the training set volume.
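
To make the filtering idea in the abstract above concrete, here is a minimal sketch of keeping only problems with intermediate sampled accuracy. It assumes binary (0/1) rewards per rollout; the names `balanced_filter` and `bernoulli_variance` and the thresholds 0.25 and 0.75 are hypothetical choices for illustration, not values taken from the paper. For a binary reward with pass rate p, the variance p(1 - p) is largest at p = 0.5, which is one way to see why mid-difficulty problems are the informative ones.

```python
from typing import Dict, List


def bernoulli_variance(p: float) -> float:
    """Variance of a 0/1 reward whose success probability is p."""
    return p * (1.0 - p)


def balanced_filter(
    rollout_rewards: Dict[str, List[int]],
    low: float = 0.25,   # hypothetical lower pass-rate threshold
    high: float = 0.75,  # hypothetical upper pass-rate threshold
) -> List[str]:
    """Return ids of problems whose sampled pass rate lies in [low, high].

    `rollout_rewards` maps a problem id to the 0/1 rewards of the rollouts
    sampled for it from the current policy.
    """
    kept: List[str] = []
    for problem_id, rewards in rollout_rewards.items():
        if not rewards:
            continue  # no rollouts sampled, so no accuracy estimate
        rate = sum(rewards) / len(rewards)
        if low <= rate <= high:
            kept.append(problem_id)
    return kept


# Toy batch: one always-solved, one never-solved, one mid-difficulty problem.
toy_batch = {
    "easy": [1, 1, 1, 1],
    "hard": [0, 0, 0, 0],
    "mid": [1, 0, 1, 0],
}
toy_kept = balanced_filter(toy_batch)
```

In this toy batch only the problem solved in half of its rollouts survives the filter; the always-solved and never-solved problems have zero reward variance and are dropped.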

Title: BitHEP -- The Limits of Low-Precision ML in HEP

Authors: Claudius Krause, Daohan Wang, Ramon Winterhalder
Subjects: cs.LG, hep-ex, hep-ph
Abstract URL: https://arxiv.org/abs/2504.03387
Pdf URL: https://arxiv.org/pdf/2504.03387
Copy Paste: [[2504.03387]] BitHEP -- The Limits of Low-Precision ML in HEP(https://arxiv.org/abs/2504.03387)
Keywords: generative
Abstract: The increasing complexity of modern neural network architectures demands fast and memory-efficient implementations to mitigate computational bottlenecks. In this work, we evaluate the recently proposed BitNet architecture in HEP applications, assessing its performance in classification, regression, and generative modeling tasks. Specifically, we investigate its suitability for quark-gluon discrimination, SMEFT parameter estimation, and detector simulation, comparing its efficiency and accuracy to state-of-the-art methods. Our results show that while BitNet consistently performs competitively in classification tasks, its performance in regression and generation varies with the size and type of the network, highlighting key limitations and potential areas for improvement.

Title: Autonomous state-space segmentation for Deep-RL sparse reward scenarios

Authors: Gianluca Maselli, Vieri Giuliano Santucci
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.03420
Pdf URL: https://arxiv.org/pdf/2504.03420
Copy Paste: [[2504.03420]] Autonomous state-space segmentation for Deep-RL sparse reward scenarios(https://arxiv.org/abs/2504.03420)
Keywords: segmentation
Abstract: Dealing with environments with sparse rewards has always been crucial for systems developed to operate in autonomous open-ended learning settings. Intrinsic Motivations could be an effective way to help Deep Reinforcement Learning algorithms learn in such scenarios. In fact, intrinsic reward signals, such as novelty or curiosity, are generally adopted to improve exploration when extrinsic rewards are delayed or absent. Building on previous works, we tackle the problem of learning policies in the presence of sparse rewards by proposing a two-level architecture that alternates between an "intrinsically driven" phase of exploration and autonomous sub-goal generation and a phase of sparse-reward, goal-directed policy learning. The idea is to build several small networks, each one specialized on a particular sub-path, and use them as starting points for future exploration without the need to further explore previously learnt paths from scratch. Two versions of the system have been trained and tested in the Gym SuperMarioBros environment without considering any additional extrinsic reward. The results show the validity of our approach and the importance of autonomously segmenting the environment to generate an efficient path towards the final goal.

Title: DML-RAM: Deep Multimodal Learning Framework for Robotic Arm Manipulation using Pre-trained Models

Authors: Sathish Kumar, Swaroop Damodaran, Naveen Kumar Kuruba, Sumit Jha, Arvind Ramanathan
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2504.03423
Pdf URL: https://arxiv.org/pdf/2504.03423
Copy Paste: [[2504.03423]] DML-RAM: Deep Multimodal Learning Framework for Robotic Arm Manipulation using Pre-trained Models(https://arxiv.org/abs/2504.03423)
Keywords: robust, interpretability
Abstract: This paper presents a novel deep learning framework for robotic arm manipulation that integrates multimodal inputs using a late-fusion strategy. Unlike traditional end-to-end or reinforcement learning approaches, our method processes image sequences with pre-trained models and robot state data with machine learning algorithms, fusing their outputs to predict continuous action values for control. Evaluated on BridgeData V2 and Kuka datasets, the best configuration (VGG16 + Random Forest) achieved MSEs of 0.0021 and 0.0028, respectively, demonstrating strong predictive performance and robustness. The framework supports modularity, interpretability, and real-time decision-making, aligning with the goals of adaptive, human-in-the-loop cyber-physical systems.

Title: Locations of Characters in Narratives: Andersen and Persuasion Datasets

Authors: Batuhan Ozyurt, Roya Arkhmammadova, Deniz Yuret
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03434
Pdf URL: https://arxiv.org/pdf/2504.03434
Copy Paste: [[2504.03434]] Locations of Characters in Narratives: Andersen and Persuasion Datasets(https://arxiv.org/abs/2504.03434)
Keywords: fair, large language model
Abstract: The ability of machines to grasp spatial understanding within narrative contexts is an intriguing aspect of reading comprehension that continues to be studied. Motivated by the goal to test the AI's competence in understanding the relationship between characters and their respective locations in narratives, we introduce two new datasets: Andersen and Persuasion. For the Andersen dataset, we selected fifteen children's stories from "Andersen's Fairy Tales" by Hans Christian Andersen and manually annotated the characters and their respective locations throughout each story. Similarly, for the Persuasion dataset, characters and their locations in the novel "Persuasion" by Jane Austen were also manually annotated. We used these datasets to prompt Large Language Models (LLMs). The prompts are created by extracting excerpts from the stories or the novel and combining them with a question asking the location of a character mentioned in that excerpt. Out of the five LLMs we tested, the best-performing one for the Andersen dataset accurately identified the location in 61.85% of the examples, while for the Persuasion dataset, the best-performing one did so in 56.06% of the cases.

Title: ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving

Authors: Sheng Yang, Tong Zhan, Shichen Qiao, Jicheng Gong, Qing Yang, Yanfeng Lu, Jian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03438
Pdf URL: https://arxiv.org/pdf/2504.03438
Copy Paste: [[2504.03438]] ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving(https://arxiv.org/abs/2504.03438)
Keywords: transformer
Abstract: Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses 4D radar and vision modality. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information and the (dense) vision information. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module due to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. In typical traffic scenarios like the VoD (View-of-Delft) dataset, experiments show that, with reasonable inference speed, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest while remaining competitive in mAP over the entire area compared to the baseline methods, demonstrating performance close to LiDAR and greatly outperforming camera-only methods.

Title: Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models

Authors: Mirko Borszukovszki, Ivo Pascal de Jong, Matias Valdenegro-Toro
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03440
Pdf URL: https://arxiv.org/pdf/2504.03440
Copy Paste: [[2504.03440]] Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models(https://arxiv.org/abs/2504.03440)
Keywords: robust, large language model
Abstract: To leverage the full potential of Large Language Models (LLMs) it is crucial to have some information on their answers' uncertainty. This means that the model has to be able to quantify how certain it is in the correctness of a given response. Bad uncertainty estimates can lead to overconfident wrong answers undermining trust in these models. Quite a lot of research has been done on language models that work with text inputs and provide text outputs. Still, since the visual capabilities have been added to these models recently, there has not been much progress on the uncertainty of Visual Language Models (VLMs). We tested three state-of-the-art VLMs on corrupted image data. We found that the severity of the corruption negatively impacted the models' ability to estimate their uncertainty and the models also showed overconfidence in most of the experiments.

Title: Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection

Authors: Nasar Iqbal, Niki Martinel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03442
Pdf URL: https://arxiv.org/pdf/2504.03442
Copy Paste: [[2504.03442]] Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection(https://arxiv.org/abs/2504.03442)
Keywords: extraction, transformer
Abstract: Recent advances in convolutional neural networks (CNNs) and transformer-based methods have improved anomaly detection and localization, but challenges persist in precisely localizing small anomalies. While CNNs face limitations in capturing long-range dependencies, transformer architectures often suffer from substantial computational overheads. We introduce a state space model (SSM)-based Pyramidal Scanning Strategy (PSS) for multi-class anomaly detection and localization, a novel approach designed to address the challenge of small anomaly localization. Our method captures fine-grained details at multiple scales by integrating the PSS with a pre-trained encoder for multi-scale feature extraction and a feature-level synthetic anomaly generator. An improvement of $+1\%$ AP for multi-class anomaly localization and a $+1\%$ increase in AU-PRO on the MVTec benchmark demonstrate our method's superiority in precise anomaly localization across diverse industrial scenarios. The code is available at this https URL Mamba.

Title: D-Garment: Physics-Conditioned Latent Diffusion for Dynamic Garment Deformations

Authors: Antoine Dumoulin, Adnane Boukhayma, Laurence Boissieux, Bharath Bhushan Damodaran, Pierre Hellier, Stefanie Wuhrer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03468
Pdf URL: https://arxiv.org/pdf/2504.03468
Copy Paste: [[2504.03468]] D-Garment: Physics-Conditioned Latent Diffusion for Dynamic Garment Deformations(https://arxiv.org/abs/2504.03468)
Keywords: diffusion, generative
Abstract: Adjusting and deforming 3D garments to body shapes, body motion, and cloth material is an important problem in virtual and augmented reality. Applications are numerous, ranging from virtual change rooms to the entertainment and gaming industry. This problem is challenging as garment dynamics influence geometric details such as wrinkling patterns, which depend on physical input including the wearer's body shape and motion, as well as cloth material features. Existing work studies learning-based modeling techniques to generate garment deformations from example data, and physics-inspired simulators to generate realistic garment dynamics. We propose here a learning-based approach trained on data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations for loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion and cloth material. Furthermore, the model can be efficiently fitted to observations captured using vision sensors. We propose to leverage the capability of diffusion models to learn fine-scale detail: we model the 3D garment in a 2D parameter space, and learn a latent diffusion model using this representation independent from the mesh resolution. This makes it possible to condition global and local geometric information with body and material information. We quantitatively and qualitatively evaluate our method on both simulated data and data captured with a multi-view acquisition platform. Compared to strong baselines, our method is more accurate in terms of Chamfer distance.
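
The Chamfer distance mentioned in the abstract above compares two point sets by nearest-neighbour distances in both directions. The sketch below uses one common convention (mean unsquared Euclidean distance in each direction, summed); other conventions use squared distances or different normalisation, and the abstract does not say which variant the authors use. The function names are illustrative only.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest_distance(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Brute force, O(len(a) * len(b)); adequate for small illustrative inputs.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return _mean_nearest_distance(a, b) + _mean_nearest_distance(b, a)


# Toy example: two points each, one shared and one displaced by 1 along z.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
toy_chamfer = chamfer_distance(toy_a, toy_b)
```

Identical point sets give a distance of zero, and lower values indicate a closer geometric match between a predicted and a reference surface sampling.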

Title: Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis

Authors: Xi Wang, Ziqi He, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03471
Pdf URL: https://arxiv.org/pdf/2504.03471
Copy Paste: [[2504.03471]] Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis(https://arxiv.org/abs/2504.03471)
Keywords: diffusion, transformer
Abstract: Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of the blocks' importance during the inference process, which hinders their further exploitation to improve image applications. In this study, we first prove theoretically that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during the sampling process. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of the inference process and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: this https URL

Title: Multi-encoder nnU-Net outperforms Transformer models with self-supervised pretraining

Authors: Seyedeh Sahar Taheri Otaghsara, Reza Rahmanzadeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03474
Pdf URL: https://arxiv.org/pdf/2504.03474
Copy Paste: [[2504.03474]] Multi-encoder nnU-Net outperforms Transformer models with self-supervised pretraining(https://arxiv.org/abs/2504.03474)
Keywords: transformer, segmentation
Abstract: This study addresses the essential task of medical image segmentation, which involves the automatic identification and delineation of anatomical structures and pathological regions in medical images. Accurate segmentation is crucial in radiology, as it aids in the precise localization of abnormalities such as tumors, thereby enabling effective diagnosis, treatment planning, and monitoring of disease progression. Specifically, the size, shape, and location of tumors can significantly influence clinical decision-making and therapeutic strategies, making accurate segmentation a key component of radiological workflows. However, challenges posed by variations in MRI modalities, image artifacts, and the scarcity of labeled data complicate the segmentation task and impact the performance of traditional models. To overcome these limitations, we propose a novel self-supervised learning Multi-encoder nnU-Net architecture designed to process multiple MRI modalities independently through separate encoders. This approach allows the model to capture modality-specific features before fusing them for the final segmentation, thus improving accuracy. Our Multi-encoder nnU-Net demonstrates exceptional performance, achieving a Dice Similarity Coefficient (DSC) of 93.72%, which surpasses that of other models such as vanilla nnU-Net, SegResNet, and Swin UNETR. By leveraging the unique information provided by each modality, the model enhances segmentation tasks, particularly in scenarios with limited annotated data. Evaluations highlight the effectiveness of this architecture in improving tumor segmentation outcomes.
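
For readers unfamiliar with the Dice Similarity Coefficient reported above, here is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), on flat binary masks. The function name and the choice to return 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper.

```python
def dice_coefficient(pred_mask, target_mask):
    """Dice Similarity Coefficient for two flat binary masks of equal length."""
    if len(pred_mask) != len(target_mask):
        raise ValueError("masks must have the same length")

    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    # Total foreground voxels across the two masks.
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for t in target_mask if t)

    if size_sum == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / size_sum


if __name__ == "__main__":
    prediction = [1, 1, 0, 0, 1]
    ground_truth = [1, 0, 0, 0, 1]
    print(f"DSC = {dice_coefficient(prediction, ground_truth):.4f}")
```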

Title: ATM-Net: Anatomy-Aware Text-Guided Multi-Modal Fusion for Fine-Grained Lumbar Spine Segmentation

Authors: Sheng Lian, Dengfeng Pan, Jianlong Cai, Guang-Yong Chen, Zhun Zhong, Zhiming Luo, Shen Zhao, Shuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03476
Pdf URL: https://arxiv.org/pdf/2504.03476
Copy Paste: [[2504.03476]] ATM-Net: Anatomy-Aware Text-Guided Multi-Modal Fusion for Fine-Grained Lumbar Spine Segmentation(https://arxiv.org/abs/2504.03476)
Keywords: segmentation
Abstract: Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, an innovative framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are further integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements regarding class discrimination and segmentation details. For example, ATM-Net achieves Dice of 79.39% and HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.

Title: Probabilistic Machine Learning for Noisy Labels in Earth Observation

Authors: Spyros Kondylatos, Nikolaos Ioannis Bountos, Ioannis Prapas, Angelos Zavras, Gustau Camps-Valls, Ioannis Papoutsis
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2504.03478
Pdf URL: https://arxiv.org/pdf/2504.03478
Copy Paste: [[2504.03478]] Probabilistic Machine Learning for Noisy Labels in Earth Observation(https://arxiv.org/abs/2504.03478)
Keywords: robust, interpretability
Abstract: Label noise poses a significant challenge in Earth Observation (EO), often degrading the performance and reliability of supervised Machine Learning (ML) models. Yet, given the critical nature of several EO applications, developing robust and trustworthy ML solutions is essential. In this study, we take a step in this direction by leveraging probabilistic ML to model input-dependent label noise and quantify data uncertainty in EO tasks, accounting for the unique noise sources inherent in the domain. We train uncertainty-aware probabilistic models across a broad range of high-impact EO applications, spanning diverse noise sources, input modalities, and ML configurations, and introduce a dedicated pipeline to assess their accuracy and reliability. Our experimental results show that the uncertainty-aware models consistently outperform the standard deterministic approaches across most datasets and evaluation metrics. Moreover, through rigorous uncertainty evaluation, we validate the reliability of the predicted uncertainty estimates, enhancing the interpretability of model predictions. Our findings emphasize the importance of modeling label noise and incorporating uncertainty quantification in EO, paving the way for more accurate, reliable, and trustworthy ML solutions in the field.

Title: Online Traffic Density Estimation using Physics-Informed Neural Networks

Authors: Dennis Wilkman, Kateryna Morozovska, Karl Henrik Johansson, Matthieu Barreau
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2504.03483
Pdf URL: https://arxiv.org/pdf/2504.03483
Copy Paste: [[2504.03483]] Online Traffic Density Estimation using Physics-Informed Neural Networks(https://arxiv.org/abs/2504.03483)
Keywords: robust
Abstract: Recent work on the application of Physics-Informed Neural Networks to traffic density estimation has shown promise for future developments, owing to the robustness of these networks to model errors and noisy data. In this paper, we introduce a methodology for online approximation of the traffic density using measurements from probe vehicles in two settings: one using the Greenshield model and the other considering a high-fidelity traffic simulation. The proposed method continuously estimates the real-time traffic density in space and performs model identification with each new set of measurements. The density estimate is updated in almost real time using gradient descent and adaptive weights. In the case of full model knowledge, the resulting algorithm has similar performance to the classical open-loop one. However, in the case of model mismatch, the iterative solution behaves as a closed-loop observer and outperforms the baseline method. Similarly, in the high-fidelity setting, the proposed algorithm correctly reproduces the traffic characteristics.
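
As background for the first setting, the classical Greenshields relation (as the model is commonly spelled) assumes that speed falls linearly with density, v(rho) = v_free * (1 - rho / rho_max), so the flux is q(rho) = rho * v(rho). The sketch below only illustrates that textbook relation; the parameter values are placeholders, and the paper's exact formulation is not given in the abstract.

```python
def greenshields_speed(density, max_density, free_speed):
    """Speed under the linear Greenshields speed-density relation."""
    if not 0.0 <= density <= max_density:
        raise ValueError("density must lie in [0, max_density]")
    return free_speed * (1.0 - density / max_density)


def greenshields_flux(density, max_density, free_speed):
    """Traffic flux q = density * speed (vehicles per unit time)."""
    return density * greenshields_speed(density, max_density, free_speed)


if __name__ == "__main__":
    # Placeholder values: 100 veh/km jam density, 30 m/s free-flow speed.
    for rho in (0.0, 25.0, 50.0, 75.0, 100.0):
        print(rho, greenshields_flux(rho, max_density=100.0, free_speed=30.0))
```

Under this relation the flux is a concave parabola in the density, peaking at half the jam density.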

Title: Discovering Partially Known Ordinary Differential Equations: a Case Study on the Chemical Kinetics of Cellulose Degradation

Authors: Federica Bragone, Kateryna Morozovska, Tor Laneryd, Khemraj Shukla, Stefano Markidis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03484
Pdf URL: https://arxiv.org/pdf/2504.03484
Copy Paste: [[2504.03484]] Discovering Partially Known Ordinary Differential Equations: a Case Study on the Chemical Kinetics of Cellulose Degradation(https://arxiv.org/abs/2504.03484)
Keywords: transformer
Abstract: The degree of polymerization (DP) is one of the indicators used to estimate the aging of polymer-based insulation systems, such as cellulose insulation in power components. The main degradation mechanisms in polymers are hydrolysis, pyrolysis, and oxidation. Combined, these mechanisms cause a reduction of the DP. However, data for these types of problems are usually scarce. This study analyzes insulation aging using cellulose degradation data from power transformers. The aging problem for the cellulose immersed in mineral oil inside power transformers is modeled with ordinary differential equations (ODEs). We recover the governing equations of the degradation system using Physics-Informed Neural Networks (PINNs) and symbolic regression. We apply PINNs to discover the Arrhenius equation's unknown parameters in the Ekenstam ODE describing cellulose contamination content and the material aging process related to temperature for synthetic data and real DP values. A modification of the Ekenstam ODE is given by Emsley's system of ODEs, where the rate constant expressed by the Arrhenius equation decreases in time with the new formulation. We use PINNs and symbolic regression to recover the functional form of one of the ODEs of the system and to identify an unknown parameter.
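
To make the quantities named above concrete, the Ekenstam relation is commonly written as 1/DP(t) - 1/DP(0) = k t, with the rate constant given by the Arrhenius equation k = A exp(-Ea / (R T)). The sketch below evaluates this closed form. It is not the authors' PINN pipeline, the function names are hypothetical, and the pre-exponential factor and activation energy are placeholder values chosen only for illustration.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol K)


def arrhenius_rate(pre_exponential, activation_energy, temperature_kelvin):
    """Arrhenius rate constant k = A * exp(-Ea / (R * T))."""
    return pre_exponential * math.exp(
        -activation_energy / (GAS_CONSTANT * temperature_kelvin)
    )


def dp_ekenstam(dp_initial, rate, time):
    """Degree of polymerization from 1/DP(t) - 1/DP(0) = k * t."""
    return 1.0 / (1.0 / dp_initial + rate * time)


if __name__ == "__main__":
    # Placeholder parameters, not values from the paper.
    k = arrhenius_rate(pre_exponential=1.0e8,
                       activation_energy=1.1e5,
                       temperature_kelvin=363.15)
    for hours in (0.0, 1.0e3, 1.0e4, 1.0e5):
        print(hours, dp_ekenstam(dp_initial=1000.0, rate=k, time=hours))
```

A higher temperature yields a larger k and therefore a faster drop in DP, which is the temperature dependence the abstract refers to.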

Title: BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution

Authors: Zihao He, Shengchuan Zhang, Runze Hu, Yunhang Shen, Yan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03490
Pdf URL: https://arxiv.org/pdf/2504.03490
Copy Paste: [[2504.03490]] BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution(https://arxiv.org/abs/2504.03490)
Keywords: robust, diffusion
Abstract: Super-resolution (SR) techniques are critical for enhancing image quality, particularly in scenarios where high-resolution imagery is essential yet limited by hardware constraints. Existing diffusion models for SR have relied predominantly on Gaussian models for noise generation, which often fall short when dealing with the complex and variable texture inherent in natural scenes. To address these deficiencies, we introduce the Bayesian Uncertainty Guided Diffusion Probabilistic Model (BUFF). BUFF distinguishes itself by incorporating a Bayesian network to generate high-resolution uncertainty masks. These masks guide the diffusion process, allowing for the adjustment of noise intensity in a manner that is both context-aware and adaptive. This novel approach not only enhances the fidelity of super-resolved images to their original high-resolution counterparts but also significantly mitigates artifacts and blurring in areas characterized by complex textures and fine details. The model demonstrates exceptional robustness against complex noise patterns and showcases superior adaptability in handling textures and edges within images. Empirical evidence, supported by visual results, illustrates the model's robustness, especially in challenging scenarios, and its effectiveness in addressing common SR issues such as blurring. Experimental evaluations conducted on the DIV2K dataset reveal that BUFF achieves a notable improvement, with a +0.61 increase compared to baseline in SSIM on BSD100, surpassing traditional diffusion approaches by an average additional +0.20dB PSNR gain. These findings underscore the potential of Bayesian methods in enhancing diffusion processes for SR, paving the way for future advancements in the field.

Title: Diffusion Active Learning: Towards Data-Driven Experimental Design in Computed Tomography

Authors: Luis Barba, Johannes Kirschner, Tomas Aidukas, Manuel Guizar-Sicairos, Benjamín Béjar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03491
Pdf URL: https://arxiv.org/pdf/2504.03491
Copy Paste: [[2504.03491]] Diffusion Active Learning: Towards Data-Driven Experimental Design in Computed Tomography(https://arxiv.org/abs/2504.03491)
Keywords: diffusion, generative
Abstract: We introduce Diffusion Active Learning, a novel approach that combines generative diffusion modeling with data-driven sequential experimental design to adaptively acquire data for inverse problems. Although broadly applicable, we focus on scientific computed tomography (CT) for experimental validation, where structured prior datasets are available, and reducing data requirements directly translates to shorter measurement times and lower X-ray doses. We first pre-train an unconditional diffusion model on domain-specific CT reconstructions. The diffusion model acts as a learned prior that is data-dependent and captures the structure of the underlying data distribution, which is then used in two ways: It drives the active learning process and also improves the quality of the reconstructions. During the active learning loop, we employ a variant of diffusion posterior sampling to generate conditional data samples from the posterior distribution, ensuring consistency with the current measurements. Using these samples, we quantify the uncertainty in the current estimate to select the most informative next measurement. Our results show substantial reductions in data acquisition requirements, corresponding to lower X-ray doses, while simultaneously improving image reconstruction quality across multiple real-world tomography datasets.

Title: Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems

Authors: Alexander Windmann, Henrik Steude, Daniel Boschmann, Oliver Niggemann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03494
Pdf URL: https://arxiv.org/pdf/2504.03494
Copy Paste: [[2504.03494]] Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems(https://arxiv.org/abs/2504.03494)
Keywords: robust
Abstract: Cyber-Physical Systems (CPS) in domains such as manufacturing and energy distribution generate complex time series data crucial for Prognostics and Health Management (PHM). While Deep Learning (DL) methods have demonstrated strong forecasting capabilities, their adoption in industrial CPS remains limited due to insufficient robustness. Existing robustness evaluations primarily focus on formal verification or adversarial perturbations, inadequately representing the complexities encountered in real-world CPS scenarios. To address this, we introduce a practical robustness definition grounded in distributional robustness, explicitly tailored to industrial CPS, and propose a systematic framework for robustness evaluation. Our framework simulates realistic disturbances, such as sensor drift, noise, and irregular sampling, enabling thorough robustness analyses of forecasting models on real-world CPS datasets. The robustness definition provides a standardized score to quantify and compare model performance across diverse datasets, assisting in informed model selection and architecture design. Through extensive empirical studies evaluating prominent DL architectures (including recurrent, convolutional, attention-based, modular, and structured state-space models), we demonstrate the applicability and effectiveness of our approach. We publicly release our robustness benchmark to encourage further research and reproducibility.

Title: Hierarchical Knowledge Structuring for Effective Federated Learning in Heterogeneous Environments

Authors: Wai Fong Tam, Qilei Li, Ahmed M. Abdelmonie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03505
Pdf URL: https://arxiv.org/pdf/2504.03505
Copy Paste: [[2504.03505]] Hierarchical Knowledge Structuring for Effective Federated Learning in Heterogeneous Environments(https://arxiv.org/abs/2504.03505)
Keywords: privacy, robust, federate
Abstract: Federated learning enables collaborative model training across distributed entities while maintaining individual data privacy. A key challenge in federated learning is balancing the personalization of models for local clients with generalization for the global model. Recent efforts leverage logit-based knowledge aggregation and distillation to overcome these issues. However, due to the non-IID nature of data across diverse clients and the imbalance in the client's data distribution, directly aggregating the logits often produces biased knowledge that fails to apply to individual clients and obstructs the convergence of local training. To solve this issue, we propose a Hierarchical Knowledge Structuring (HKS) framework that formulates sample logits into a multi-granularity codebook to represent logits from personalized per-sample insights to globalized per-class knowledge. The unsupervised bottom-up clustering method is leveraged to enable the global server to provide multi-granularity responses to local clients. These responses allow local training to integrate supervised learning objectives with global generalization constraints, which results in more robust representations and improved knowledge sharing in subsequent training rounds. The proposed framework's effectiveness is validated across various benchmarks and model architectures.

Title: FADConv: A Frequency-Aware Dynamic Convolution for Farmland Non-agriculturalization Identification and Segmentation

Authors: Tan Shu, Li Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03510
Pdf URL: https://arxiv.org/pdf/2504.03510
Copy Paste: [[2504.03510]] FADConv: A Frequency-Aware Dynamic Convolution for Farmland Non-agriculturalization Identification and Segmentation(https://arxiv.org/abs/2504.03510)
Keywords: security, segmentation
Abstract: Cropland non-agriculturalization refers to the conversion of arable land into non-agricultural uses such as forests, residential areas, and construction sites. This phenomenon not only directly leads to the loss of cropland resources but also poses systemic threats to food security and agricultural sustainability. Accurate identification of cropland and non-cropland areas is crucial for detecting and addressing this issue. Traditional CNNs employ static convolution layers, while dynamic convolution studies demonstrate that adaptively weighting multiple convolutional kernels through attention mechanisms can enhance accuracy. However, existing dynamic convolution methods relying on Global Average Pooling (GAP) for attention weight allocation suffer from information loss, limiting segmentation precision. This paper proposes Frequency-Aware Dynamic Convolution (FADConv) and a Frequency Attention (FAT) module to address these limitations. Building upon the foundational structure of dynamic convolution, we design FADConv by integrating the 2D Discrete Cosine Transform (2D DCT) to capture frequency domain features and fuse them. The FAT module generates high-quality attention weights that replace the traditional GAP method, making the combination between dynamic convolution kernels more [...]. Experiments on the GID and Hi-CNA datasets demonstrate that FADConv significantly improves segmentation accuracy with minimal computational overhead. For instance, ResNet18 with FADConv achieves 1.9% and 2.7% increases in F1-score and IoU for cropland segmentation on GID, with only 58.87M additional MAdds. Compared to other dynamic convolution approaches, FADConv exhibits superior performance in cropland segmentation tasks.

Title: Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles

Authors: Chen Wei Kuo, Kevin Chu, Nouar AlDahoul, Hazem Ibrahim, Talal Rahwan, Yasir Zaki
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.03520
Pdf URL: https://arxiv.org/pdf/2504.03520
Copy Paste: [[2504.03520]] Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles(https://arxiv.org/abs/2504.03520)
Keywords: fair, large language model
Abstract: Bias in news reporting significantly impacts public perception, particularly regarding crime, politics, and societal issues. Traditional bias detection methods, predominantly reliant on human moderation, suffer from subjective interpretations and scalability constraints. Here, we introduce an AI-driven framework leveraging advanced large language models (LLMs), specifically GPT-4o, GPT-4o Mini, Gemini Pro, Gemini Flash, Llama 8B, and Llama 3B, to systematically identify and mitigate biases in news articles. To this end, we collect an extensive dataset consisting of over 30,000 crime-related articles from five politically diverse news sources spanning a decade (2013-2023). Our approach employs a two-stage methodology: (1) bias detection, where each LLM scores and justifies biased content at the paragraph level, validated through human evaluation for ground truth establishment, and (2) iterative debiasing using GPT-4o Mini, verified by both automated reassessment and human reviewers. Empirical results indicate GPT-4o Mini's superior accuracy in bias detection and effectiveness in debiasing. Furthermore, our analysis reveals temporal and geographical variations in media bias correlating with socio-political dynamics and real-world events. This study contributes to scalable computational methodologies for bias mitigation, promoting fairness and accountability in news reporting.

Title: HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Authors: Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Guan Huang, Lihong Liu, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03536
Pdf URL: https://arxiv.org/pdf/2504.03536
Copy Paste: [[2504.03536]] HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration(https://arxiv.org/abs/2504.03536)
Keywords: generative
Abstract: Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation to provide initial geometry and appearance priority. Building upon this foundation, HumanFixer is trained to restore 3DGS renderings, which guarantees photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances geometric details and identity consistency across multiple views. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.

Title: Agentic Knowledgeable Self-awareness

Authors: Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2504.03553
Pdf URL: https://arxiv.org/pdf/2504.03553
Copy Paste: [[2504.03553]] Agentic Knowledgeable Self-awareness(https://arxiv.org/abs/2504.03553)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making, that is, the ability to dynamically assess situational demands and strategically employ resources. To address this gap, we propose agentic knowledgeable self-awareness, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at this https URL.

Title: PF3Det: A Prompted Foundation Feature Assisted Visual LiDAR 3D Detector

Authors: Kaidong Li, Tianxiao Zhang, Kuan-Chuan Peng, Guanghui Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03563
Pdf URL: https://arxiv.org/pdf/2504.03563
Copy Paste: [[2504.03563]] PF3Det: A Prompted Foundation Feature Assisted Visual LiDAR 3D Detector(https://arxiv.org/abs/2504.03563)
Keywords: robust
Abstract: 3D object detection is crucial for autonomous driving, leveraging both LiDAR point clouds for precise depth information and camera images for rich semantic information. Multi-modal methods that combine both modalities therefore offer more robust detection results. However, efficiently fusing LiDAR points and images remains challenging due to the domain gaps. In addition, the performance of many models is limited by the amount of high-quality labeled data, which is expensive to create. Recent advances in foundation models, which use large-scale pre-training on different modalities, enable better multi-modal fusion. Combining prompt engineering techniques for efficient training, we propose the Prompted Foundational 3D Detector (PF3Det), which integrates foundation model encoders and soft prompts to enhance LiDAR-camera feature fusion. PF3Det achieves state-of-the-art results under limited training data, improving NDS by 1.19% and mAP by 2.42% on the nuScenes dataset, demonstrating its efficiency in 3D detection.

Title: Scalable Hypergraph Structure Learning with Diverse Smoothness Priors

Authors: Benjamin T. Brown, Haoxiang Zhang, Daniel L. Lau, Gonzalo R. Arce
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.03583
Pdf URL: https://arxiv.org/pdf/2504.03583
Copy Paste: [[2504.03583]] Scalable Hypergraph Structure Learning with Diverse Smoothness Priors(https://arxiv.org/abs/2504.03583)
Keywords: robust
Abstract: In graph signal processing, learning the weighted connections between nodes from a set of sample signals is a fundamental task when the underlying relationships are not known a priori. This task is typically addressed by finding a graph Laplacian on which the observed signals are smooth. With the extension of graphs to hypergraphs (where edges can connect more than two nodes), graph learning methods have similarly been generalized to hypergraphs. However, the absence of a unified framework for calculating total variation has led to divergent definitions of smoothness and, consequently, differing approaches to hyperedge recovery. We confront this challenge by generalizing several previously proposed hypergraph total variations, which allows them to be substituted easily into a vector-based optimization. To this end, we propose a novel hypergraph learning method that recovers a hypergraph topology from time-series signals based on a smoothness prior. Our approach addresses key limitations in prior works, such as hyperedge selection and convergence issues, by formulating the problem as a convex optimization solved via a forward-backward-forward algorithm, which guarantees convergence. Additionally, we introduce a process that simultaneously limits the span of the hyperedge search and maintains a valid hyperedge selection set. In doing so, our method becomes scalable in increasingly complex network structures. The experimental results demonstrate improved performance, in terms of accuracy, over other state-of-the-art hypergraph inference methods; furthermore, we empirically show our method to be robust to total variation terms, biased towards global smoothness, and scalable to larger hypergraphs.

Title: EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline

Authors: Peter Baile Chen, Tomer Wolfson, Michael Cafarella, Dan Roth
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.03598
Pdf URL: https://arxiv.org/pdf/2504.03598
Copy Paste: [[2504.03598]] EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline(https://arxiv.org/abs/2504.03598)
Keywords: large language model
Abstract: Existing information retrieval systems excel in cases where the language of target documents closely matches that of the user query. However, real-world retrieval systems are often required to implicitly reason whether a document is relevant. For example, when retrieving technical texts or tables, their relevance to the user query may be implied through a particular jargon or structure, rather than explicitly expressed in their content. Large language models (LLMs) hold great potential in identifying such implied relevance by leveraging their reasoning skills. Nevertheless, current LLM-augmented retrieval is hindered by high latency and computation cost, as the LLM typically computes the query-document relevance online, for every query anew. To tackle this issue, we introduce EnrichIndex, a retrieval approach which instead uses the LLM offline to build semantically-enriched retrieval indices, by performing a single pass over all documents in the retrieval corpus at ingestion time. Furthermore, the semantically-enriched indices can complement existing online retrieval approaches, boosting the performance of LLM re-rankers. We evaluated EnrichIndex on five retrieval tasks, involving passages and tables, and found that it outperforms strong online LLM-based retrieval systems, with an average improvement of 11.7 points in recall@10 and 10.6 points in NDCG@10 compared to strong baselines. In terms of online calls to the LLM, it processes 293.3 times fewer tokens, which greatly reduces the online latency and cost. Overall, EnrichIndex is an effective way to build better retrieval indices offline by leveraging the strong reasoning skills of LLMs.
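
The two metrics reported above, recall at 10 and NDCG at 10, can be sketched as follows for the simple case of binary relevance. This is an illustrative implementation under stated assumptions (unique document ids, binary relevance labels, log base 2 discount); the paper's evaluation code may differ, and the function names are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Normalized discounted cumulative gain at k with binary relevance."""
    relevant = set(relevant_ids)
    # DCG: each relevant hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal DCG: all relevant documents ranked first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["d3", "d7", "d1", "d9"]
    gold = {"d3", "d1"}
    print(recall_at_k(ranking, gold, k=2), ndcg_at_k(ranking, gold, k=2))
```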

Title: Robust Human Registration with Body Part Segmentation on Noisy Point Clouds

Authors: Kai Lascheit, Daniel Barath, Marc Pollefeys, Leonidas Guibas, Francis Engelmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03602
Pdf URL: https://arxiv.org/pdf/2504.03602
Copy Paste: [[2504.03602]] Robust Human Registration with Body Part Segmentation on Noisy Point Clouds(https://arxiv.org/abs/2504.03602)
Keywords: robust, segmentation
Abstract: Registering human meshes to 3D point clouds is essential for applications such as augmented reality and human-robot interaction but often yields imprecise results due to noise and background clutter in real-world data. We introduce a hybrid approach that incorporates body-part segmentation into the mesh fitting process, enhancing both human pose estimation and segmentation accuracy. Our method first assigns body part labels to individual points, which then guide a two-step SMPL-X fitting: initial pose and orientation estimation using body part centroids, followed by global refinement of the point cloud alignment. Additionally, we demonstrate that the fitted human mesh can refine body part labels, leading to improved segmentation. Evaluations on the cluttered and noisy real-world datasets InterCap, EgoBody, and BEHAVE show that our approach significantly outperforms prior methods in both pose estimation and segmentation accuracy. Code and results are available on our project website.

Title: Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal

Authors: Yuyang Hu, Suhas Lohit, Ulugbek S. Kamilov, Tim K. Marks
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03607
Pdf URL: https://arxiv.org/pdf/2504.03607
Copy Paste: [[2504.03607]] Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal(https://arxiv.org/abs/2504.03607)
Keywords: diffusion
Abstract: Deep learning has achieved some success in addressing the challenge of cloud removal in optical satellite images, by fusing with synthetic aperture radar (SAR) images. Recently, diffusion models have emerged as powerful tools for cloud removal, delivering higher-quality estimation by sampling from cloud-free distributions, compared to earlier methods. However, diffusion models initiate sampling from pure Gaussian noise, which complicates the sampling trajectory and results in suboptimal performance. Also, current methods fall short in effectively fusing SAR and optical data. To address these limitations, we propose Diffusion Bridges for Cloud Removal, DB-CR, which directly bridges between the cloudy and cloud-free image distributions. In addition, we propose a novel multimodal diffusion bridge architecture with a two-branch backbone for multimodal image restoration, incorporating an efficient backbone and dedicated cross-modality fusion blocks to effectively extract and fuse features from synthetic aperture radar (SAR) and optical images. By formulating cloud removal as a diffusion-bridge problem and leveraging this tailored architecture, DB-CR achieves high-fidelity results while being computationally efficient. We evaluated DB-CR on the SEN12MS-CR cloud-removal dataset, demonstrating that it achieves state-of-the-art results.

Title: AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset

Authors: Bingxiang He, Wenbin Zhang, Jiaxi Song, Cheng Qian, Zixuan Fu, Bowen Sun, Ning Ding, Haiwen Hong, Longtao Huang, Hui Xue, Ganqu Cui, Wanxiang Che, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03612
Pdf URL: https://arxiv.org/pdf/2504.03612
Copy Paste: [[2504.03612]] AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset(https://arxiv.org/abs/2504.03612)
Keywords: generative, large language model
Abstract: Preference learning is critical for aligning large language models (LLMs) with human values, yet its success hinges on high-quality datasets comprising three core components: Preference Annotations, Instructions, and Response Pairs. Current approaches conflate these components, obscuring their individual impacts and hindering systematic optimization. In this work, we propose AIR, a component-wise analysis framework that systematically isolates and optimizes each component while evaluating their synergistic effects. Through rigorous experimentation, AIR reveals actionable principles: annotation simplicity (point-wise generative scoring), instruction inference stability (variance-based filtering across LLMs), and response pair quality (moderate margins + high absolute scores). When combined, these principles yield an average gain of +5.3 over the baseline method, even with only 14k high-quality pairs. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization, offering a blueprint for efficient, reproducible alignment.

Title: Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution

Authors: Aref Azizpour, Tai D. Nguyen, Matthew C. Stamm
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03615
Pdf URL: https://arxiv.org/pdf/2504.03615
Copy Paste: [[2504.03615]] Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution(https://arxiv.org/abs/2504.03615)
Keywords: robust, generative
Abstract: Rapid advances in generative AI have enabled the creation of highly realistic synthetic images, which, while beneficial in many domains, also pose serious risks in terms of disinformation, fraud, and other malicious applications. Current synthetic image identification systems are typically static, relying on feature representations learned from known generators; as new generative models emerge, these systems suffer from severe performance degradation. In this paper, we introduce the concept of an autonomous self-adaptive synthetic media identification system -- one that not only detects synthetic images and attributes them to known sources but also autonomously identifies and incorporates novel generators without human intervention. Our approach leverages an open-set identification strategy with an evolvable embedding space that distinguishes between known and unknown sources. By employing an unsupervised clustering method to aggregate unknown samples into high-confidence clusters and continuously refining its decision boundaries, our system maintains robust detection and attribution performance even as the generative landscape evolves. Extensive experiments demonstrate that our method significantly outperforms existing approaches, marking a crucial step toward universal, adaptable forensic systems in the era of rapidly advancing generative models.

Title: Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task

Authors: Leonardo Ranaldi, Barry Haddow, Alexandra Birch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03616
Pdf URL: https://arxiv.org/pdf/2504.03616
Copy Paste: [[2504.03616]] Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task(https://arxiv.org/abs/2504.03616)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
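
The three strategies named in this abstract differ mainly in where translation happens in the pipeline. The sketch below only shows that ordering. The `translate`, `retrieve`, and `generate` callables are hypothetical placeholders supplied by the caller, not the authors' components, and the sketch assumes CrossRAG retrieves with the original question before translating the retrieved documents.

```python
def trag(question, translate, retrieve, generate):
    """tRAG: translate the question into English, then retrieve and generate."""
    english_question = translate(question, "en")
    documents = retrieve(english_question)
    return generate(english_question, documents)


def multirag(question, translate, retrieve, generate):
    """MultiRAG: retrieve directly across languages; nothing is translated.

    `translate` is unused and kept only so the three functions share a signature.
    """
    documents = retrieve(question)
    return generate(question, documents)


def crossrag(question, translate, retrieve, generate, pivot="en"):
    """CrossRAG: retrieve across languages, then translate the retrieved
    documents into one common language before generating the response."""
    documents = retrieve(question)
    pivot_documents = [translate(doc, pivot) for doc in documents]
    return generate(question, pivot_documents)
```

Passing the components in as callables keeps the comparison about ordering only: the same retriever and generator can be reused for all three variants.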

Title: VISTA-OCR: Towards generative and interactive end to end OCR models

Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03621
Pdf URL: https://arxiv.org/pdf/2504.03621
Copy Paste: [[2504.03621]] VISTA-OCR: Towards generative and interactive end to end OCR models(https://arxiv.org/abs/2504.03621)
Keywords: extraction, transformer, generative, large language model
Abstract: We introduce VISTA-OCR (Vision and Spatially-aware Text Analysis OCR), a lightweight architecture that unifies text detection and recognition within a single generative model. Unlike conventional methods that require separate branches with dedicated parameters for text recognition and detection, our approach leverages a Transformer decoder to sequentially generate text transcriptions and their spatial coordinates in a unified branch. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase, followed by multitask learning with multimodal token generation. To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during training. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. Although recent Vision Large Language Models (VLLMs) can efficiently perform these tasks, their high computational cost remains a barrier for practical deployment. In contrast, our VISTA$_{\text{omni}}$ variant processes both handwritten and printed documents with only 150M parameters, interactively, by prompting. Extensive experiments on multiple datasets demonstrate that VISTA-OCR achieves better performance compared to state-of-the-art specialized models on standard OCR tasks while showing strong potential for more sophisticated OCR applications, addressing the growing need for interactive OCR systems. All code and annotations for VISTA-OCR will be made publicly available upon acceptance.

Title: Align to Structure: Aligning Large Language Models with Structural Information

Authors: Zae Myung Kim, Anand Ramachandran, Farideh Tavazoee, Joo-Kyung Kim, Oleg Rokhlenko, Dongyeop Kang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03622
Pdf URL: https://arxiv.org/pdf/2504.03622
Copy Paste: [[2504.03622]] Align to Structure: Aligning Large Language Models with Structural Information(https://arxiv.org/abs/2504.03622)
Keywords: large language model
Abstract: Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared.

Title: Quantifying the uncertainty of model-based synthetic image quality metrics

Authors: Ciaran Bench, Spencer A. Thomas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03623
Pdf URL: https://arxiv.org/pdf/2504.03623
Copy Paste: [[2504.03623]] Quantifying the uncertainty of model-based synthetic image quality metrics(https://arxiv.org/abs/2504.03623)
Keywords: diffusion
Abstract: The quality of synthetically generated images (e.g. those produced by diffusion models) is often evaluated using information about image contents encoded by pretrained auxiliary models. For example, the Fréchet Inception Distance (FID) uses embeddings from an InceptionV3 model pretrained to classify ImageNet. The effectiveness of this feature embedding model has considerable impact on the trustworthiness of the calculated metric (affecting its suitability in several domains, including medical imaging). Here, uncertainty quantification (UQ) is used to provide a heuristic measure of the trustworthiness of the feature embedding model and an FID-like metric called the Fréchet Autoencoder Distance (FAED). We apply Monte Carlo dropout to a feature embedding model (convolutional autoencoder) to model the uncertainty in its embeddings. The distribution of embeddings for each input is then used to compute a distribution of FAED values. We express uncertainty as the predictive variance of the embeddings as well as the standard deviation of the computed FAED values. We find that their magnitude correlates with the extent to which the inputs are out-of-distribution to the model's training data, providing some validation of its ability to assess the trustworthiness of the FAED.
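
The idea of turning repeated stochastic embedding passes into a distribution of Fréchet distances can be sketched as follows. This is a simplified illustration under loud assumptions: covariances are treated as diagonal (the full FID/FAED formula uses a matrix square root of full covariances), and `dropout_embed` is a toy stand-in that applies dropout to the raw input vector rather than running a convolutional autoencoder. All function names are hypothetical.

```python
import math
import random
import statistics


def diag_gaussian_stats(embeddings):
    """Per-dimension mean and (population) variance of equal-length vectors."""
    dims = list(zip(*embeddings))
    means = [statistics.fmean(d) for d in dims]
    variances = [statistics.pvariance(d) for d in dims]
    return means, variances


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum_i (sigma1_i - sigma2_i)^2."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


def dropout_embed(x, rng, p=0.2):
    """Toy stochastic 'encoder' pass (assumption): inverted dropout on the input."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def distance_distribution(real, synthetic, passes=20, seed=0):
    """Run several stochastic passes and summarize the resulting distances."""
    rng = random.Random(seed)
    distances = []
    for _ in range(passes):
        real_emb = [dropout_embed(x, rng) for x in real]
        syn_emb = [dropout_embed(x, rng) for x in synthetic]
        distances.append(
            frechet_distance_diag(
                *diag_gaussian_stats(real_emb), *diag_gaussian_stats(syn_emb)
            )
        )
    # The mean is the point estimate; the standard deviation is the uncertainty.
    return statistics.fmean(distances), statistics.stdev(distances)
```

A larger standard deviation across passes indicates that the distance estimate is less stable for those inputs, which is the kind of signal the abstract relates to out-of-distribution data.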

Title: Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

Authors: NVIDIA: Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey, Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen, Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle, Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Markus Kliegl
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03624
Pdf URL: https://arxiv.org/pdf/2504.03624
Copy Paste: [[2504.03624]] Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models(https://arxiv.org/abs/2504.03624)
Keywords: transformer
Abstract: As inference-time scaling becomes critical for enhanced reasoning capabilities, it is becoming increasingly important to build models that are efficient at inference. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster at inference. In addition, we introduce an FP8-based training recipe and show that it can achieve on-par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.

Title: MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Authors: Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03641
Pdf URL: https://arxiv.org/pdf/2504.03641
Copy Paste: [[2504.03641]] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models(https://arxiv.org/abs/2504.03641)
Keywords: robust, fair
Abstract: Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data are publicly available.