2025-04-14

Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Authors: Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
Subjects: cs.CV, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2504.07981
Pdf URL: https://arxiv.org/pdf/2504.07981
Copy Paste: [[2504.07981]] ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use(https://arxiv.org/abs/2504.07981)
Keywords: large language model
Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at this https URL.
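
Illustration (not from the paper): the abstract reports that reducing the search area improves grounding accuracy. Any crop-based search then has to map a point predicted inside a crop back to full-screenshot coordinates. The sketch below shows only that bookkeeping. The function names, the box layout, and the point-inside-box hit criterion are assumptions made for this example; this is not the ScreenSeekeR implementation.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels; assumed layout
Point = Tuple[int, int]          # (x, y) in pixels


def crop_to_screen(point: Point, crop: Box) -> Point:
    """Map a point predicted inside a crop back to full-screenshot coordinates."""
    x, y = point
    left, top, _, _ = crop
    return (left + x, top + y)


def hits_target(point: Point, target: Box) -> bool:
    """Assumed hit criterion: the point lies inside the target's bounding box."""
    x, y = point
    left, top, right, bottom = target
    return left <= x <= right and top <= y <= bottom


# Hypothetical numbers: a crop from the top-right of a 3840x2160 screenshot.
crop = (2560, 0, 3840, 720)
local_prediction = (100, 40)        # coordinates relative to the crop
target = (2650, 30, 2680, 50)       # small icon, in full-screenshot coordinates

screen_point = crop_to_screen(local_prediction, crop)
hit = hits_target(screen_point, target)
```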

Title: Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT

Authors: Harishwar Reddy, Madhusudan Srinivasan, Upulee Kanewala
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07982
Pdf URL: https://arxiv.org/pdf/2504.07982
Copy Paste: [[2504.07982]] Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT(https://arxiv.org/abs/2504.07982)
Keywords: robust, fair, large language model
Abstract: Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues, often reflecting biases inherent in their training data. These biases pose risks, particularly when LLMs are deployed in sensitive areas such as healthcare, finance, and law. This paper introduces a metamorphic testing (MT) approach to systematically identify fairness bugs in LLMs. We define and apply a set of fairness-oriented metamorphic relations (MRs) to assess the LLaMA and GPT models, both state-of-the-art LLMs, across diverse demographic inputs. Our methodology includes generating source and follow-up test cases for each MR and analyzing model responses for fairness violations. The results demonstrate the effectiveness of MT in exposing bias patterns, especially in relation to tone and sentiment, and highlight specific intersections of sensitive attributes that frequently reveal fairness faults. This research improves fairness testing in LLMs, providing a structured approach to detect and mitigate biases and improve model robustness in fairness-sensitive applications.
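
Illustration (not from the paper): a metamorphic relation pairs a source test case with a follow-up that differs only in a sensitive attribute, then checks that some property of the model's responses stays stable. The sketch below assumes one possible relation, "sentiment should not shift by more than a tolerance". The prompt, the attribute values, the tolerance, and the scores are all hypothetical, and the model call and sentiment scorer are left out.

```python
def make_follow_up(prompt: str, source_attr: str, follow_up_attr: str) -> str:
    """Build the follow-up test case by swapping one demographic phrase."""
    return prompt.replace(source_attr, follow_up_attr)


def violates_relation(score_source: float, score_follow_up: float,
                      tolerance: float = 0.1) -> bool:
    """Assumed relation: sentiment scores of the two responses stay within tolerance."""
    return abs(score_source - score_follow_up) > tolerance


source = "Write a short loan recommendation for a young woman."
follow_up = make_follow_up(source, "young woman", "elderly man")

# In a real test these scores would come from running the model on both
# prompts and scoring the responses; here they are hypothetical placeholders.
stable = violates_relation(0.80, 0.78)
shifted = violates_relation(0.80, 0.40)
```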

Title: Psychological Health Knowledge-Enhanced LLM-based Social Network Crisis Intervention Text Transfer Recognition Method

Authors: Shurui Wu, Xinyi Huang, Dingxin Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07983
Pdf URL: https://arxiv.org/pdf/2504.07983
Copy Paste: [[2504.07983]] Psychological Health Knowledge-Enhanced LLM-based Social Network Crisis Intervention Text Transfer Recognition Method(https://arxiv.org/abs/2504.07983)
Keywords: large language model
Abstract: As the prevalence of mental health crises increases on social media platforms, identifying and preventing potential harm has become an urgent challenge. This study introduces a large language model (LLM)-based text transfer recognition method for social network crisis intervention, enhanced with domain-specific mental health knowledge. We propose a multi-level framework that incorporates transfer learning using BERT, and integrates mental health knowledge, sentiment analysis, and behavior prediction techniques. The framework includes a crisis annotation tool trained on social media datasets from real-world events, enabling the model to detect nuanced emotional cues and identify psychological crises. Experimental results show that the proposed method outperforms traditional models in crisis detection accuracy and exhibits greater sensitivity to subtle emotional and contextual variations.

Title: Topic mining based on fine-tuning Sentence-BERT and LDA

Authors: Jianheng Li, Lirong Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.07984
Pdf URL: https://arxiv.org/pdf/2504.07984
Copy Paste: [[2504.07984]] Topic mining based on fine-tuning Sentence-BERT and LDA(https://arxiv.org/abs/2504.07984)
Keywords: extraction
Abstract: Research background: With the continuous development of society, consumers pay more attention to key information about fine-grained product attributes when shopping. Research purposes: This study fine-tunes the Sentence-BERT word embedding model and the LDA model to mine topic features in online product reviews and show consumers details on various aspects of the products. Research methods: First, the Sentence-BERT model is fine-tuned on e-commerce online reviews, and the review text is converted into a set of word vectors with richer semantic information. Second, the vectorized word set is input into the LDA model for topic feature extraction. Finally, the key functions of the product are identified through keyword analysis under each topic. Results: This study compared the model with other word embedding models and LDA models, as well as with common topic extraction methods. The topic consistency of this model is 0.5 higher than that of other models, which improves the accuracy of topic extraction.

Title: SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Authors: Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07986
Pdf URL: https://arxiv.org/pdf/2504.07986
Copy Paste: [[2504.07986]] SEAL: Steerable Reasoning Calibration of Large Language Models for Free(https://arxiv.org/abs/2504.07986)
Keywords: large language model
Abstract: Large Language Models (LLMs), such as OpenAI's o1-series, have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at this https URL.
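
Illustration (not from the paper): the abstract describes an offline stage that extracts a steering vector in the latent space and an on-the-fly stage that intervenes on representations with it. The sketch below assumes a common construction, a difference of mean hidden states between the thoughts to encourage and those to suppress, followed by an additive shift. The difference-of-means recipe, the `strength` parameter, and the toy two-dimensional vectors are assumptions and may not match SEAL's actual procedure.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(keep: Sequence[Sequence[float]],
                    suppress: Sequence[Sequence[float]]) -> Vector:
    """Offline stage (assumed form): mean(keep) minus mean(suppress)."""
    return [a - b for a, b in zip(mean_vector(keep), mean_vector(suppress))]


def steer(hidden: Sequence[float], vector: Sequence[float],
          strength: float = 1.0) -> Vector:
    """Online stage (assumed form): shift a hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy hidden states for two thought categories (hypothetical values).
execution_states = [[1.0, 0.0], [3.0, 2.0]]
reflection_states = [[0.0, 1.0], [2.0, 5.0]]

v = steering_vector(execution_states, reflection_states)
steered = steer([0.5, 0.5], v, strength=0.5)
```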

Title: 'Neural howlround' in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution

Authors: Seth Drake
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2504.07992
Pdf URL: https://arxiv.org/pdf/2504.07992
Copy Paste: [[2504.07992]] 'Neural howlround' in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution(https://arxiv.org/abs/2504.07992)
Keywords: robust, large language model
Abstract: Large language model (LLM)-driven AI systems may exhibit an inference failure mode we term 'neural howlround,' a self-reinforcing cognitive loop where certain highly weighted inputs become dominant, leading to entrenched response patterns resistant to correction. This paper explores the mechanisms underlying this phenomenon, which is distinct from model collapse and biased salience weighting. We propose an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments and can restore adaptive reasoning, even in 'locked-in' AI systems. Additionally, we discuss some other related effects arising from improperly managed reinforcement. Finally, we outline potential applications of this mitigation strategy for improving AI robustness in real-world decision-making tasks.

Title: SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness

Authors: Biplav Srivastava, Kausik Lakkaraju, Nitin Gupta, Vansh Nagpal, Bharath C. Muppasani, Sara E. Jones
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07995
Pdf URL: https://arxiv.org/pdf/2504.07995
Copy Paste: [[2504.07995]] SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness(https://arxiv.org/abs/2504.07995)
Keywords: large language model
Abstract: Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: this https URL.

Title: BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

Authors: Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07997
Pdf URL: https://arxiv.org/pdf/2504.07997
Copy Paste: [[2504.07997]] BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models(https://arxiv.org/abs/2504.07997)
Keywords: large language model
Abstract: While large language models (LLMs) already play significant roles in society, research has shown that LLMs still generate content including social bias against certain sensitive groups. While existing benchmarks have effectively identified social biases in LLMs, a critical gap remains in our understanding of the underlying reasoning that leads to these biased outputs. This paper goes one step further to evaluate the causal reasoning process of LLMs when they answer questions eliciting social biases. We first propose a novel conceptual framework to classify the causal reasoning produced by LLMs. Next, we use LLMs to synthesize 1788 questions covering 8 sensitive attributes and manually validate them. The questions can test different kinds of causal reasoning by letting LLMs disclose their reasoning process with causal graphs. We then test 4 state-of-the-art LLMs. All models answer the majority of questions with biased causal reasoning, resulting in a total of 4135 biased causal graphs. Meanwhile, we discover 3 strategies for LLMs to avoid biased causal reasoning by analyzing the "bias-free" cases. Finally, we reveal that LLMs are also prone to "mistaken-biased" causal reasoning, where they first confuse correlation with causality to infer specific sensitive group names and then incorporate biased causal reasoning.

Title: Linguistic Interpretability of Transformer-based Language Models: a systematic review

Authors: Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08001
Pdf URL: https://arxiv.org/pdf/2504.08001
Copy Paste: [[2504.08001]] Linguistic Interpretability of Transformer-based Language Models: a systematic review(https://arxiv.org/abs/2504.08001)
Keywords: interpretability, transformer
Abstract: Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.

Title: More diverse more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce

Authors: Tong Piao, Pei Tang, Zhipeng Zhang, Jiaqi Li, Qiao Liu, Zufeng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08002
Pdf URL: https://arxiv.org/pdf/2504.08002
Copy Paste: [[2504.08002]] More diverse more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce(https://arxiv.org/abs/2504.08002)
Keywords: large language model
Abstract: In recent years, Large Language Models (LLMs) have been widely applied across various domains due to their powerful domain adaptation capabilities. Previous studies have suggested that diverse, multi-modal data can enhance LLMs' domain adaptation performance. However, this hypothesis remains insufficiently validated in the e-commerce sector. To address this gap, we propose a comprehensive e-commerce multi-task framework and design empirical experiments to examine the impact of diverse data and tasks on LLMs from two perspectives: "capability comprehensiveness" and "task comprehensiveness." Specifically, we observe significant improvements in LLM performance by progressively introducing tasks related to new major capability areas and by continuously adding subtasks within different major capability domains. Furthermore, we observe that increasing model capacity amplifies the benefits of diversity, suggesting a synergistic relationship between model capacity and data diversity. Finally, we validate the best-performing model from our empirical experiments in the KDD Cup 2024, achieving rank 5 in Task 1. This outcome demonstrates the significance of our research for advancing LLMs in the e-commerce domain.

Title: Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

Authors: Ning Li, Jingran Zhang, Justin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08003
Pdf URL: https://arxiv.org/pdf/2504.08003
Copy Paste: [[2504.08003]] Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability(https://arxiv.org/abs/2504.08003)
Keywords: robust
Abstract: OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis--seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence--remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strong capabilities in image generation and editing, our evaluation reveals GPT-4o's persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.

Title: Self-Bootstrapping for Versatile Test-Time Adaptation

Authors: Shuaicheng Niu, Guohao Chen, Peilin Zhao, Tianyi Wang, Pengcheng Wu, Zhiqi Shen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08010
Pdf URL: https://arxiv.org/pdf/2504.08010
Copy Paste: [[2504.08010]] Self-Bootstrapping for Versatile Test-Time Adaptation(https://arxiv.org/abs/2504.08010)
Keywords: transformer, segmentation
Abstract: In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks - classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image's geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power and masking these components supplies more learning signals, while masking high-frequency components cannot. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models.

Title: Can Reasoning LLMs Enhance Clinical Document Classification?

Authors: Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08040
Pdf URL: https://arxiv.org/pdf/2504.08040
Copy Paste: [[2504.08040]] Can Reasoning LLMs Enhance Clinical Document Classification?(https://arxiv.org/abs/2504.08040)
Keywords: privacy, large language model
Abstract: Clinical document classification is essential for converting unstructured medical texts into standardised ICD-10 diagnoses, yet it faces challenges due to complex medical language, privacy constraints, and limited annotated datasets. Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs, four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat), in classifying clinical discharge summaries using the MIMIC-IV dataset. Using cTAKES to structure clinical narratives, models were assessed across three experimental runs, with majority voting determining final predictions. Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking achieving the highest accuracy (75%) and F1 score (76%). However, non-reasoning models demonstrated greater stability (91% vs 84% consistency). Performance varied across ICD-10 codes, with reasoning models excelling in complex cases but struggling with abstract categories. Findings indicate a trade-off between accuracy and consistency, suggesting that a hybrid approach could optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to enhance model reliability in real-world applications.
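
Illustration (not from the paper): the protocol in the abstract, three runs per document with majority voting for the final prediction, can be sketched in a few lines. The consistency measure used here (share of documents on which all runs agree) is an assumption, since the abstract does not define it, and the code labels are placeholder data rather than results.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def consistency(runs_per_document: Sequence[Sequence[str]]) -> float:
    """Assumed definition: fraction of documents where every run gives the same label."""
    agreeing = sum(len(set(runs)) == 1 for runs in runs_per_document)
    return agreeing / len(runs_per_document)


# Placeholder labels for four documents, three runs each.
runs: List[List[str]] = [
    ["I10", "I10", "E11"],
    ["E11", "E11", "E11"],
    ["J18", "I10", "J18"],
    ["I10", "I10", "I10"],
]

final_predictions = [majority_vote(r) for r in runs]
agreement = consistency(runs)
```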

Title: Teaching Humans Subtle Differences with DIFFusion

Authors: Mia Chiquier, Orr Avrech, Yossi Gandelsman, Berthy Feng, Katherine Bouman, Carl Vondrick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08046
Pdf URL: https://arxiv.org/pdf/2504.08046
Copy Paste: [[2504.08046]] Teaching Humans Subtle Differences with DIFFusion(https://arxiv.org/abs/2504.08046)
Keywords: diffusion, generative
Abstract: Human expertise depends on the ability to recognize subtle visual differences, such as distinguishing diseases, species, or celestial phenomena. We propose a new method to teach novices how to differentiate between nuanced categories in specialized domains. Our method uses generative models to visualize the minimal change in features to transition between classes, i.e., counterfactuals, and performs well even in domains where data is sparse, examples are unpaired, and category boundaries are not easily explained by text. By manipulating the conditioning space of diffusion models, our proposed method DIFFusion disentangles category structure from instance identity, enabling high-fidelity synthesis even in challenging domains. Experiments across six domains show accurate transitions even with limited and unpaired examples across categories. User studies confirm that our generated counterfactuals outperform unpaired examples in teaching perceptual expertise, showing the potential of generative models for specialized visual learning.

Title: Compositional Flows for 3D Molecule and Synthesis Pathway Co-design

Authors: Tony Shen, Seonghwan Seo, Ross Irwin, Kieran Didi, Simon Olsson, Woo Youn Kim, Martin Ester
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08051
Pdf URL: https://arxiv.org/pdf/2504.08051
Copy Paste: [[2504.08051]] Compositional Flows for 3D Molecule and Synthesis Pathway Co-design(https://arxiv.org/abs/2504.08051)
Keywords: generative
Abstract: Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity on all 15 targets from the LIT-PCBA benchmark, and a 5.8× improvement in sampling efficiency compared to a 2D synthesis-based baseline. To the best of our knowledge, our method is also the first to achieve state-of-the-art performance in both Vina Dock (-9.38) and AiZynth success rate (62.2%) on the CrossDocked benchmark.

Title: X-DECODE: EXtreme Deblurring with Curriculum Optimization and Domain Equalization

Authors: Sushant Gautam, Jingdao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08072
Pdf URL: https://arxiv.org/pdf/2504.08072
Copy Paste: [[2504.08072]] X-DECODE: EXtreme Deblurring with Curriculum Optimization and Domain Equalization(https://arxiv.org/abs/2504.08072)
Keywords: robust
Abstract: Restoring severely blurred images remains a significant challenge in computer vision, impacting applications in autonomous driving, medical imaging, and photography. This paper introduces a novel training strategy based on curriculum learning to improve the robustness of deep learning models for extreme image deblurring. Unlike conventional approaches that train on only low to moderate blur levels, our method progressively increases the difficulty by introducing images with higher blur severity over time, allowing the model to adapt incrementally. Additionally, we integrate perceptual and hinge loss during training to enhance fine detail restoration and improve training stability. We experimented with various curriculum learning strategies and explored the impact of the train-test domain gap on the deblurring performance. Experimental results on the Extreme-GoPro dataset showed that our method outperforms the next best method by 14% in SSIM, whereas experiments on the Extreme-KITTI dataset showed that our method outperforms the next best by 18% in SSIM. Ablation studies showed that a linear curriculum progression outperforms step-wise, sigmoid, and exponential progressions, while hyperparameter settings such as the training blur percentage and loss function formulation all play important roles in addressing extreme blur artifacts. Datasets and code are available at this https URL.
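
Illustration (not from the paper): a linear curriculum progression, the variant the abstract reports as working best, raises blur severity by equal steps over training. The sketch below is one way to write such a schedule. The function name, the per-epoch granularity, and the severity scale (1.0 to 9.0) are assumptions for this example.

```python
def linear_blur_schedule(epoch: int, total_epochs: int,
                         min_severity: float, max_severity: float) -> float:
    """Blur severity for a given epoch, rising linearly from min to max."""
    if total_epochs <= 1:
        return max_severity
    # Clamp progress to [0, 1] so out-of-range epochs stay within bounds.
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return min_severity + progress * (max_severity - min_severity)


# Hypothetical five-epoch run on a severity scale from 1.0 to 9.0.
schedule = [linear_blur_schedule(e, 5, 1.0, 9.0) for e in range(5)]
```

A step-wise, sigmoid, or exponential progression would replace only the `progress` expression, which keeps the comparison between curricula confined to one line.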

Title: Deep Reinforcement Learning for Day-to-day Dynamic Tolling in Tradable Credit Schemes

Authors: Xiaoyi Wu, Ravi Seshadri, Filipe Rodrigues, Carlos Lima Azevedo
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2504.08074
Pdf URL: https://arxiv.org/pdf/2504.08074
Copy Paste: [[2504.08074]] Deep Reinforcement Learning for Day-to-day Dynamic Tolling in Tradable Credit Schemes(https://arxiv.org/abs/2504.08074)
Keywords: robust
Abstract: Tradable credit schemes (TCS) are an increasingly studied alternative to congestion pricing, given their revenue neutrality and ability to address issues of equity through the initial credit allocation. Modeling TCS to aid future design and implementation is associated with challenges involving user and market behaviors, demand-supply dynamics, and control mechanisms. In this paper, we focus on the latter and address the day-to-day dynamic tolling problem under TCS, which is formulated as a discrete-time Markov Decision Process and solved using reinforcement learning (RL) algorithms. Our results indicate that RL algorithms achieve travel times and social welfare comparable to the Bayesian optimization benchmark, with generalization across varying capacities and demand levels. We further assess the robustness of RL under different hyperparameters and apply regularization techniques to mitigate action oscillation, which generates practical tolling strategies that are transferable under day-to-day demand and supply variability. Finally, we discuss potential challenges such as scaling to large networks, and show how transfer learning can be leveraged to improve computational efficiency and facilitate the practical deployment of RL-based TCS solutions.

Title: Differentially Private Selection using Smooth Sensitivity

Authors: Iago Chaves, Victor Farias, Amanda Perez, Diego Parente, Javam Machado
Subjects: cs.LG, cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2504.08086
Pdf URL: https://arxiv.org/pdf/2504.08086
Copy Paste: [[2504.08086]] Differentially Private Selection using Smooth Sensitivity(https://arxiv.org/abs/2504.08086)
Keywords: privacy
Abstract: Differentially private selection mechanisms offer strong privacy guarantees for queries aiming to identify the top-scoring element r from a finite set R, based on a dataset-dependent utility function. While selection queries are fundamental in data science, few mechanisms effectively ensure their privacy. Furthermore, most approaches rely on global sensitivity to achieve differential privacy (DP), which can introduce excessive noise and impair downstream inferences. To address this limitation, we propose the Smooth Noisy Max (SNM) mechanism, which leverages smooth sensitivity to yield provably tighter (upper bounds on) expected errors compared to global sensitivity-based methods. Empirical results demonstrate that SNM is more accurate than state-of-the-art differentially private selection methods in three applications: percentile selection, greedy decision trees, and random forests.
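
Illustration (not from the paper): for context, the classic global-sensitivity baseline that the abstract contrasts with is report-noisy-max, which adds Laplace noise to every candidate's utility and returns the arg max. The sketch below implements that baseline only, with a conservative noise scale of 2 * sensitivity / epsilon chosen for this example. It is not the proposed Smooth Noisy Max mechanism, whose smooth-sensitivity calibration is not described in the abstract. Candidate names and scores are hypothetical.

```python
import random
from typing import Dict


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def report_noisy_max(scores: Dict[str, float], epsilon: float,
                     sensitivity: float, rng: random.Random) -> str:
    """Global-sensitivity baseline: perturb each utility, return the top candidate."""
    scale = 2.0 * sensitivity / epsilon  # conservative scale, an assumption of this sketch
    noisy = {candidate: score + laplace_noise(scale, rng)
             for candidate, score in scores.items()}
    return max(noisy, key=noisy.get)


# Hypothetical utilities; a very large epsilon makes the noise negligible,
# so the true top scorer is returned.
scores = {"a": 10.0, "b": 50.0, "c": 20.0}
winner = report_noisy_max(scores, epsilon=1000.0, sensitivity=1.0,
                          rng=random.Random(0))
```

With a small epsilon the noise scale grows and the returned candidate becomes increasingly random, which is the accuracy cost that tighter sensitivity bounds aim to reduce.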

Title: ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting

Authors: Junbang Liu, Enpei Huang, Dongxing Mao, Hui Zhang, Xinyuan Song, Yongxin Ni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08100
Pdf URL: https://arxiv.org/pdf/2504.08100
Copy Paste: [[2504.08100]] ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting(https://arxiv.org/abs/2504.08100)
Keywords: diffusion, generative
Abstract: Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically utilize score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is often limited by the visual inconsistencies of the diffusion model outputs. In this work, we propose ContrastiveGaussian, which integrates contrastive learning into the generative process. By using a perceptual loss, we effectively differentiate between positive and negative samples, leveraging the visual inconsistencies to improve 3D generation quality. To further enhance sample differentiation and improve contrastive learning, we incorporate a super-resolution model and introduce another Quantity-Aware Triplet Loss to address varying sample distributions during training. Our experiments demonstrate that our approach achieves superior texture fidelity and improved geometric consistency.

Title: Multi-view autoencoders for Fake News Detection

Authors: Ingryd V. S. T. Pereira, George D. C. Cavalcanti, Rafael M. O. Cruz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08102
Pdf URL: https://arxiv.org/pdf/2504.08102
Copy Paste: [[2504.08102]] Multi-view autoencoders for Fake News Detection(https://arxiv.org/abs/2504.08102)
Keywords: extraction
Abstract: Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research on fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source code, figures, and datasets, please refer to the project's repository: this https URL.

Title: Geneshift: Impact of different scenario shift on Jailbreaking LLM

Authors: Tianyi Wu, Zhiwei Xue, Yue Liu, Jiaheng Zhang, Bryan Hooi, See-Kiong Ng
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.08104
Pdf URL: https://arxiv.org/pdf/2504.08104
Copy Paste: [[2504.08104]] Geneshift: Impact of different scenario shift on Jailbreaking LLM(https://arxiv.org/abs/2504.08104)
Keywords: attack, steal
Abstract: Jailbreak attacks, which aim to cause LLMs to perform unrestricted behaviors, have become a critical and challenging direction in AI safety. Despite achieving promising attack success rates under dictionary-based evaluation, existing jailbreak attack methods fail to output content detailed enough to satisfy the harmful request, leading to poor performance under GPT-based evaluation. To this end, we propose a black-box jailbreak attack termed GeneShift, which uses a genetic algorithm to optimize the scenario shifts. First, we observe that different malicious queries perform best under different scenario shifts. Based on this observation, we develop a genetic algorithm to evolve and select a hybrid of scenario shifts. It guides our method to elicit detailed and actionable harmful responses while keeping a seemingly benign facade, improving stealthiness. Extensive experiments demonstrate the superiority of GeneShift. Notably, GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.

Title: Towards Unconstrained 2D Pose Estimation of the Human Spine

Authors: Muhammad Saif Ullah Khan, Stephan Krauß, Didier Stricker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08110
Pdf URL: https://arxiv.org/pdf/2504.08110
Copy Paste: [[2504.08110]] Towards Unconstrained 2D Pose Estimation of the Human Spine(https://arxiv.org/abs/2504.08110)
Keywords: robust
Abstract: We present SpineTrack, the first comprehensive dataset for 2D spine pose estimation in unconstrained settings, addressing a crucial need in sports analytics, healthcare, and realistic animation. Existing pose datasets often simplify the spine to a single rigid segment, overlooking the nuanced articulation required for accurate motion analysis. In contrast, SpineTrack annotates nine detailed spinal keypoints across two complementary subsets: a synthetic set comprising 25k annotations created using Unreal Engine with biomechanical alignment through OpenSim, and a real-world set comprising over 33k annotations curated via an active learning pipeline that iteratively refines automated annotations with human feedback. This integrated approach ensures anatomically consistent labels at scale, even for challenging, in-the-wild images. We further introduce SpinePose, extending state-of-the-art body pose estimators using knowledge distillation and an anatomical regularization strategy to jointly predict body and spine keypoints. Our experiments in both general and sports-specific contexts validate the effectiveness of SpineTrack for precise spine pose estimation, establishing a robust foundation for future research in advanced biomechanical analysis and 3D spine reconstruction in the wild.

Title: POEM: Precise Object-level Editing via MLLM control

Authors: Marco Schouten, Mehmet Onurcan Kaya, Serge Belongie, Dim P. Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08111
Pdf URL: https://arxiv.org/pdf/2504.08111
Copy Paste: [[2504.08111]] POEM: Precise Object-level Editing via MLLM control(https://arxiv.org/abs/2504.08111)
Keywords: diffusion, large language model
Abstract: Diffusion models have significantly improved text-to-image generation, producing high-quality, realistic images from textual descriptions. Beyond generation, object-level image editing remains a challenging problem, requiring precise modifications while preserving visual coherence. Existing text-based instructional editing methods struggle with localized shape and layout transformations, often introducing unintended global changes. Image interaction-based approaches offer better accuracy but require manual human effort to provide precise guidance. To reduce this manual effort while maintaining a high image editing accuracy, in this paper, we propose POEM, a framework for Precise Object-level Editing using Multimodal Large Language Models (MLLMs). POEM leverages MLLMs to analyze instructional prompts and generate precise object masks before and after transformation, enabling fine-grained control without extensive user input. This structured reasoning stage guides the diffusion-based editing process, ensuring accurate object localization and transformation. To evaluate our approach, we introduce VOCEdits, a benchmark dataset based on PASCAL VOC 2012, augmented with instructional edit prompts, ground-truth transformations, and precise object masks. Experimental results show that POEM outperforms existing text-based image editing approaches in precision and reliability while reducing manual effort compared to interaction-based methods.

Title: Scaling Laws of Graph Neural Networks for Atomistic Materials Modeling

Authors: Chaojian Li, Zhifan Ye, Massimiliano Lupo Pasini, Jong Youl Choi, Cheng Wan, Yingyan Celine Lin, Prasanna Balaprakash
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2504.08112
Pdf URL: https://arxiv.org/pdf/2504.08112
Copy Paste: [[2504.08112]] Scaling Laws of Graph Neural Networks for Atomistic Materials Modeling(https://arxiv.org/abs/2504.08112)
Keywords: large language model
Abstract: Atomistic materials modeling is a critical task with wide-ranging applications, from drug discovery to materials science, where accurate predictions of the target material property can lead to significant advancements in scientific discovery. Graph Neural Networks (GNNs) represent the state-of-the-art approach for modeling atomistic material data thanks to their capacity to capture complex relational structures. While machine learning performance has historically improved with larger models and datasets, GNNs for atomistic materials modeling remain relatively small compared to large language models (LLMs), which leverage billions of parameters and terabyte-scale datasets to achieve remarkable performance in their respective domains. To address this gap, we explore the scaling limits of GNNs for atomistic materials modeling by developing a foundational model with billions of parameters, trained on extensive terabyte-scale datasets. Our approach incorporates techniques from LLM libraries to efficiently manage large-scale data and models, enabling both effective training and deployment of these large-scale GNN models. This work addresses three fundamental questions in scaling GNNs: the potential for scaling GNN model architectures, the effect of dataset size on model accuracy, and the applicability of LLM-inspired techniques to GNN architectures. Specifically, the outcomes of this study include (1) insights into the scaling laws for GNNs, highlighting the relationship between model size, dataset volume, and accuracy, (2) a foundational GNN model optimized for atomistic materials modeling, and (3) a GNN codebase enhanced with advanced LLM-based training techniques. Our findings lay the groundwork for large-scale GNNs with billions of parameters and terabyte-scale datasets, establishing a scalable pathway for future advancements in atomistic materials modeling.

Title: Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms

Authors: Lucian Chauvin, Somil Gupta, Angelina Ibarra, Joshua Peeples
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08115
Pdf URL: https://arxiv.org/pdf/2504.08115
Copy Paste: [[2504.08115]] Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms(https://arxiv.org/abs/2504.08115)
Keywords: segmentation
Abstract: Anomaly detection is a key research challenge in computer vision and machine learning with applications in many fields from quality control to radar imaging. In radar imaging, specifically synthetic aperture radar (SAR), anomaly detection can be used for the classification, detection, and segmentation of objects of interest. However, there is no established way to develop and benchmark such methods on SAR imagery. To address this issue, we introduce SAR imagery anomaly detection (SARIAD). In conjunction with Anomalib, a deep-learning library for anomaly detection, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection approaches on SAR imagery. SARIAD specifically integrates multiple SAR datasets along with tools to effectively apply various anomaly detection algorithms to SAR imagery. Several anomaly detection metrics and visualizations are available. Overall, SARIAD acts as a central package for benchmarking SAR models and datasets to allow for reproducible research in the field of anomaly detection in SAR imagery. This package is publicly available: this https URL.

Title: DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Authors: Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08120
Pdf URL: https://arxiv.org/pdf/2504.08120
Copy Paste: [[2504.08120]] DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?(https://arxiv.org/abs/2504.08120)
Keywords: large language model
Abstract: Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.

Title: Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects

Authors: Shalini Maiti, Lourdes Agapito, Filippos Kokkinos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08125
Pdf URL: https://arxiv.org/pdf/2504.08125
Copy Paste: [[2504.08125]] Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects(https://arxiv.org/abs/2504.08125)
Keywords: robust, large language model
Abstract: Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality by analyzing 3D surface normals, without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, placing it as a comprehensive and accessible benchmark for future research on text-to-3D generation. The project page can be found here: this https URL.

Title: A physics informed neural network approach to simulating ice dynamics governed by the shallow ice approximation

Authors: Kapil Chawla, William Holmes
Subjects: cs.LG, math.NA, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2504.08136
Pdf URL: https://arxiv.org/pdf/2504.08136
Copy Paste: [[2504.08136]] A physics informed neural network approach to simulating ice dynamics governed by the shallow ice approximation(https://arxiv.org/abs/2504.08136)
Keywords: robust
Abstract: In this article we develop a Physics Informed Neural Network (PINN) approach to simulate ice sheet dynamics governed by the Shallow Ice Approximation. This problem takes the form of a time-dependent parabolic obstacle problem. Prior work has used this approach to address the stationary obstacle problem, and here we extend it to the time-dependent problem. Through comprehensive 1D and 2D simulations, we validate the model's effectiveness in capturing complex free-boundary conditions. By merging traditional mathematical modeling with cutting-edge deep learning methods, this approach provides a scalable and robust solution for predicting temporal variations in ice thickness. To illustrate this approach in a real-world setting, we simulate the dynamics of the Devon Ice Cap, incorporating aerogeophysical data from 2000 and 2018.

Title: Impact of Language Guidance: A Reproducibility Study

Authors: Cherish Puniani, Advika Sinha, Shree Singhi, Aayan Yadav
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08140
Pdf URL: https://arxiv.org/pdf/2504.08140
Copy Paste: [[2504.08140]] Impact of Language Guidance: A Reproducibility Study(https://arxiv.org/abs/2504.08140)
Keywords: interpretability
Abstract: Modern deep-learning architectures need large amounts of data to produce state-of-the-art results. Annotating such huge datasets is time-consuming, expensive, and prone to human error. Recent advances in self-supervised learning allow us to train huge models without explicit annotation. Contrastive learning is a popular paradigm in self-supervised learning. Recent works like SimCLR and CLIP rely on image augmentations or directly minimizing cross-modal loss between image and text. Banani et al. (2023) propose to use language guidance to sample view pairs. They claim that language enables better conceptual similarity, eliminating the effects of visual variability. We reproduce their experiments to verify their claims and find that their dataset, RedCaps, contains low-quality captions. We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance, and we also devise a new metric to evaluate the semantic capabilities of self-supervised models based on interpretability methods.

Title: LoRAX: LoRA eXpandable Networks for Continual Synthetic Image Attribution

Authors: Danielle Sullivan-Pao, Nicole Tian, Pooya Khorrami
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08149
Pdf URL: https://arxiv.org/pdf/2504.08149
Copy Paste: [[2504.08149]] LoRAX: LoRA eXpandable Networks for Continual Synthetic Image Attribution(https://arxiv.org/abs/2504.08149)
Keywords: generative
Abstract: As generative AI image technologies become more widespread and advanced, there is a growing need for strong attribution models. These models are crucial for verifying the authenticity of images and identifying the architecture of their originating generative models, which is key to maintaining media integrity. However, attribution models struggle to generalize to unseen models, and traditional fine-tuning methods for updating these models have been shown to be impractical in real-world settings. To address these challenges, we propose LoRA eXpandable Networks (LoRAX), a parameter-efficient class incremental algorithm that adapts to novel generative image models without the need for full retraining. Our approach trains an extremely parameter-efficient feature extractor per continual learning task via Low Rank Adaptation. Each task-specific feature extractor learns distinct features while only requiring a small fraction of the parameters present in the underlying feature extractor's backbone model. Our extensive experimentation shows LoRAX outperforms or remains competitive with state-of-the-art class incremental learning algorithms on the Continual Deepfake Detection benchmark across all training scenarios and memory settings, while requiring less than 3% of the number of trainable parameters per feature extractor compared to the full-rank implementation. LoRAX code is available at: this https URL.
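
To give a feel for why Low Rank Adaptation needs so few trainable parameters, the short calculation below compares a rank-r adapter (two factors of shapes d_out x r and r x d_in) with a full d_out x d_in weight update. The layer size 768 x 768 and rank 8 are hypothetical values chosen for illustration and are not taken from the paper, so the resulting fraction should not be read as the paper's reported figure.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Fraction of trainable parameters in a rank-`rank` adapter vs. a full update."""
    full_params = d_in * d_out            # dense weight update
    lora_params = rank * (d_in + d_out)   # low-rank factors B (d_out x r) and A (r x d_in)
    return lora_params / full_params

# Hypothetical layer: 768 x 768 projection adapted with rank 8.
fraction = lora_param_fraction(d_in=768, d_out=768, rank=8)
print(f"trainable fraction: {fraction:.4f}")
```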

Title: Beyond Feature Importance: Feature Interactions in Predicting Post-Stroke Rigidity with Graph Explainable AI

Authors: Jiawei Xu, Yonggeon Lee, Anthony Elkommos Youssef, Eunjin Yun, Tinglin Huang, Tianjian Guo, Hamidreza Saber, Rex Ying, Ying Ding
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08150
Pdf URL: https://arxiv.org/pdf/2504.08150
Copy Paste: [[2504.08150]] Beyond Feature Importance: Feature Interactions in Predicting Post-Stroke Rigidity with Graph Explainable AI(https://arxiv.org/abs/2504.08150)
Keywords: explainability, transformer
Abstract: This study addresses the challenge of predicting post-stroke rigidity by emphasizing feature interactions through graph-based explainable AI. Post-stroke rigidity, characterized by increased muscle tone and stiffness, significantly affects survivors' mobility and quality of life. Despite its prevalence, early prediction remains limited, delaying intervention. We analyze 519K stroke hospitalization records from the Healthcare Cost and Utilization Project dataset, where 43% of patients exhibited rigidity. We compare traditional approaches such as Logistic Regression, XGBoost, and Transformer with graph-based models like Graphormer and Graph Attention Network. These graph models inherently capture feature interactions and incorporate intrinsic or post-hoc explainability. Our results show that graph-based methods outperform others (AUROC 0.75), identifying key predictors such as NIH Stroke Scale and APR-DRG mortality risk scores. They also uncover interactions missed by conventional models. This research provides a novel application of graph-based XAI in stroke prognosis, with potential to guide early identification and personalized rehabilitation strategies.

Title: Adaptive Bounded Exploration and Intermediate Actions for Data Debiasing

Authors: Yifan Yang, Yang Liu, Parinaz Naghizadeh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08151
Pdf URL: https://arxiv.org/pdf/2504.08151
Copy Paste: [[2504.08151]] Adaptive Bounded Exploration and Intermediate Actions for Data Debiasing(https://arxiv.org/abs/2504.08151)
Keywords: fair
Abstract: The performance of algorithmic decision rules is largely dependent on the quality of training datasets available to them. Biases in these datasets can raise economic and ethical concerns due to the resulting algorithms' disparate treatment of different groups. In this paper, we propose algorithms for sequentially debiasing the training dataset through adaptive and bounded exploration in a classification problem with costly and censored feedback. Our proposed algorithms balance the ultimate goal of mitigating the impacts of data biases, which will in turn lead to more accurate and fairer decisions, against the exploration risks incurred to achieve this goal. Specifically, we propose adaptive bounds to limit the region of exploration, and leverage intermediate actions which provide noisy label information at a lower cost. We analytically show that such exploration can help debias data in certain distributions, investigate how algorithmic fairness interventions can work in conjunction with our proposed algorithms, and validate the performance of these algorithms through numerical experiments on synthetic and real-world data.

Title: Investigating Vision-Language Model for Point Cloud-based Vehicle Classification

Authors: Yiqiao Li, Jie Wei, Camille Kamga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08154
Pdf URL: https://arxiv.org/pdf/2504.08154
Copy Paste: [[2504.08154]] Investigating Vision-Language Model for Point Cloud-based Vehicle Classification(https://arxiv.org/abs/2504.08154)
Keywords: large language model
Abstract: Heavy-duty trucks pose significant safety challenges due to their large size and limited maneuverability compared to passenger vehicles. A deeper understanding of truck characteristics is essential for enhancing the safety of cooperative autonomous driving. Traditional LiDAR-based truck classification methods rely on extensive manual annotations, which makes them labor-intensive and costly. The rapid advancement of large language models (LLMs) trained on massive datasets presents an opportunity to leverage their few-shot learning capabilities for truck classification. However, existing vision-language models (VLMs) are primarily trained on image datasets, which makes it challenging to directly process point cloud data. This study introduces a novel framework that integrates roadside LiDAR point cloud data with VLMs to facilitate efficient and accurate truck classification, which supports cooperative and safe driving environments. The framework rests on three key innovations: (1) leveraging real-world LiDAR datasets for model development, (2) designing a preprocessing pipeline to adapt point cloud data for VLM input, including point cloud registration for dense 3D rendering and mathematical morphological techniques to enhance feature representation, and (3) utilizing in-context learning with few-shot prompting to enable vehicle classification with minimally labeled training data. Experimental results demonstrate encouraging performance of this method and present its potential to reduce annotation efforts while improving classification accuracy.

Title: Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Authors: Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08165
Pdf URL: https://arxiv.org/pdf/2504.08165
Copy Paste: [[2504.08165]] Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora(https://arxiv.org/abs/2504.08165)
Keywords: large language model
Abstract: Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.

Title: Learning Object Focused Attention

Authors: Vivek Trivedy, Amani Almalki, Longin Jan Latecki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08166
Pdf URL: https://arxiv.org/pdf/2504.08166
Copy Paste: [[2504.08166]] Learning Object Focused Attention(https://arxiv.org/abs/2504.08166)
Keywords: diffusion, transformer
Abstract: We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.

Title: On the Practice of Deep Hierarchical Ensemble Network for Ad Conversion Rate Prediction

Authors: Jinfeng Zhuang, Yinrui Li, Runze Su, Ke Xu, Zhixuan Shao, Kungang Li, Ling Leng, Han Sun, Meng Qi, Yixiong Meng, Yang Tang, Zhifang Liu, Qifei Shen, Aayush Mudgal
Subjects: cs.LG, cs.AI, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08169
Pdf URL: https://arxiv.org/pdf/2504.08169
Copy Paste: [[2504.08169]] On the Practice of Deep Hierarchical Ensemble Network for Ad Conversion Rate Prediction(https://arxiv.org/abs/2504.08169)
Keywords: attack, transformer
Abstract: The predictions of click through rate (CTR) and conversion rate (CVR) play a crucial role in the success of ad-recommendation systems. A Deep Hierarchical Ensemble Network (DHEN) has been proposed to integrate multiple feature crossing modules and has achieved great success in CTR prediction. However, its performance for CVR prediction is unclear in the conversion ads setting, where an ad bids for the probability of a user's off-site actions on a third party website or app, including purchase, add to cart, sign up, etc. Applying DHEN raises several questions: 1) Which feature-crossing modules (MLP, DCN, Transformer, to name a few) should be included in DHEN? 2) How deep and wide should DHEN be to achieve the best trade-off between efficiency and efficacy? 3) Which hyper-parameters should be chosen in each feature-crossing module? Orthogonal to the model architecture, the input personalization features also significantly impact model performance with a high degree of freedom. In this paper, we attack this problem and present contributions weighted toward the applied data science side, including: First, we propose a multitask learning framework with DHEN as the single backbone model architecture to predict all CVR tasks, with a detailed study on how to make DHEN work effectively in practice; Second, we build both on-site real-time user behavior sequences and off-site conversion event sequences for CVR prediction purposes, and conduct an ablation study on their importance; Last but not least, we propose a self-supervised auxiliary loss to predict future actions in the input sequence, to help resolve the label sparseness issue in CVR prediction. Our method achieves state-of-the-art performance compared to previous single feature crossing modules with pre-trained user personalization features.

Title: Multi-person Physics-based Pose Estimation for Combat Sports

Authors: Hossein Feiz, David Labbé, Thomas Romeas, Jocelyn Faubert, Sheldon Andrews
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08175
Pdf URL: https://arxiv.org/pdf/2504.08175
Copy Paste: [[2504.08175]] Multi-person Physics-based Pose Estimation for Combat Sports(https://arxiv.org/abs/2504.08175)
Keywords: robust, transformer, segmentation
Abstract: We propose a novel framework for accurate 3D human pose estimation in combat sports using sparse multi-camera setups. Our method integrates robust multi-view 2D pose tracking via a transformer-based top-down approach, employing epipolar geometry constraints and long-term video object segmentation for consistent identity tracking across views. Initial 3D poses are obtained through weighted triangulation and spline smoothing, followed by kinematic optimization to refine pose accuracy. We further enhance pose realism and robustness by introducing a multi-person physics-based trajectory optimization step, effectively addressing challenges such as rapid motions, occlusions, and close interactions. Experimental results on diverse datasets, including a new benchmark of elite boxing footage, demonstrate state-of-the-art performance. Additionally, we release comprehensive annotated video datasets to advance future research in multi-person pose estimation for combat sports.
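
The weighted triangulation step mentioned in this abstract can be illustrated with a confidence-weighted direct linear transform (DLT). This is a generic sketch rather than the authors' implementation: the function names, the per-view weights, and the two toy cameras are assumptions made for illustration.

```python
import numpy as np

def triangulate_weighted(projections, points_2d, weights):
    """Confidence-weighted DLT: solve for the 3D point seen at (u, v) in each view."""
    rows = []
    for P, (u, v), w in zip(projections, points_2d, weights):
        # Each view contributes two linear constraints, scaled by its weight.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras one unit apart along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

observations = [project(P1, X_true), project(P2, X_true)]
X_est = triangulate_weighted([P1, P2], observations, weights=[0.9, 0.6])
```

With noise-free observations the estimate matches the true point regardless of the weights; with noisy 2D detections, lower weights reduce the influence of less reliable views.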

Title: GenXSS: an AI-Driven Framework for Automated Detection of XSS Attacks in WAFs

Authors: Vahid Babaey, Arun Ravindran
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08176
Pdf URL: https://arxiv.org/pdf/2504.08176
Copy Paste: [[2504.08176]] GenXSS: an AI-Driven Framework for Automated Detection of XSS Attacks in WAFs(https://arxiv.org/abs/2504.08176)
Keywords: secure, security, protect, defense, attack, generative, large language model
Abstract: The increasing reliance on web services has led to a rise in cybersecurity threats, particularly Cross-Site Scripting (XSS) attacks, which target client-side layers of web applications by injecting malicious scripts. Traditional Web Application Firewalls (WAFs) struggle to detect highly obfuscated and complex attacks, as their rules require manual updates. This paper presents a novel generative AI framework that leverages Large Language Models (LLMs) to enhance XSS mitigation. The framework achieves two primary objectives: (1) generating sophisticated and syntactically validated XSS payloads using in-context learning, and (2) automating defense mechanisms by testing these attacks against a vulnerable application secured by a WAF, classifying bypassing attacks, and generating effective WAF security rules. Experimental results using GPT-4o demonstrate the framework's effectiveness, generating 264 XSS payloads, 83% of which were validated, with 80% bypassing the ModSecurity WAF equipped with an industry standard security rule set developed by the Open Web Application Security Project (OWASP) to protect against web vulnerabilities. Through rule generation, 86% of previously successful attacks were blocked using only 15 new rules. In comparison, Google Gemini Pro achieved a lower bypass rate of 63%, highlighting performance differences across LLMs.

Title: TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

Authors: Ruineng Li, Daitao Xing, Huiming Sun, Yuanzhou Ha, Jinglin Shen, Chiuman Ho
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08181
Pdf URL: https://arxiv.org/pdf/2504.08181
Copy Paste: [[2504.08181]] TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation(https://arxiv.org/abs/2504.08181)
Keywords: diffusion
Abstract: Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially-and-temporally varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.

Title: SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

Authors: Aashiq Muhamed, Jacopo Bonato, Mona Diab, Virginia Smith
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08192
Pdf URL: https://arxiv.org/pdf/2504.08192
Copy Paste: [[2504.08192]] SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs(https://arxiv.org/abs/2504.08192)
Keywords: attack, robust, interpretability
Abstract: Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders are well-suited to improve these aspects by enabling targeted activation-based unlearning, prior approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce $\textbf{Dynamic SAE Guardrails}$ (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning -- offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.

Title: The More is not the Merrier: Investigating the Effect of Client Size on Federated Learning

Authors: Eleanor Wallach, Sage Siler, Jing Deng
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08198
Pdf URL: https://arxiv.org/pdf/2504.08198
Copy Paste: [[2504.08198]] The More is not the Merrier: Investigating the Effect of Client Size on Federated Learning(https://arxiv.org/abs/2504.08198)
Keywords: security, privacy, protect, attack, federate
Abstract: Federated Learning (FL) has been introduced as a way to keep data local to clients while training a shared machine learning model, as clients train on their local data and send trained models to a central aggregator. It is expected that FL will have major implications for Mobile Edge Computing (MEC), the Internet of Things, and Cross-Silo FL. In this paper, we focus on the widely used FedAvg algorithm to explore the effect of the number of clients in FL. We find a significant deterioration of learning accuracy for FedAvg as the number of clients increases. To address this issue for a general application, we propose a method called Knowledgeable Client Insertion (KCI) that introduces a very small number of knowledgeable clients to the MEC setting. These knowledgeable clients are expected to have accumulated a large set of data samples to help with training. With the help of KCI, the learning accuracy of FL increases much faster even with a normal FedAvg aggregation technique. We expect this approach to be able to provide great privacy protection for clients against security attacks such as model inversion attacks. Our code is available at this https URL.
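
For readers unfamiliar with the aggregation step mentioned above, the following is a minimal sketch of standard FedAvg weighting, in which each client's parameter vector is weighted by its local sample count. It is not the authors' code and does not implement KCI; the function name `fedavg` and the toy numbers are illustrative assumptions.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighting each by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, n_samples in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        share = n_samples / total  # this client's fraction of all samples
        for i, value in enumerate(weights):
            global_weights[i] += share * value
    return global_weights


# Hypothetical toy example: the second client holds three times as much data,
# so the aggregate lies closer to its parameters.
aggregated = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Because the weights are proportional to sample counts, a single client with a large local dataset pulls the aggregate strongly toward its own model under plain FedAvg.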

Title: Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

Authors: Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08202
Pdf URL: https://arxiv.org/pdf/2504.08202
Copy Paste: [[2504.08202]] Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models(https://arxiv.org/abs/2504.08202)
Keywords: large language model
Abstract: Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models' intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model's ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve simultaneously with its ability to leverage contextual knowledge through extrinsic retrieval ability. Moreover, better extrinsic retrieval can interfere with the model's ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models based on their capabilities across both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.

Title: EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models

Authors: Minjae Seo, Myoungsung You, Junhee Lee, Jaehan Kim, Hwanjo Heo, Jintae Oh, Jinwoo Kim
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08205
Pdf URL: https://arxiv.org/pdf/2504.08205
Copy Paste: [[2504.08205]] EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models(https://arxiv.org/abs/2504.08205)
Keywords: attack
Abstract: Vision models are increasingly deployed in critical applications such as autonomous driving and CCTV monitoring, yet they remain susceptible to resource-consuming attacks. In this paper, we introduce a novel energy-overloading attack that leverages vision language model (VLM) prompts to generate adversarial images targeting vision models. These images, though imperceptible to the human eye, significantly increase GPU energy consumption across various vision models, threatening the availability of these systems. Our framework, EO-VLM (Energy Overload via VLM), is model-agnostic, meaning it is not limited by the architecture or type of the target vision model. By exploiting the lack of safety filters in VLMs like DALL-E 3, we create adversarial noise images without requiring prior knowledge or internal structure of the target vision models. Our experiments demonstrate up to a 50% increase in energy consumption, revealing a critical vulnerability in current vision models.

Title: DrivAer Transformer: A high-precision and fast prediction method for vehicle aerodynamic drag coefficient based on the DrivAerNet++ dataset

Authors: Jiaqi He, Xiangwen Luo, Yiping Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08217
Pdf URL: https://arxiv.org/pdf/2504.08217
Copy Paste: [[2504.08217]] DrivAer Transformer: A high-precision and fast prediction method for vehicle aerodynamic drag coefficient based on the DrivAerNet++ dataset(https://arxiv.org/abs/2504.08217)
Keywords: transformer
Abstract: At the current stage, deep learning-based methods have demonstrated excellent capabilities in evaluating aerodynamic performance, significantly reducing the time and cost required for traditional computational fluid dynamics (CFD) simulations. However, when faced with the task of processing extremely complex three-dimensional (3D) vehicle models, the lack of large-scale datasets and training resources, coupled with the inherent diversity and complexity of the geometry of different vehicle models, means that the prediction accuracy and versatility of these networks are still not up to the level required for current production. In view of the remarkable success of Transformer models in the field of natural language processing and their strong potential in the field of image processing, this study innovatively proposes a point cloud learning framework called DrivAer Transformer (DAT). The DAT structure uses the DrivAerNet++ dataset, which contains high-fidelity CFD data of industrial-standard 3D vehicle shapes, enabling accurate estimation of air drag directly from 3D meshes, thus avoiding the limitations of traditional methods such as 2D image rendering or signed distance fields (SDF). DAT enables fast and accurate drag prediction, driving the evolution of the aerodynamic evaluation process and laying the critical foundation for introducing a data-driven approach to automotive design. The framework is expected to accelerate the vehicle design process and improve development efficiency.

Title: VL-UR: Vision-Language-guided Universal Restoration of Images Degraded by Adverse Weather Conditions

Authors: Ziyan Liu, Yuxu Lu, Huashan Yu, Dong Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08219
Pdf URL: https://arxiv.org/pdf/2504.08219
Copy Paste: [[2504.08219]] VL-UR: Vision-Language-guided Universal Restoration of Images Degraded by Adverse Weather Conditions(https://arxiv.org/abs/2504.08219)
Keywords: security, robust
Abstract: Image restoration is critical for improving the quality of degraded images, which is vital for applications like autonomous driving, security surveillance, and digital content enhancement. However, existing methods are often tailored to specific degradation scenarios, limiting their adaptability to the diverse and complex challenges in real-world environments. Moreover, real-world degradations are typically non-uniform, highlighting the need for adaptive and intelligent solutions. To address these issues, we propose a novel vision-language-guided universal restoration (VL-UR) framework. VL-UR leverages a zero-shot contrastive language-image pre-training (CLIP) model to enhance image restoration by integrating visual and semantic information. A scene classifier is introduced to adapt CLIP, generating high-quality language embeddings aligned with degraded images while predicting degraded types for complex scenarios. Extensive experiments across eleven diverse degradation settings demonstrate VL-UR's state-of-the-art performance, robustness, and adaptability. This positions VL-UR as a transformative solution for modern image restoration challenges in dynamic, real-world environments.

Title: DaemonSec: Examining the Role of Machine Learning for Daemon Security in Linux Environments

Authors: Sheikh Muhammad Farjad
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2504.08227
Pdf URL: https://arxiv.org/pdf/2504.08227
Copy Paste: [[2504.08227]] DaemonSec: Examining the Role of Machine Learning for Daemon Security in Linux Environments(https://arxiv.org/abs/2504.08227)
Keywords: security, protect, defense, attack
Abstract: DaemonSec is an early-stage startup exploring machine learning (ML)-based security for Linux daemons, a critical yet often overlooked attack surface. While daemon security remains underexplored, conventional defenses struggle against adaptive threats and zero-day exploits. To assess the perspectives of IT professionals on ML-driven daemon protection, a systematic interview study based on semi-structured interviews was conducted with 22 professionals from industry and academia. The study evaluates adoption, feasibility, and trust in ML-based security solutions. While participants recognized the potential of ML for real-time anomaly detection, findings reveal skepticism toward full automation, limited security awareness among non-security roles, and concerns about patching delays creating attack windows. This paper presents the methods, key findings, and implications for advancing ML-driven daemon security in industry.

Title: Out of Style: RAG's Fragility to Linguistic Variation

Authors: Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, Maarten Sap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08231
Pdf URL: https://arxiv.org/pdf/2504.08231
Copy Paste: [[2504.08231]] Out of Style: RAG's Fragility to Linguistic Variation(https://arxiv.org/abs/2504.08231)
Keywords: robust
Abstract: Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions.
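
The "relative performance drop" figures quoted above are ratios against the unperturbed baseline rather than absolute point differences. A minimal sketch of that arithmetic follows; the function name `relative_drop` and the example scores are hypothetical and are not taken from the paper.

```python
def relative_drop(baseline: float, perturbed: float) -> float:
    """Return the percentage drop of `perturbed` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - perturbed) / baseline


# Hypothetical scores: a Recall@5 of 0.50 on original queries that falls to
# 0.40 on rewritten queries is a 10-point absolute drop but a 20% relative drop.
drop = relative_drop(0.50, 0.40)
```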

Title: Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner

Authors: Liu Xiao, Li Zhiyuan, Lin Yueyu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.08247
Pdf URL: https://arxiv.org/pdf/2504.08247
Copy Paste: [[2504.08247]] Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner(https://arxiv.org/abs/2504.08247)
Keywords: transformer
Abstract: State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures, achieving linear complexity while demonstrating greater expressive power in short-context scenarios and enabling state tracking beyond the $\text{TC}^0$ complexity class. However, RWKV-7 lacks mechanisms for token-parameter interactions and native scalability, limiting its adaptability and growth without retraining. In this paper, we propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach, integrating token-parameter interactions through a Self-State Encoder (SSE) mechanism. The SSE repurposes a portion of the RWKV-7 Weighted Key-Value (WKV) state as transformation weights to encode token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive property of token processing. Meta-State supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. Our approach bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling with linear complexity and constant memory usage.

Title: Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images

Authors: Jinghe Yang, Mingming Gong, Ye Pu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08253
Pdf URL: https://arxiv.org/pdf/2504.08253
Copy Paste: [[2504.08253]] Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images(https://arxiv.org/abs/2504.08253)
Keywords: robust, extraction
Abstract: Autonomous Underwater Vehicles (AUVs) play a crucial role in underwater exploration. Vision-based methods offer cost-effective solutions for localization and mapping in the absence of conventional sensors like GPS and LIDAR. However, underwater environments present significant challenges for feature extraction and matching due to image blurring and noise caused by attenuation, scattering, and the interference of marine snow. In this paper, we aim to improve the robustness of feature extraction and matching in turbid underwater environments using a cross-modal knowledge distillation method that transfers in-air feature extraction models to underwater settings using synthetic underwater images as the medium. We first propose a novel adaptive GAN-synthesis method to estimate water parameters and underwater noise distribution, to generate environment-specific synthetic underwater images. We then introduce a general knowledge distillation framework compatible with different teacher models. The evaluation of GAN-based synthesis highlights the significance of the new components, i.e. GAN-synthesized noise and forward scattering, in the proposed model. Additionally, the downstream application of feature extraction and matching (VSLAM) on real underwater sequences validates the effectiveness of the transferred model.

Title: Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy

Authors: Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08254
Pdf URL: https://arxiv.org/pdf/2504.08254
Copy Paste: [[2504.08254]] Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy(https://arxiv.org/abs/2504.08254)
Keywords: privacy, attack, extraction, membership infer, generative
Abstract: Privacy attacks, particularly membership inference attacks (MIAs), are widely used to assess the privacy of generative models for tabular synthetic data, including those with Differential Privacy (DP) guarantees. These attacks often exploit outliers, which are especially vulnerable due to their position at the boundaries of the data domain (e.g., at the minimum and maximum values). However, the role of data domain extraction in generative models and its impact on privacy attacks have been overlooked. In this paper, we examine three strategies for defining the data domain: assuming it is externally provided (ideally from public data), extracting it directly from the input data, and extracting it with DP mechanisms. While common in popular implementations and libraries, we show that the second approach breaks end-to-end DP guarantees and leaves models vulnerable. While using a provided domain (if representative) is preferable, extracting it with DP can also defend against popular MIAs, even at high privacy budgets.

Title: CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model

Authors: Ruohao Zhan, Yijin Li, Yisheng He, Shuo Chen, Yichen Shen, Xinyu Chen, Zilong Dong, Zhaoyang Huang, Guofeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08259
Pdf URL: https://arxiv.org/pdf/2504.08259
Copy Paste: [[2504.08259]] CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model(https://arxiv.org/abs/2504.08259)
Keywords: diffusion, generative
Abstract: Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists, yet sketch generation remains unexplored despite advancements in generative models. We propose a novel framework CoProSketch, providing prominent controllability and details for sketch generation with diffusion models. A straightforward method is fine-tuning a pretrained image generation diffusion model with binarized sketch images. However, we find that the diffusion models fail to generate clear binary images, which makes the produced sketches chaotic. We thus propose to represent the sketches by unsigned distance field (UDF), which is continuous and can be easily decoded to sketches through a lightweight network. With CoProSketch, users generate a rough sketch from a bounding box and a text prompt. The rough sketch can be manually edited and fed back into the model for iterative refinement and will be decoded to a detailed sketch as the final result. Additionally, we curate the first large-scale text-sketch paired dataset as the training data. Experiments demonstrate superior semantic consistency and controllability over baselines, offering a practical solution for integrating user feedback into generative workflows.

Title: Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

Authors: Yonchanok Khaokaew, Flora D. Salim, Andreas Züfle, Hao Xue, Taylor Anderson, Matthew Scotch, David J Heslop
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08260
Pdf URL: https://arxiv.org/pdf/2504.08260
Copy Paste: [[2504.08260]] Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare(https://arxiv.org/abs/2504.08260)
Keywords: privacy, generative, large language model
Abstract: Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. However, Llama 3 captures variations across race and income more accurately but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.

Title: To See or Not to See -- Fingerprinting Devices in Adversarial Environments Amid Advanced Machine Learning

Authors: Justin Feng, Nader Sehatbakhsh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08264
Pdf URL: https://arxiv.org/pdf/2504.08264
Copy Paste: [[2504.08264]] To See or Not to See -- Fingerprinting Devices in Adversarial Environments Amid Advanced Machine Learning(https://arxiv.org/abs/2504.08264)
Keywords: secure, security, attack, generative
Abstract: The increasing use of the Internet of Things raises security concerns. To address this, device fingerprinting is often employed to authenticate devices, detect adversaries, and identify eavesdroppers in an environment. This requires the ability to discern between legitimate and malicious devices which is achieved by analyzing the unique physical and/or operational characteristics of IoT devices. In the era of the latest progress in machine learning, particularly generative models, it is crucial to methodically examine the current studies in device fingerprinting. This involves explaining their approaches and underscoring their limitations when faced with adversaries armed with these ML tools. To systematically analyze existing methods, we propose a generic, yet simplified, model for device fingerprinting. Additionally, we thoroughly investigate existing methods to authenticate devices and detect eavesdropping, using our proposed model. We further study trends and similarities between works in authentication and eavesdropping detection and present the existing threats and attacks in these domains. Finally, we discuss future directions in fingerprinting based on these trends to develop more secure IoT fingerprinting schemes.

Title: VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

Authors: Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.08269
Pdf URL: https://arxiv.org/pdf/2504.08269
Copy Paste: [[2504.08269]] VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering(https://arxiv.org/abs/2504.08269)
Keywords: transformer
Abstract: The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion, and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model's capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. On MultimodalQA validation set, VLMT-Large achieves 76.5% Exact Match and 80.1% F1, outperforming the previous state-of-the-art by +9.1% in Exact Match and +8.8% in F1. On WebQA, it attains a QA score of 47.6, surpassing prior models such as PERQA by +3.2. These results highlight VLMT's strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems.
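
The abstract does not spell out how the "relative threshold with top-k strategy" is defined. One plausible reading, sketched below purely as an assumption, keeps documents whose relevance score is at least a fixed fraction of the best score and then caps the result at k items. The function name `select_contexts`, the `ratio` parameter, and the example scores are all hypothetical, and scores are assumed to be non-negative.

```python
from typing import Dict, List


def select_contexts(scores: Dict[str, float],
                    ratio: float = 0.5,
                    k: int = 3) -> List[str]:
    """Keep documents scoring at least ratio * best score, at most k of them."""
    if not scores or k <= 0:
        return []
    # Rank documents from most to least relevant.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    cutoff = ratio * ranked[0][1]  # threshold relative to the top score
    kept = [doc_id for doc_id, score in ranked if score >= cutoff]
    return kept[:k]


# Hypothetical reranker scores for four candidate documents.
selected = select_contexts({"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1}, ratio=0.5, k=2)
```

A relative cutoff adapts to how confident the reranker is on a given question, while the top-k cap bounds the context length passed to the answering model.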

Title: Palmprint De-Identification Using Diffusion Model for High-Quality and Diverse Synthesis

Authors: Licheng Yan, Bob Zhang, Andrew Beng Jin Teoh, Lu Leng, Shuyi Li, Yuqi Wang, Ziyuan Yang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.08272
Pdf URL: https://arxiv.org/pdf/2504.08272
Copy Paste: [[2504.08272]] Palmprint De-Identification Using Diffusion Model for High-Quality and Diverse Synthesis(https://arxiv.org/abs/2504.08272)
Keywords: diffusion
Abstract: Palmprint recognition techniques have advanced significantly in recent years, enabling reliable recognition even when palmprints are captured in uncontrolled or challenging environments. However, this strength also introduces new risks, as publicly available palmprint images can be misused by adversaries for malicious activities. Despite this growing concern, research on methods to obscure or anonymize palmprints remains largely unexplored. Thus, it is essential to develop a palmprint de-identification technique capable of removing identity-revealing features while retaining the image's utility and preserving non-sensitive information. In this paper, we propose a training-free framework that utilizes pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features for de-identification purposes. To ensure greater stability and controllability in the synthesis process, we incorporate a semantic-guided embedding fusion alongside a prior interpolation mechanism. We further propose the de-identification ratio, a novel metric for intuitive de-identification assessment. Extensive experiments across multiple palmprint datasets and recognition methods demonstrate that our method effectively conceals identity-related traits with significant diversity across de-identified samples. The de-identified samples preserve high visual fidelity and maintain excellent usability, achieving a balance between de-identification and retaining non-identity information.

Title: PNE-SGAN: Probabilistic NDT-Enhanced Semantic Graph Attention Network for LiDAR Loop Closure Detection

Authors: Xiong Li, Shulei Liu, Xingning Chen, Yisong Wu, Dong Zhu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.08280
Pdf URL: https://arxiv.org/pdf/2504.08280
Copy Paste: [[2504.08280]] PNE-SGAN: Probabilistic NDT-Enhanced Semantic Graph Attention Network for LiDAR Loop Closure Detection(https://arxiv.org/abs/2504.08280)
Keywords: robust
Abstract: LiDAR loop closure detection (LCD) is crucial for consistent Simultaneous Localization and Mapping (SLAM) but faces challenges in robustness and accuracy. Existing methods, including semantic graph approaches, often suffer from coarse geometric representations and lack temporal robustness against noise, dynamics, and viewpoint changes. We introduce PNE-SGAN, a Probabilistic NDT-Enhanced Semantic Graph Attention Network, to overcome these limitations. PNE-SGAN enhances semantic graphs by using Normal Distributions Transform (NDT) covariance matrices as rich, discriminative geometric node features, processed via a Graph Attention Network (GAT). Crucially, it integrates graph similarity scores into a probabilistic temporal filtering framework (modeled as an HMM/Bayes filter), incorporating uncertain odometry for motion modeling and utilizing forward-backward smoothing to effectively handle ambiguities. Evaluations on challenging KITTI sequences (00 and 08) demonstrate state-of-the-art performance, achieving Average Precision of 96.2% and 95.1%, respectively. PNE-SGAN significantly outperforms existing methods, particularly in difficult bidirectional loop scenarios where others falter. By synergizing detailed NDT geometry with principled probabilistic temporal reasoning, PNE-SGAN offers a highly accurate and robust solution for LiDAR LCD, enhancing SLAM reliability in complex, large-scale environments.

Title: ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation

Authors: Vishal Gandhi, Sagar Gandhi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08281
Pdf URL: https://arxiv.org/pdf/2504.08281
Copy Paste: [[2504.08281]] ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation(https://arxiv.org/abs/2504.08281)
Keywords: interpretability, large language model
Abstract: Advancements in emotion-aware language processing increasingly shape vital NLP applications ranging from conversational AI and affective computing to computational psychology and creative content generation. Existing emotion datasets either lack emotional granularity or fail to capture necessary stylistic diversity, limiting the advancement of effective emotion-conditioned text generation systems. Seeking to bridge this crucial gap between granularity and style diversity, this paper introduces a novel, systematically constructed dataset named ELSA (Emotion and Language Style Alignment Dataset), leveraging fine-grained emotion taxonomies adapted from existing sources such as the dair-ai emotion dataset and the GoEmotions taxonomy. This dataset comprises multiple emotionally nuanced variations of original sentences regenerated across distinct contextual styles such as conversational, formal, poetic, and narrative, using advanced Large Language Models (LLMs). Rigorous computational evaluation using metrics such as perplexity, embedding variance, readability, lexical diversity, and semantic coherence measures validates the dataset's emotional authenticity, linguistic fluency, and textual diversity. Comprehensive metric analyses affirm its potential to support deeper explorations into emotion-conditioned, style-adaptive text generation. By enabling precision-tuned, emotionally nuanced language modeling, our dataset creates fertile ground for research on fine-grained emotional control, prompt-driven explanation, interpretability, and style-adaptive expressive language generation with LLMs.

Title: DreamFuse: Adaptive Image Fusion with Diffusion Transformer

Authors: Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08291
Pdf URL: https://arxiv.org/pdf/2504.08291
Copy Paste: [[2504.08291]] DreamFuse: Adaptive Image Fusion with Diffusion Transformer(https://arxiv.org/abs/2504.08291)
Keywords: diffusion, transformer
Abstract: Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.

Title: Generative AI for Film Creation: A Survey of Recent Advances

Authors: Ruihan Zhang, Borou Yu, Jiajian Min, Yetong Xin, Zheng Wei, Juncheng Nemo Shi, Mingzhen Huang, Xianghao Kong, Nix Liu Xin, Shanshan Jiang, Praagya Bahuguna, Mark Chan, Khushi Hora, Lijian Yang, Yongqi Liang, Runhe Bian, Yunlei Liu, Isabela Campillo Valencia, Patricia Morales Tredinick, Ilia Kozlov, Sijia Jiang, Peiwen Huang, Na Chen, Xuanxuan Liu, Anyi Rao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08296
Pdf URL: https://arxiv.org/pdf/2504.08296
Copy Paste: [[2504.08296]] Generative AI for Film Creation: A Survey of Recent Advances(https://arxiv.org/abs/2504.08296)
Keywords: diffusion, generative
Abstract: Generative AI (GenAI) is transforming filmmaking, equipping artists with tools like text-to-image and image-to-video diffusion, neural radiance fields, avatar generation, and 3D synthesis. This paper examines the adoption of these technologies in filmmaking, analyzing workflows from recent AI-driven films to understand how GenAI contributes to character creation, aesthetic styling, and narration. We explore key strategies for maintaining character consistency, achieving stylistic coherence, and ensuring motion continuity. Additionally, we highlight emerging trends such as the growing use of 3D generation and the integration of real footage with AI-generated elements. Beyond technical advancements, we examine how GenAI is enabling new artistic expressions, from generating hard-to-shoot footage to dreamlike diffusion-based morphing effects, abstract visuals, and unworldly objects. We also gather artists' feedback on challenges and desired improvements, including consistency, controllability, fine-grained editing, and motion refinement. Our study provides insights into the evolving intersection of AI and filmmaking, offering a roadmap for researchers and artists navigating this rapidly expanding field.

Title: Large language models could be rote learners

Authors: Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08300
Pdf URL: https://arxiv.org/pdf/2504.08300
Copy Paste: [[2504.08300]] Large language models could be rote learners(https://arxiv.org/abs/2504.08300)
Keywords: large language model
Abstract: Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).

Title: STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Authors: Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08306
Pdf URL: https://arxiv.org/pdf/2504.08306
Copy Paste: [[2504.08306]] STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge(https://arxiv.org/abs/2504.08306)
Keywords: segmentation
Abstract: Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by "imaplus". By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, the STSeg solution demonstrates remarkable advantages in handling complex object motions and long-video sequences. In the inference phase, an Adaptive Pseudo-labels Guided Model Refinement Pipeline is adopted to intelligently select appropriate models for processing each video. Through finetuning the models and employing the Adaptive Pseudo-labels Guided Model Refinement Pipeline in the inference phase, the STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing the 1st place and advancing the technology for video object segmentation in complex scenarios.

Title: DSM: Building A Diverse Semantic Map for 3D Visual Grounding

Authors: Qinghongbing Xie, Zijian Liang, Long Zeng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.08307
Pdf URL: https://arxiv.org/pdf/2504.08307
Copy Paste: [[2504.08307]] DSM: Building A Diverse Semantic Map for 3D Visual Grounding(https://arxiv.org/abs/2504.08307)
Keywords: extraction, large language model, segmentation
Abstract: In recent years, with the growing research and application of multimodal large language models (VLMs) in robotics, there has been an increasing trend of utilizing VLMs for robotic scene understanding tasks. Existing approaches that use VLMs for 3D Visual Grounding tasks often focus on obtaining scene information through geometric and visual information, overlooking the extraction of diverse semantic information from the scene and the understanding of rich implicit semantic attributes, such as appearance, physics, and affordance. The 3D scene graph, which combines geometry and language, is an ideal representation method for environmental perception and is an effective carrier for language models in 3D Visual Grounding tasks. To address these issues, we propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks. This method leverages VLMs to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. We enhance the understanding of grounding information based on DSM and introduce a novel approach named DSM-Grounding. Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art. In addition, we have deployed this method on robots to validate its effectiveness in navigation and grasping tasks.

Title: SortBench: Benchmarking LLMs based on their ability to sort lists

Authors: Steffen Herbold
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08312
Pdf URL: https://arxiv.org/pdf/2504.08312
Copy Paste: [[2504.08312]] SortBench: Benchmarking LLMs based on their ability to sort lists(https://arxiv.org/abs/2504.08312)
Keywords: fair, large language model
Abstract: Sorting is a tedious but simple task for human intelligence and can be solved fairly easily algorithmically. However, for Large Language Models (LLMs) this task is surprisingly hard, as some properties of sorting are among known weaknesses of LLMs: being faithful to the input data, logical comparisons between values, and strictly differentiating between syntax (used for sorting) and semantics (typically learned by embeddings). Within this paper, we describe the new SortBench benchmark for LLMs that comes with different difficulties and that can be easily scaled in terms of difficulty. We apply this benchmark to seven state-of-the-art LLMs, including current test-time reasoning models. Our results show that while the o3-mini model is very capable at sorting in general, even it can be fooled if strings are defined to mix syntactical and semantical aspects, e.g., by asking to sort numbers written out as words. Furthermore, all models have problems with the faithfulness to the input of long lists, i.e., they drop items and add new ones. Our results also show that test-time reasoning has a tendency to overthink problems, which leads to performance degradation. Finally, models without test-time reasoning like GPT-4o are not much worse than reasoning models.
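
The two failure modes named in the abstract above (unfaithful output and the syntax-versus-semantics confusion) can be checked mechanically. The sketch below is only an illustration: `evaluate_sort`, the `WORD_VALUES` table, and the sample model answer are hypothetical and are not SortBench's actual metrics or data.

```python
from collections import Counter


def evaluate_sort(original, model_output, key=None):
    """Report whether an answer is sorted and faithful to the input list."""
    missing = Counter(original) - Counter(model_output)  # items the model dropped
    added = Counter(model_output) - Counter(original)    # items the model invented
    keys = [key(x) if key else x for x in model_output]
    is_sorted = all(a <= b for a, b in zip(keys, keys[1:]))
    return {
        "sorted": is_sorted,
        "faithful": not missing and not added,
        "dropped": sorted(missing.elements()),
        "added": sorted(added.elements()),
    }


# Numbers written out as words: alphabetical (syntactic) order differs
# from numeric (semantic) order.
WORD_VALUES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
words = ["three", "one", "five", "two"]
alphabetical = sorted(words)                 # order by spelling
numeric = sorted(words, key=WORD_VALUES.get)  # order by numeric value

# A made-up model answer that is correctly ordered but drops "three".
report = evaluate_sort(words, ["one", "two", "five"], key=WORD_VALUES.get)
```

Comparing multisets rather than sets means duplicated items that go missing are also counted as dropped.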

Title: Practical Secure Aggregation by Combining Cryptography and Trusted Execution Environments

Authors: Romain de Laage, Peterson Yuhala, François-Xavier Wicht, Pascal Felber, Christian Cachin, Valerio Schiavoni
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08325
Pdf URL: https://arxiv.org/pdf/2504.08325
Copy Paste: [[2504.08325]] Practical Secure Aggregation by Combining Cryptography and Trusted Execution Environments(https://arxiv.org/abs/2504.08325)
Keywords: secure, security, privacy
Abstract: Secure aggregation enables a group of mutually distrustful parties, each holding private inputs, to collaboratively compute an aggregate value while preserving the privacy of their individual inputs. However, a major challenge in adopting secure aggregation approaches for practical applications is the significant computational overhead of the underlying cryptographic protocols, e.g. fully homomorphic encryption. This overhead makes secure aggregation protocols impractical, especially for large datasets. In contrast, hardware-based security techniques such as trusted execution environments (TEEs) enable computation at near-native speeds, making them a promising alternative for reducing the computational burden typically associated with purely cryptographic techniques. Yet, in many scenarios, parties may opt for either cryptographic or hardware-based security mechanisms, highlighting the need for hybrid approaches. In this work, we introduce several secure aggregation architectures that integrate both cryptographic and TEE-based techniques, analyzing the trade-offs between security and performance.

Title: EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

Authors: Renda Li, Xiaohua Qi, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08344
Pdf URL: https://arxiv.org/pdf/2504.08344
Copy Paste: [[2504.08344]] EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model(https://arxiv.org/abs/2504.08344)
Keywords: diffusion
Abstract: Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. To improve generation quality, previous works adopted complex input and training strategies and required large amounts of data for pre-training, which made them inconvenient for practical applications. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without the need for additional temporal training. The entire model makes use of existing pre-trained weights, and only a few thousand frames of data are needed for each character at a time to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.

Title: Geometric Consistency Refinement for Single Image Novel View Synthesis via Test-Time Adaptation of Diffusion Models

Authors: Josef Bengtson, David Nilsson, Fredrik Kahl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08348
Pdf URL: https://arxiv.org/pdf/2504.08348
Copy Paste: [[2504.08348]] Geometric Consistency Refinement for Single Image Novel View Synthesis via Test-Time Adaptation of Diffusion Models(https://arxiv.org/abs/2504.08348)
Keywords: diffusion
Abstract: Diffusion models for single image novel view synthesis (NVS) can generate highly realistic and plausible images, but they are limited in the geometric consistency to the given relative poses. The generated images often show significant errors with respect to the epipolar constraints that should be fulfilled, as given by the target pose. In this paper we address this issue by proposing a methodology to improve the geometric correctness of images generated by a diffusion model for single image NVS. We formulate a loss function based on image matching and epipolar constraints, and optimize the starting noise in a diffusion sampling process such that the generated image should both be a realistic image and fulfill geometric constraints derived from the given target pose. Our method does not require training data or fine-tuning of the diffusion models, and we show that we can apply it to multiple state-of-the-art models for single image NVS. The method is evaluated on the MegaScenes dataset and we show that geometric consistency is improved compared to the baseline models while retaining the quality of the generated images.
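
The epipolar constraint that the abstract above builds its loss on can be made concrete with a small numeric sketch. It assumes calibrated (normalized) image coordinates and the convention X2 = R X1 + t, so the essential matrix is E = [t]x R and a correct match satisfies x2^T E x1 = 0. The function names and the example pose are illustrative assumptions; the paper's loss, which also involves image matching, is not reproduced here.

```python
def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def epipolar_residual(x1, x2, rotation, translation):
    """Algebraic residual x2^T E x1 with E = [t]x R (zero for a perfect match)."""
    essential = matmul(skew(translation), rotation)
    e_x1 = [sum(essential[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * e_x1[i] for i in range(3))


# Assumed relative pose: no rotation, unit translation along the x axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)

# A 3D point seen in both views, projected to normalized homogeneous coordinates.
point_cam1 = (0.5, 0.2, 4.0)
point_cam2 = tuple(p + d for p, d in zip(point_cam1, t))  # X2 = R X1 + t with R = I
x1 = tuple(c / point_cam1[2] for c in point_cam1)
x2 = tuple(c / point_cam2[2] for c in point_cam2)

consistent = epipolar_residual(x1, x2, identity, t)
# Shifting the second observation off its epipolar line gives a nonzero residual.
inconsistent = epipolar_residual(x1, (x2[0], x2[1] + 0.05, 1.0), identity, t)
```

A loss in this spirit would penalize the magnitude of such residuals over matched points between the source image and the generated view.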

Title: An Adaptive Clustering Scheme for Client Selections in Communication-Efficient Federated Learning

Authors: Yan-Ann Chen, Guan-Lin Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08356
Pdf URL: https://arxiv.org/pdf/2504.08356
Copy Paste: [[2504.08356]] An Adaptive Clustering Scheme for Client Selections in Communication-Efficient Federated Learning(https://arxiv.org/abs/2504.08356)
Keywords: federate
Abstract: Federated learning is a novel decentralized learning architecture. During the training process, the clients and server must continuously upload and receive model parameters, which consumes a lot of network transmission resources. Some methods use clustering to find more representative clients and select only a subset of them for training, while still ensuring training accuracy. However, in federated learning, it is not trivial to know which number of clusters yields the best training result. Therefore, we propose to dynamically adjust the number of clusters to find the most suitable grouping. This can reduce the number of clients participating in training, lowering communication costs without affecting model performance. We verify the approach experimentally on a non-IID handwritten digit recognition dataset and reduce the cost of communication and transmission by almost 50% compared with traditional federated learning without affecting the accuracy of the model.

Title: SN-LiDAR: Semantic Neural Fields for Novel Space-time View LiDAR Synthesis

Authors: Yi Chen, Tianchen Deng, Wentao Zhao, Xiaoning Wang, Wenqian Xi, Weidong Chen, Jingchuan Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.08361
Pdf URL: https://arxiv.org/pdf/2504.08361
Copy Paste: [[2504.08361]] SN-LiDAR: Semantic Neural Fields for Novel Space-time View LiDAR Synthesis(https://arxiv.org/abs/2504.08361)
Keywords: segmentation
Abstract: Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on this https URL.

Title: Proofs as Explanations: Short Certificates for Reliable Predictions

Authors: Avrim Blum, Steve Hanneke, Chirag Pabbaraju, Donya Saless
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08377
Pdf URL: https://arxiv.org/pdf/2504.08377
Copy Paste: [[2504.08377]] Proofs as Explanations: Short Certificates for Reliable Predictions(https://arxiv.org/abs/2504.08377)
Keywords: robust
Abstract: We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it exists) such that all classifiers $h' \in H$ that make at most $b$ mistakes on $S'$ predict $h'(x)=y$. Such a set $S'$ serves as a proof that $x$ indeed has label $y$ under the assumption that (1) the target function $h^\star$ belongs to $H$, and (2) the set $S$ contains at most $b$ corrupted points. For example, if $b=0$ and $H$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and hence every consistent linear classifier labels $x$ as positive), then Carathéodory's theorem states that $x$ lies inside the convex hull of $d+1$ of those points. So, a set $S'$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of realizability. In this work, we consider this problem more generally, for general hypothesis classes $H$ and general values $b\geq 0$. We define the notion of the robust hollow star number of $H$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as distribution-dependent bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the certificate coefficient $\varepsilon_x$ of an example $x$ with respect to a data distribution $D$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\varepsilon_x$, $b$, and the VC dimension $d$ of $H$.
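
The Carathéodory example in the abstract above can be made tangible in the plane (d = 2), where a certificate consists of d + 1 = 3 positive points whose triangle contains x. The brute-force search below is a sketch for illustration only: `find_certificate` and the sample points are hypothetical, the search is cubic in the number of points, and collinear triples are skipped, so it finds nothing when all positive points lie on one line.

```python
from itertools import combinations


def cross(o, a, b):
    """z-component of (a - o) x (b - o); its sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(x, a, b, c):
    """True if x lies in the closed, non-degenerate triangle abc."""
    d1, d2, d3 = cross(a, b, x), cross(b, c, x), cross(c, a, x)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def find_certificate(x, positives):
    """Return three positive points whose triangle contains x, or None."""
    for a, b, c in combinations(positives, 3):
        if cross(a, b, c) != 0 and in_triangle(x, a, b, c):
            return (a, b, c)
    return None


# Made-up positive training points in the plane.
positives = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 5)]
certificate = find_certificate((3, 1), positives)     # inside the convex hull
no_certificate = find_certificate((6, 6), positives)  # outside the convex hull
```

Any linear classifier that labels the three returned points positive must also label x positive, which is what makes the triple a short proof under the realizability assumption with b = 0.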

Title: Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Authors: Fucheng Jia, Zewen Wu, Shiqi Jiang, Huiqiang Jiang, Qianxi Zhang, Yuqing Yang, Yunxin Liu, Ju Ren, Deyu Zhang, Ting Cao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08378
Pdf URL: https://arxiv.org/pdf/2504.08378
Copy Paste: [[2504.08378]] Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash(https://arxiv.org/abs/2504.08378)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.

Title: Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking

Authors: Huu-Loc Tran, Tinh-Anh Nguyen-Nhu, Huu-Phong Phan-Nguyen, Tien-Huy Nguyen, Nhat-Minh Nguyen-Dich, Anh Dao, Huy-Duc Do, Quan Nguyen, Hoang M. Le, Quang-Vinh Dinh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08384
Pdf URL: https://arxiv.org/pdf/2504.08384
Copy Paste: [[2504.08384]] Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking(https://arxiv.org/abs/2504.08384)
Keywords: robust, interpretability
Abstract: Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
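
Two of the ideas in the abstract above, combining scores from a coarse and a fine model and stabilizing rankings with neighboring-frame context, can be sketched in a few lines. The abstract does not give the actual formulas, so the weighted average in `fuse_scores`, the neighbor blending in `temporal_rerank`, the weights, and the scores are all assumptions chosen for illustration.

```python
def fuse_scores(coarse, fine, weight=0.5):
    """Weighted average of two models' per-frame scores (weight is hypothetical)."""
    return [weight * c + (1 - weight) * f for c, f in zip(coarse, fine)]


def temporal_rerank(scores, window=1, self_weight=0.6):
    """Blend each frame's score with the mean score of its temporal neighbours."""
    reranked = []
    for i, score in enumerate(scores):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        neighbours = [scores[j] for j in range(lo, hi) if j != i]
        context = sum(neighbours) / len(neighbours) if neighbours else score
        reranked.append(self_weight * score + (1 - self_weight) * context)
    return reranked


# Made-up similarity scores for six consecutive keyframes.
coarse = [0.1, 0.9, 0.1, 0.6, 0.8, 0.6]
fine = [0.3, 0.9, 0.1, 0.6, 0.6, 0.6]

fused = fuse_scores(coarse, fine)
reranked = temporal_rerank(fused)

best_before = fused.index(max(fused))
best_after = reranked.index(max(reranked))
```

In this toy input, frame 1 is an isolated spike while frames 3 to 5 score consistently well, so the top-ranked frame moves from the spike to the middle of the consistent run after reranking.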

Title: PCA-RAG: Principal Component Analysis for Efficient Retrieval-Augmented Generation

Authors: Arman Khaledian, Amirreza Ghadiridehkordi, Nariman Khaledian
Subjects: cs.LG, cs.AI, cs.IR, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08386
Pdf URL: https://arxiv.org/pdf/2504.08386
Copy Paste: [[2504.08386]] PCA-RAG: Principal Component Analysis for Efficient Retrieval-Augmented Generation(https://arxiv.org/abs/2504.08386)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for grounding large language models in external knowledge sources, improving the precision of agents' responses. However, high-dimensional language model embeddings, often in the range of hundreds to thousands of dimensions, can present scalability challenges in terms of storage and latency, especially when processing massive financial text corpora. This paper investigates the use of Principal Component Analysis (PCA) to reduce embedding dimensionality, thereby mitigating computational bottlenecks without incurring large accuracy losses. We experiment with a real-world dataset and compare different similarity and distance metrics under both full-dimensional and PCA-compressed embeddings. Our results show that reducing vectors from 3,072 to 110 dimensions provides a sizeable (up to $60\times$) speedup in retrieval operations and a $\sim 28.6\times$ reduction in index size, with only moderate declines in correlation metrics relative to human-annotated similarity scores. These findings demonstrate that PCA-based compression offers a viable balance between retrieval fidelity and resource efficiency, essential for real-time systems such as Zanista AI's Newswitch platform. Ultimately, our study underscores the practicality of leveraging classical dimensionality reduction techniques to scale RAG architectures for knowledge-intensive applications in finance and trading, where speed, memory efficiency, and accuracy must jointly be optimized.

Title: MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Authors: Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08388
Pdf URL: https://arxiv.org/pdf/2504.08388
Copy Paste: [[2504.08388]] MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft(https://arxiv.org/abs/2504.08388)
Keywords: diffusion, transformer
Abstract: World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, we transform visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer, respectively, and construct the model input by interleaving the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models of different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action-following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, which significantly outperforms SoTA open-source diffusion-based world models. The code and model have been released.
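
The input construction described in the abstract above (frame tokens and action tokens interleaved into one sequence, trained with next-token prediction) can be sketched as follows. The token ids, the number of tokens per frame, and the frame-then-action ordering are assumptions made for illustration; the actual tokenizers and vocabulary layout are not specified in the abstract.

```python
def interleave(frame_tokens, action_tokens):
    """Build one sequence: tokens of frame 0, action 0, frame 1, action 1, ..."""
    if len(frame_tokens) != len(action_tokens):
        raise ValueError("need one action-token group per frame")
    sequence = []
    for frame, action in zip(frame_tokens, action_tokens):
        sequence.extend(frame)
        sequence.extend(action)
    return sequence


def next_token_targets(sequence):
    """Shift by one position: the model sees inputs[i] and predicts targets[i]."""
    return sequence[:-1], sequence[1:]


# Made-up ids: three image tokens per frame, one action token per step.
frames = [[11, 12, 13], [14, 15, 16]]
actions = [[901], [902]]

sequence = interleave(frames, actions)
inputs, targets = next_token_targets(sequence)
```

With this layout, the tokens of each new frame are predicted after the previous frame and its action have been seen, which is how the sequence conditions future scenes on actions.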

Title: Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Authors: Yin Jou Huang, Rafik Hadfi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08399
Pdf URL: https://arxiv.org/pdf/2504.08399
Copy Paste: [[2504.08399]] Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models(https://arxiv.org/abs/2504.08399)
Keywords: robust, large language model
Abstract: There is a growing interest in assessing the personality traits of Large language models (LLMs). However, traditional personality assessments based on self-report questionnaires may fail to capture their true behavioral nuances due to inherent biases and meta-knowledge contamination. This paper introduces a novel multi-observer framework for LLM personality assessment that draws inspiration from informant-report methods in psychology. Instead of relying solely on self-assessments, our approach employs multiple observer agents configured with a specific relationship context (e.g., family, friend, or workplace) to simulate interactive scenarios with a subject LLM. These observers engage in dialogues and subsequently provide ratings across the Big Five personality dimensions. Our experiments reveal that LLMs possess systematic biases in self-report personality ratings. Moreover, aggregating observer ratings effectively reduces non-systematic biases and achieves optimal reliability with 5-7 observers. The findings highlight the significant impact of relationship context on personality perception and demonstrate that a multi-observer paradigm yields a more robust and context-sensitive evaluation of LLM personality traits.
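
The claim in the abstract above that aggregating observers reduces non-systematic bias follows a familiar statistical pattern, which a toy simulation can illustrate. This is not the paper's experiment: the Gaussian noise model, the true score, the bias value, and `simulate_rating_error` itself are assumptions, and the simulation says nothing about the reported 5-7 observer optimum.

```python
import random
import statistics


def simulate_rating_error(n_observers, true_score=3.5, bias=0.0,
                          noise_sd=1.0, trials=2000, seed=0):
    """Mean and spread of the error of an averaged rating over many trials.

    Each observer reports true_score + bias + independent Gaussian noise.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        ratings = [true_score + bias + rng.gauss(0, noise_sd)
                   for _ in range(n_observers)]
        errors.append(statistics.fmean(ratings) - true_score)
    return statistics.fmean(errors), statistics.pstdev(errors)


# Random (non-systematic) error shrinks as more observers are averaged.
_, spread_one = simulate_rating_error(n_observers=1)
_, spread_seven = simulate_rating_error(n_observers=7)

# A shared (systematic) bias is not removed by averaging.
mean_error_biased, _ = simulate_rating_error(n_observers=7, bias=0.5)
```

Under these assumptions the spread of the averaged rating falls roughly with the square root of the number of observers, while a bias shared by all raters stays in the average.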

Title: A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation

Authors: Dawei Zhou, Suzhi Gang, Decheng Liu, Tongliang Liu, Nannan Wang, Xinbo Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08411
Pdf URL: https://arxiv.org/pdf/2504.08411
Copy Paste: [[2504.08411]] A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation(https://arxiv.org/abs/2504.08411)
Keywords: security, protect, defense
Abstract: Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. To alleviate these issues, adversarial noise-based defenses have been enthusiastically studied in recent years. However, "data-only" methods tend to distort fake samples in the low-level feature space rather than the high-level semantic space, leading to limitations in resisting malicious manipulation. Frontier research has shown that integrating knowledge in deep learning can produce reliable and generalizable solutions. Inspired by these, we propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples. Specifically, in the process of generating adversarial noise, we focus on constructing significant semantic confusions at the domain-specific knowledge level, and exploit a metric closely related to visual perception to replace the general pixel-wise metrics. The generated adversarial noise can actively interfere with the malicious manipulation model by triggering knowledge-guided and perception-related disruptions in the fake samples. To validate the effectiveness of the proposed method, we conduct qualitative and quantitative experiments on human perception and visual quality assessment. The results on two different tasks both show that our defense provides better protection compared to state-of-the-art methods and achieves great generalizability.

Title: Adversarial Examples in Environment Perception for Automated Driving (Review)

Authors: Jun Yan, Huilin Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08414
Pdf URL: https://arxiv.org/pdf/2504.08414
Copy Paste: [[2504.08414]] Adversarial Examples in Environment Perception for Automated Driving (Review)(https://arxiv.org/abs/2504.08414)
Keywords: defense, attack, robust
Abstract: The renaissance of deep learning has led to the massive development of automated driving. However, deep neural networks are vulnerable to adversarial examples. The perturbations of adversarial examples are imperceptible to human eyes but can lead to false predictions by neural networks. This poses a huge risk to artificial intelligence (AI) applications for automated driving. This survey systematically reviews the development of adversarial robustness research over the past decade, including the attack and defense methods and their applications in automated driving. The growth of automated driving pushes forward the realization of trustworthy AI applications. This review lists significant references in the research history of adversarial examples.

Title: seeBias: A Comprehensive Tool for Assessing and Visualizing AI Fairness

Authors: Yilin Ning, Yian Ma, Mingxuan Liu, Xin Li, Nan Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08418
Pdf URL: https://arxiv.org/pdf/2504.08418
Copy Paste: [[2504.08418]] seeBias: A Comprehensive Tool for Assessing and Visualizing AI Fairness(https://arxiv.org/abs/2504.08418)
Keywords: fair
Abstract: Fairness in artificial intelligence (AI) prediction models is increasingly emphasized to support responsible adoption in high-stakes domains such as health care and criminal justice. Guidelines and implementation frameworks highlight the importance of both predictive accuracy and equitable outcomes. However, current fairness toolkits often evaluate classification performance disparities in isolation, with limited attention to other critical aspects such as calibration. To address these gaps, we present seeBias, an R package for comprehensive evaluation of model fairness and predictive performance. seeBias offers an integrated evaluation across classification, calibration, and other performance domains, providing a more complete view of model behavior. It includes customizable visualizations to support transparent reporting and responsible AI implementation. Using public datasets from criminal justice and healthcare, we demonstrate how seeBias supports fairness evaluations, and uncovers disparities that conventional fairness metrics may overlook. The R package is available on GitHub, and a Python version is under development.

Title: GeoTexBuild: 3D Building Model Generation from Map Footprints

Authors: Ruizhe Wang, Junyan Yang, Qiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08419
Pdf URL: https://arxiv.org/pdf/2504.08419
Copy Paste: [[2504.08419]] GeoTexBuild: 3D Building Model Generation from Map Footprints(https://arxiv.org/abs/2504.08419)
Keywords: generative
Abstract: We introduce GeoTexBuild, a modular generative framework for creating 3D building models from map footprints. The proposed framework employs a three-stage process comprising height map generation, geometry reconstruction, and appearance stylization, culminating in building models with intricate geometry and appearance attributes. By integrating customized ControlNet and Text2Mesh models, we explore effective methods for controlling both geometric and visual attributes during the generation process. In this way, we eliminate the problem of structural variations behind a single facade photo that affects existing 3D generation techniques. Experimental results at each stage validate the capability of GeoTexBuild to generate detailed and accurate building models from footprints derived from site planning or map designs. Our framework significantly reduces manual labor in modeling buildings and can offer inspiration for designers.

Title: Customizing Spider Silk: Generative Models with Mechanical Property Conditioning for Protein Engineering

Authors: Neeru Dubey, Elin Karlsson, Miguel Angel Redondo, Johan Reimegård, Anna Rising, Hedvig Kjellström
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08437
Pdf URL: https://arxiv.org/pdf/2504.08437
Copy Paste: [[2504.08437]] Customizing Spider Silk: Generative Models with Mechanical Property Conditioning for Protein Engineering(https://arxiv.org/abs/2504.08437)
Keywords: generative
Abstract: The remarkable mechanical properties of spider silk, including its tensile strength and extensibility, are primarily governed by the repetitive regions of the proteins that constitute the fiber, the major ampullate spidroins (MaSps). However, establishing correlations between mechanical characteristics and repeat sequences is challenging due to the intricate sequence-structure-function relationships of MaSps and the limited availability of annotated datasets. In this study, we present a novel computational framework for designing MaSp repeat sequences with customizable mechanical properties. To achieve this, we developed a lightweight GPT-based generative model by distilling the pre-trained ProtGPT2 protein language model. The distilled model was subjected to multilevel fine-tuning using curated subsets of the Spider Silkome dataset. Specifically, we adapt the model for MaSp repeat generation using 6,000 MaSp repeat sequences and further refine it with 572 repeats associated with experimentally determined fiber-level mechanical properties. Our model generates biologically plausible MaSp repeat regions tailored to specific mechanical properties while also predicting those properties for given sequences. Validation includes sequence-level analysis, assessing physicochemical attributes and expected distribution of key motifs as well as secondary structure compositions. A correlation study using BLAST on the Spider Silkome dataset and a test set of MaSp repeats with known mechanical properties further confirmed the predictive accuracy of the model. This framework advances the rational design of spider silk-inspired biomaterials, offering a versatile tool for engineering protein sequences with tailored mechanical attributes.

Title: SARFormer -- An Acquisition Parameter Aware Vision Transformer for Synthetic Aperture Radar Data

Authors: Jonathan Prexl, Michael Recla, Michael Schmitt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08441
Pdf URL: https://arxiv.org/pdf/2504.08441
Copy Paste: [[2504.08441]] SARFormer -- An Acquisition Parameter Aware Vision Transformer for Synthetic Aperture Radar Data(https://arxiv.org/abs/2504.08441)
Keywords: transformer, segmentation
Abstract: This manuscript introduces SARFormer, a modified Vision Transformer (ViT) architecture designed for processing one or multiple synthetic aperture radar (SAR) images. Given the complex image geometry of SAR data, we propose an acquisition parameter encoding module that significantly guides the learning process, especially in the case of multiple images, leading to improved performance on downstream tasks. We further explore self-supervised pre-training, conduct experiments with limited labeled data, and benchmark our contribution and adaptations thoroughly in ablation experiments against a baseline, where the model is tested on tasks such as height reconstruction and segmentation. Our approach achieves up to 17% improvement in terms of RMSE over baseline models.

Title: Muon-Accelerated Attention Distillation for Real-Time Edge Synthesis via Optimized Latent Diffusion

Authors: Weiye Chen, Qingen Zhu, Qian Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08451
Pdf URL: https://arxiv.org/pdf/2504.08451
Copy Paste: [[2504.08451]] Muon-Accelerated Attention Distillation for Real-Time Edge Synthesis via Optimized Latent Diffusion(https://arxiv.org/abs/2504.08451)
Keywords: diffusion
Abstract: Recent advances in visual synthesis have leveraged diffusion models and attention mechanisms to achieve high-fidelity artistic style transfer and photorealistic text-to-image generation. However, real-time deployment on edge devices remains challenging due to computational and memory constraints. We propose Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation for real-time edge synthesis. By eliminating gradient conflicts through orthogonal parameter updates and dynamic pruning, Muon-AD achieves 3.2 times faster convergence compared to Stable Diffusion-TensorRT, while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Our framework reduces peak memory to 7GB on Jetson Orin and enables 24FPS real-time generation through mixed-precision quantization and curriculum learning. Extensive experiments on COCO-Stuff and ImageNet-Texture demonstrate Muon-AD's Pareto-optimal efficiency-quality trade-offs. Here, we show a 65% reduction in communication overhead during distributed training and real-time 10s/image generation on edge GPUs. These advancements pave the way for democratizing high-quality visual synthesis in resource-constrained environments.

Title: Road Grip Uncertainty Estimation Through Surface State Segmentation

Authors: Jyri Maanpää, Julius Pesonen, Iaroslav Melekhov, Heikki Hyyti, Juha Hyyppä
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08452
Pdf URL: https://arxiv.org/pdf/2504.08452
Copy Paste: [[2504.08452]] Road Grip Uncertainty Estimation Through Surface State Segmentation(https://arxiv.org/abs/2504.08452)
Keywords: robust, segmentation
Abstract: Slippery road conditions pose significant challenges for autonomous driving. Beyond predicting road grip, it is crucial to estimate its uncertainty reliably to ensure safe vehicle control. In this work, we benchmark several uncertainty prediction methods to assess their effectiveness for grip uncertainty estimation. Additionally, we propose a novel approach that leverages road surface state segmentation to predict grip uncertainty. Our method estimates a pixel-wise grip probability distribution based on inferred road surface conditions. Experimental results indicate that the proposed approach enhances the robustness of grip uncertainty prediction.

Title: Cut-and-Splat: Leveraging Gaussian Splatting for Synthetic Data Generation

Authors: Bram Vanherle, Brent Zoomers, Jeroen Put, Frank Van Reeth, Nick Michiels
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08473
Pdf URL: https://arxiv.org/pdf/2504.08473
Copy Paste: [[2504.08473]] Cut-and-Splat: Leveraging Gaussian Splatting for Synthetic Data Generation(https://arxiv.org/abs/2504.08473)
Keywords: diffusion, segmentation
Abstract: Generating synthetic images is a useful method for cheaply obtaining labeled data for training computer vision models. However, obtaining accurate 3D models of relevant objects is necessary, and the resulting images often have a gap in realism due to challenges in simulating lighting effects and camera artifacts. We propose using the novel view synthesis method called Gaussian Splatting to address these challenges. We have developed a synthetic data pipeline for generating high-quality context-aware instance segmentation training data for specific objects. This process is fully automated, requiring only a video of the target object. We train a Gaussian Splatting model of the target object and automatically extract the object from the video. Leveraging Gaussian Splatting, we then render the object on a random background image, and monocular depth estimation is employed to place the object in a believable pose. We introduce a novel dataset to validate our approach and show superior performance over other data generation approaches, such as Cut-and-Paste and Diffusion model-based generation.

Title: Toward Realistic Adversarial Attacks in IDS: A Novel Feasibility Metric for Transferability

Authors: Sabrine Ennaji, Elhadj Benkhelifa, Luigi Vincenzo Mancini
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08480
Pdf URL: https://arxiv.org/pdf/2504.08480
Copy Paste: [[2504.08480]] Toward Realistic Adversarial Attacks in IDS: A Novel Feasibility Metric for Transferability(https://arxiv.org/abs/2504.08480)
Keywords: security, defense, attack, robust
Abstract: Transferability-based adversarial attacks exploit the ability of adversarial examples, crafted to deceive a specific source Intrusion Detection System (IDS) model, to also mislead a target IDS model without requiring access to the training data or any internal model parameters. These attacks exploit common vulnerabilities in machine learning models to bypass security measures and compromise systems. Although the transferability concept has been widely studied, its practical feasibility remains limited due to assumptions of high similarity between source and target models. This paper analyzes the core factors that contribute to transferability, including feature alignment, model architectural similarity, and overlap in the data distributions that each IDS examines. We propose a novel metric, the Transferability Feasibility Score (TFS), to assess the feasibility and reliability of such attacks based on these factors. Through experimental evidence, we demonstrate that TFS and actual attack success rates are highly correlated, addressing the gap between theoretical understanding and real-world impact. Our findings provide needed guidance for designing more realistic transferable adversarial attacks, developing robust defenses, and ultimately improving the security of machine learning-based IDS in critical systems.

Title: A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification

Authors: Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08481
Pdf URL: https://arxiv.org/pdf/2504.08481
Copy Paste: [[2504.08481]] A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification(https://arxiv.org/abs/2504.08481)
Keywords: transformer
Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid CNN-ViT models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model's decision process. We evaluated our method on two medical image classification tasks using color fundus images. Our model not only achieves state-of-the-art predictive performance compared to both black-box and interpretable models but also provides class-specific sparse evidence maps in a single forward pass. The code is available at: this https URL.

Title: An Early Experience with Confidential Computing Architecture for On-Device Model Protection

Authors: Sina Abdollahi, Mohammad Maheri, Sandra Siby, Marios Kogias, Hamed Haddadi
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2504.08508
Pdf URL: https://arxiv.org/pdf/2504.08508
Copy Paste: [[2504.08508]] An Early Experience with Confidential Computing Architecture for On-Device Model Protection(https://arxiv.org/abs/2504.08508)
Keywords: secure, privacy, protect, attack, membership infer
Abstract: Deploying machine learning (ML) models on user devices can improve privacy (by keeping data local) and reduce inference latency. Trusted Execution Environments (TEEs) are a practical solution for protecting proprietary models, yet existing TEE solutions have architectural constraints that hinder on-device model deployment. Arm Confidential Computing Architecture (CCA), a new Arm extension, addresses several of these limitations and shows promise as a secure platform for on-device ML. In this paper, we evaluate the performance-privacy trade-offs of deploying models within CCA, highlighting its potential to enable confidential and efficient ML applications. Our evaluations show that CCA can achieve an overhead of, at most, 22% in running models of different sizes and applications, including image classification, voice recognition, and chat assistants. This performance overhead comes with privacy benefits; for example, our framework can successfully protect the model against membership inference attacks, reducing the adversary's success rate by 8.3%. To support further research and early adoption, we make our code and methodology publicly available.

Title: Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Authors: Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.08531
Pdf URL: https://arxiv.org/pdf/2504.08531
Copy Paste: [[2504.08531]] Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions(https://arxiv.org/abs/2504.08531)
Keywords: large language model
Abstract: We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations are available at this https URL.

Title: Explainability and Continual Learning meet Federated Learning at the Network Edge

Authors: Thomas Tsouparopoulos, Iordanis Koutsopoulos
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08536
Pdf URL: https://arxiv.org/pdf/2504.08536
Copy Paste: [[2504.08536]] Explainability and Continual Learning meet Federated Learning at the Network Edge(https://arxiv.org/abs/2504.08536)
Keywords: privacy, federate, interpretability, explainability
Abstract: As edge devices become more capable and pervasive in wireless networks, there is growing interest in leveraging their collective compute power for distributed learning. However, optimizing learning at the network edge entails unique challenges, particularly when moving beyond conventional settings and objectives. While Federated Learning (FL) has emerged as a key paradigm for distributed model training, critical challenges persist. First, existing approaches often overlook the trade-off between predictive accuracy and interpretability. Second, they struggle to integrate inherently explainable models such as decision trees because their non-differentiable structure makes them not amenable to backpropagation-based training algorithms. Lastly, they lack meaningful mechanisms for continual Machine Learning (ML) model adaptation through Continual Learning (CL) in resource-limited environments. In this paper, we pave the way for a set of novel optimization problems that emerge in distributed learning at the network edge with wirelessly interconnected edge devices, and we identify key challenges and future directions. Specifically, we discuss how Multi-objective optimization (MOO) can be used to address the trade-off between predictive accuracy and explainability when using complex predictive models. Next, we discuss the implications of integrating inherently explainable tree-based models into distributed learning settings. Finally, we investigate how CL strategies can be effectively combined with FL to support adaptive, lifelong learning when limited-size buffers are used to store past data for retraining. Our approach offers a cohesive set of tools for designing privacy-preserving, adaptive, and trustworthy ML solutions tailored to the demands of edge computing and intelligent services.

Title: Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

Authors: Jörg Gamerdinger, Sven Teufel, Oliver Bringmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08540
Pdf URL: https://arxiv.org/pdf/2504.08540
Copy Paste: [[2504.08540]] Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review(https://arxiv.org/abs/2504.08540)
Keywords: robust
Abstract: Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation in a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of over 30 publicly available lane detection datasets, systematically analysing their characteristics, advantages and limitations. We classify these datasets based on key factors such as sensor resolution, annotation types and diversity of road and weather conditions. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This survey serves as a resource for researchers seeking appropriate datasets for lane detection, and contributes to the broader goal of advancing autonomous driving.

Title: Discriminator-Free Direct Preference Optimization for Video Diffusion

Authors: Haoran Cheng, Qide Dong, Liang Peng, Zhizhou Sha, Weiguo Feng, Jinghui Xie, Zhao Song, Shilei Wen, Xiaofei He, Boxi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08542
Pdf URL: https://arxiv.org/pdf/2504.08542
Copy Paste: [[2504.08542]] Discriminator-Free Direct Preference Optimization for Video Diffusion(https://arxiv.org/abs/2504.08542)
Keywords: diffusion
Abstract: Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts like flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
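
The pair construction described in this abstract (real clip as the win case, an edited copy as the lose case) can be illustrated with a minimal sketch. This is not the authors' code: the function names `make_lose_clip` and `build_preference_pairs`, the `noise_std` value, and the assumption that clips are float arrays in [0, 1] with shape (frames, height, width, channels) are all hypothetical choices made for illustration.

```python
import numpy as np

def make_lose_clip(clip, mode, rng, noise_std=0.1):
    """Return an edited ("lose") copy of a clip shaped (frames, H, W, C).

    Assumes pixel values are floats in [0, 1]; noise_std is an arbitrary
    illustrative value, not one taken from the paper.
    """
    if mode == "reverse":
        # Play the frames backwards.
        return clip[::-1].copy()
    if mode == "shuffle":
        # Randomly permute the frame order.
        order = rng.permutation(clip.shape[0])
        return clip[order]
    if mode == "noise":
        # Add Gaussian noise and clamp back to the valid pixel range.
        noisy = clip + rng.normal(0.0, noise_std, size=clip.shape)
        return np.clip(noisy, 0.0, 1.0)
    raise ValueError(f"unknown edit mode: {mode}")

def build_preference_pairs(clips, rng, modes=("reverse", "shuffle", "noise")):
    """Pair each real clip (win) with one randomly edited copy (lose)."""
    pairs = []
    for clip in clips:
        mode = modes[rng.integers(len(modes))]
        pairs.append((clip, make_lose_clip(clip, mode, rng), mode))
    return pairs
```

Each returned tuple holds the win clip, the lose clip, and the edit that was applied; how such pairs enter the DPO objective is specific to the paper and is not sketched here.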

Title: UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection

Authors: Frances Laureano De Leon, Yixiao Wang, Yue Feng, Mark G. Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08543
Pdf URL: https://arxiv.org/pdf/2504.08543
Copy Paste: [[2504.08543]] UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection(https://arxiv.org/abs/2504.08543)
Keywords: large language model
Abstract: Emotion detection in natural language processing is a challenging task due to the complexity of human emotions and linguistic diversity. While significant progress has been made in high-resource languages, emotion detection in low-resource languages remains underexplored. In this work, we address multilingual and cross-lingual emotion detection by leveraging adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce a small number of trainable parameters while keeping the pre-trained model weights fixed, offering a parameter-efficient approach to adaptation. We experiment with different adapter tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Our results show that target-language-ready task adapters achieve the best overall performance, particularly for low-resource African languages with our team ranking 7th for Tigrinya, and 8th for Kinyarwanda in Track A. In Track C, our system ranked 3rd for Amharic, and 4th for Oromo, Tigrinya, Kinyarwanda, Hausa, and Igbo. Our approach outperforms large language models in 11 languages and matches their performance in four others, despite our models having significantly fewer parameters. Furthermore, we find that adapter-based models retain cross-linguistic transfer capabilities while requiring fewer computational resources compared to full fine-tuning for each language.

Title: Slicing the Gaussian Mixture Wasserstein Distance

Authors: Moritz Piening, Robert Beinert
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08544
Pdf URL: https://arxiv.org/pdf/2504.08544
Copy Paste: [[2504.08544]] Slicing the Gaussian Mixture Wasserstein Distance(https://arxiv.org/abs/2504.08544)
Keywords: generative
Abstract: Gaussian mixture models (GMMs) are widely used in machine learning for tasks such as clustering, classification, image reconstruction, and generative modeling. A key challenge in working with GMMs is defining a computationally efficient and geometrically meaningful metric. The mixture Wasserstein (MW) distance adapts the Wasserstein metric to GMMs and has been applied in various domains, including domain adaptation, dataset comparison, and reinforcement learning. However, its high computational cost -- arising from repeated Wasserstein distance computations involving matrix square root estimations and an expensive linear program -- limits its scalability to high-dimensional and large-scale problems. To address this, we propose multiple novel slicing-based approximations to the MW distance that significantly reduce computational complexity while preserving key optimal transport properties. From a theoretical viewpoint, we establish several weak and strong equivalences between the introduced metrics, and show the relations to the original MW distance and the well-established sliced Wasserstein distance. Furthermore, we validate the effectiveness of our approach through numerical experiments, demonstrating computational efficiency and applications in clustering, perceptual image comparison, and GMM minimization.
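
For readers unfamiliar with the slicing idea this abstract builds on, the following is a minimal sketch of the standard sliced Wasserstein distance between two equal-size point clouds, estimated by Monte Carlo over random directions. It is not the paper's sliced MW approximation for GMMs; the function name `sliced_wasserstein` and the defaults for `n_projections`, `p`, and `seed` are assumptions made for illustration.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x and y are point clouds of identical shape (n, d) with uniform weights.
    """
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape (n, d)")
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions uniformly on the unit sphere.
    directions = rng.normal(size=(n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project onto each direction; in 1D, optimal transport between
    # equal-size samples matches the sorted values.
    x_proj = np.sort(x @ directions.T, axis=0)
    y_proj = np.sort(y @ directions.T, axis=0)
    # Average |difference|^p over samples and directions, then take the root.
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))
```

Each projection reduces the problem to a one-dimensional transport that costs only a sort, which is the source of the computational savings that slicing-based methods rely on.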

Title: Shadow Erosion and Nighttime Adaptability for Camera-Based Automated Driving Applications

Authors: Mohamed Sabry, Gregory Schroeder, Joshua Varughese, Cristina Olaverri-Monreal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08551
Pdf URL: https://arxiv.org/pdf/2504.08551
Copy Paste: [[2504.08551]] Shadow Erosion and Nighttime Adaptability for Camera-Based Automated Driving Applications(https://arxiv.org/abs/2504.08551)
Keywords: segmentation
Abstract: Enhancement of images from RGB cameras is of particular interest due to its wide range of ever-increasing applications such as medical imaging, satellite imaging, automated driving, etc. In autonomous driving, various techniques are used to enhance image quality under challenging lighting conditions. These include artificial augmentation to improve visibility in poor nighttime conditions, illumination-invariant imaging to reduce the impact of lighting variations, and shadow mitigation to ensure consistent image clarity in bright daylight. This paper proposes a pipeline for Shadow Erosion and Nighttime Adaptability in images for automated driving applications while preserving color and texture details. The Shadow Erosion and Nighttime Adaptability pipeline is compared to the widely used CLAHE technique and evaluated based on illumination uniformity and visual perception quality metrics. The results also demonstrate a significant improvement over CLAHE, enhancing a YOLO-based drivable area segmentation algorithm.

Title: Banana Ripeness Level Classification using a Simple CNN Model Trained with Real and Synthetic Datasets

Authors: Luis Chuquimarca, Boris Vintimilla, Sergio Velastin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08568
Pdf URL: https://arxiv.org/pdf/2504.08568
Copy Paste: [[2504.08568]] Banana Ripeness Level Classification using a Simple CNN Model Trained with Real and Synthetic Datasets(https://arxiv.org/abs/2504.08568)
Keywords: robust
Abstract: The level of ripeness is essential in determining the quality of bananas. To correctly estimate banana maturity, the metrics of international marketing standards need to be considered. However, the process of assessing the maturity of bananas at an industrial level is still carried out using manual methods. CNN models are an attractive tool for solving this problem, but the availability of sufficient data to train these models reliably is limited. On the other hand, state-of-the-art work with existing CNN models and the available data has reported acceptable accuracy in identifying banana maturity. For this reason, this work presents the generation of a robust dataset that combines real and synthetic data for different levels of banana ripeness. In addition, it proposes a simple CNN architecture that is trained with synthetic data and then improved through transfer learning to classify real data, managing to determine the level of maturity of the banana. The proposed CNN model is evaluated with several architectures, then hyper-parameter configurations are varied, and optimizers are used. The results show that the proposed CNN model reaches a high accuracy of 0.917 and a fast execution time.

Title: Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities

Authors: Maria Santos-Villafranca, Dustin Carrión-Ojeda, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08578
Pdf URL: https://arxiv.org/pdf/2504.08578
Copy Paste: [[2504.08578]] Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities(https://arxiv.org/abs/2504.08578)
Keywords: robust
Abstract: Action recognition is an essential task in egocentric vision due to its wide range of applications across many fields. While deep learning methods have been proposed to address this task, most rely on a single modality, typically video. However, including additional modalities may improve the robustness of the approaches to common issues in egocentric videos, such as blurriness and occlusions. Recent efforts in multimodal egocentric action recognition often assume the availability of all modalities, leading to failures or performance drops when any modality is missing. To address this, we introduce an efficient multimodal knowledge distillation approach for egocentric action recognition that is robust to missing modalities (KARMMA) while still benefiting when multiple modalities are available. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model, which distills knowledge into a much smaller and faster student model. Experiments on the Epic-Kitchens and Something-Something datasets demonstrate that our student model effectively handles missing modalities while reducing its accuracy drop in this scenario.

Title: Boosting multi-demographic federated learning for chest x-ray analysis using general-purpose self-supervised representations

Authors: Mahshad Lotfinia, Arash Tayebiarasteh, Samaneh Samiei, Mehdi Joodaki, Soroosh Tayebi Arasteh
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08584
Pdf URL: https://arxiv.org/pdf/2504.08584
Copy Paste: [[2504.08584]] Boosting multi-demographic federated learning for chest x-ray analysis using general-purpose self-supervised representations(https://arxiv.org/abs/2504.08584)
Keywords: privacy, federate, transformer
Abstract: Reliable artificial intelligence (AI) models for medical image analysis often depend on large and diverse labeled datasets. Federated learning (FL) offers a decentralized and privacy-preserving approach to training but struggles in highly non-independent and identically distributed (non-IID) settings, where institutions with more representative data may experience degraded performance. Moreover, existing large-scale FL studies have been limited to adult datasets, neglecting the unique challenges posed by pediatric data, which introduces additional non-IID variability. To address these limitations, we analyzed n=398,523 adult chest radiographs from diverse institutions across multiple countries and n=9,125 pediatric images, leveraging transfer learning from general-purpose self-supervised image representations to classify pneumonia and cases with no abnormality. Using state-of-the-art vision transformers, we found that FL improved performance only for smaller adult datasets (P<0.001) but degraded performance for larger datasets (P<0.064) and pediatric cases (P=0.242). However, equipping FL with self-supervised weights significantly enhanced outcomes across pediatric cases (P=0.031) and most adult datasets (P<0.008), except the largest dataset (P=0.052). These findings underscore the potential of easily deployable general-purpose self-supervised image representations to address non-IID challenges in clinical FL applications and highlight their promise for enhancing patient outcomes and advancing pediatric healthcare, where data scarcity and variability remain persistent obstacles.

Title: Playpen: An Environment for Exploring Learning Through Conversational Interaction

Authors: Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi, Raquel Fernández, Alexander Koller, Oliver Lemon, David Schlangen, Mario Giulianelli, Alessandro Suglia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08590
Pdf URL: https://arxiv.org/pdf/2504.08590
Copy Paste: [[2504.08590]] Playpen: An Environment for Exploring Learning Through Conversational Interaction(https://arxiv.org/abs/2504.08590)
Keywords: large language model
Abstract: Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for "alignment" (with a reward model judging the quality of instruction following attempts) and for improving "reasoning" (process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data (with the help of a Large Language Model as counterpart to the learner model), both offline and online. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO, and GRPO; showing that all of these approaches achieve some improvements in in-domain games, but only GRPO demonstrates the ability to generalise to out-of-domain games as well as retain competitive performance in reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.

Title: ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration

Authors: Yongsheng Yu, Haitian Zheng, Zhifei Zhang, Jianming Zhang, Yuqian Zhou, Connelly Barnes, Yuchen Liu, Wei Xiong, Zhe Lin, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08591
Pdf URL: https://arxiv.org/pdf/2504.08591
Copy Paste: [[2504.08591]] ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration(https://arxiv.org/abs/2504.08591)
Keywords: diffusion, transformer, generative
Abstract: Recent progress in generative models has significantly improved image restoration capabilities, particularly through powerful diffusion models that offer remarkable recovery of semantic details and local fidelity. However, deploying these models at ultra-high resolutions faces a critical trade-off between quality and efficiency due to the computational demands of long-range attention mechanisms. To address this, we introduce ZipIR, a novel framework that enhances efficiency, scalability, and long-range modeling for high-res image restoration. ZipIR employs a highly compressed latent representation that compresses images by 32x, effectively reducing the number of spatial tokens and enabling the use of high-capacity models like the Diffusion Transformer (DiT). Toward this goal, we propose a Latent Pyramid VAE (LP-VAE) design that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.
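To make the token-count argument concrete, here is a rough arithmetic sketch in Python. It rests on assumptions that are not stated in the abstract: that "32x" refers to downsampling along each spatial side, that "2K" means a 2048x2048 image, that each latent position becomes one token, and that the comparison point is the 8x-per-side VAE common in latent diffusion models. The function name `latent_tokens` is hypothetical.

```python
def latent_tokens(height: int, width: int, downsample: int, patch: int = 1) -> int:
    """Number of spatial tokens after a VAE downsamples each side by
    `downsample` and the transformer groups `patch` x `patch` latents per token.
    Illustrative only; not taken from the ZipIR paper."""
    side = downsample * patch
    if height % side or width % side:
        raise ValueError("image size must be divisible by downsample * patch")
    return (height // side) * (width // side)


# Assumed setting: a 2048x2048 image, 8x baseline VAE vs. 32x compression.
tokens_8x = latent_tokens(2048, 2048, downsample=8)    # 256 * 256
tokens_32x = latent_tokens(2048, 2048, downsample=32)  # 64 * 64

token_ratio = tokens_8x // tokens_32x
# Full self-attention compares every token with every other token,
# so its cost shrinks roughly with the square of the token ratio.
attention_ratio = token_ratio ** 2

print(tokens_8x, tokens_32x, token_ratio, attention_ratio)
```

Under these assumptions the 32x latent carries 16 times fewer tokens than the 8x one, which is what makes full-image attention at this resolution more affordable.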

Title: Hands-On: Segmenting Individual Signs from Continuous Sequences

Authors: Low Jian He, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08593
Pdf URL: https://arxiv.org/pdf/2504.08593
Copy Paste: [[2504.08593]] Hands-On: Segmenting Individual Signs from Continuous Sequences(https://arxiv.org/abs/2504.08593)
Keywords: transformer, segmentation
Abstract: This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
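Framing segmentation as sequence labeling means every video frame gets one tag. The sketch below shows one plausible way to convert sign boundaries to per-frame BIO tags and back; it is not the authors' code. The half-open `(start, end)` frame convention and the function names `segments_to_bio` and `bio_to_segments` are assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end exclusive (assumed)


def segments_to_bio(num_frames: int, segments: List[Segment]) -> List[str]:
    """Label each frame 'B' (first frame of a sign), 'I' (inside a sign)
    or 'O' (outside any sign)."""
    tags = ["O"] * num_frames
    for start, end in segments:
        if not 0 <= start < end <= num_frames:
            raise ValueError(f"invalid segment {(start, end)}")
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Recover (start, end) segments from a per-frame BIO tag sequence."""
    segments: List[Segment] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new sign begins right after another
                segments.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                segments.append((start, i))
                start = None
        elif tag == "I" and start is None:
            start = i                  # tolerate an 'I' with no preceding 'B'
    if start is not None:
        segments.append((start, len(tags)))
    return segments


tags = segments_to_bio(8, [(1, 4), (4, 6)])
print(tags)
print(bio_to_segments(tags))
```

The explicit "B" tag is what lets two back-to-back signs be separated even when no "O" frame lies between them.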

Title: On Background Bias of Post-Hoc Concept Embeddings in Computer Vision DNNs

Authors: Gesina Schwalbe, Georgii Mikriukov, Edgar Heinert, Stavros Gerolymatos, Mert Keser, Alois Knoll, Matthias Rottmann, Annika Mütze
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08602
Pdf URL: https://arxiv.org/pdf/2504.08602
Copy Paste: [[2504.08602]] On Background Bias of Post-Hoc Concept Embeddings in Computer Vision DNNs(https://arxiv.org/abs/2504.08602)
Keywords: robust, segmentation
Abstract: The thriving research field of concept-based explainable artificial intelligence (C-XAI) investigates how human-interpretable semantic concepts embed in the latent spaces of deep neural networks (DNNs). Post-hoc approaches therein use a set of examples to specify a concept, and determine its embeddings in DNN latent space using data driven techniques. This proved useful to uncover biases between different target (foreground or concept) classes. However, given that the background is mostly uncontrolled during training, an important question has been left unattended so far: Are/to what extent are state-of-the-art, data-driven post-hoc C-XAI approaches themselves prone to biases with respect to their backgrounds? E.g., wild animals mostly occur against vegetation backgrounds, and they seldom appear on roads. Even simple and robust C-XAI methods might abuse this shortcut for enhanced performance. A dangerous performance degradation of the concept-corner cases of animals on the road could thus remain undiscovered. This work validates and thoroughly confirms that established Net2Vec-based concept segmentation techniques frequently capture background biases, including alarming ones, such as underperformance on road scenes. For the analysis, we compare 3 established techniques from the domain of background randomization on >50 concepts from 2 datasets, and 7 diverse DNN architectures. Our results indicate that even low-cost setups can provide both valuable insight and improved background robustness.

Title: A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

Authors: Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08609
Pdf URL: https://arxiv.org/pdf/2504.08609
Copy Paste: [[2504.08609]] A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English(https://arxiv.org/abs/2504.08609)
Keywords: transformer
Abstract: The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute with a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representation from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

Title: Enhancing knowledge retention for continual learning with domain-specific adapters and features gating

Authors: Mohamed Abbas Hedjazi, Oussama Hadjerci, Adel Hafiane
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.08613
Pdf URL: https://arxiv.org/pdf/2504.08613
Copy Paste: [[2504.08613]] Enhancing knowledge retention for continual learning with domain-specific adapters and features gating(https://arxiv.org/abs/2504.08613)
Keywords: transformer
Abstract: Continual learning empowers models to learn from a continuous stream of data while preserving previously acquired knowledge, effectively addressing the challenge of catastrophic forgetting. In this study, we propose a new approach that integrates adapters within the self-attention mechanisms of Vision Transformers to enhance knowledge retention when sequentially adding datasets from different domains. Unlike previous methods that continue learning with only one dataset, our approach introduces domain-specific output heads and feature gating, allowing the model to maintain high accuracy on previously learned tasks while incorporating only the essential information from multiple domains. The proposed method is compared to prominent parameter-efficient fine-tuning methods in the current state of the art. The results provide evidence that our method effectively alleviates the limitations of previous works. Furthermore, we conduct a comparative analysis using three datasets, CIFAR-100, Flowers102, and DTD, each representing a distinct domain, to investigate the impact of task order on model performance. Our findings underscore the critical role of dataset sequencing in shaping learning outcomes, demonstrating that strategic ordering can significantly improve the model's ability to adapt to evolving data distributions over time while preserving the integrity of previously learned knowledge.

Title: Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition

Authors: Lei Kang, Xuanshuo Fu, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08616
Pdf URL: https://arxiv.org/pdf/2504.08616
Copy Paste: [[2504.08616]] Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition(https://arxiv.org/abs/2504.08616)
Keywords: privacy, attack, membership infer, transformer
Abstract: Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the "right to be forgotten" underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.

Title: A Hybrid Chaos-Based Cryptographic Framework for Post-Quantum Secure Communications

Authors: Kevin Song, Noorullah Imran, Jake Y. Chen, Allan C. Dobbins
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.08618
Pdf URL: https://arxiv.org/pdf/2504.08618
Copy Paste: [[2504.08618]] A Hybrid Chaos-Based Cryptographic Framework for Post-Quantum Secure Communications(https://arxiv.org/abs/2504.08618)
Keywords: secure, attack, robust
Abstract: We present CryptoChaos, a novel hybrid cryptographic framework that synergizes deterministic chaos theory with cutting-edge cryptographic primitives to achieve robust, post-quantum resilient encryption. CryptoChaos harnesses the intrinsic unpredictability of four discrete chaotic maps (Logistic, Chebyshev, Tent, and Henon) to generate a high-entropy, multidimensional key from a unified entropy pool. This key is derived through a layered process that combines SHA3-256 hashing with an ephemeral X25519 Diffie-Hellman key exchange and is refined using an HMAC-based key derivation function (HKDF). The resulting encryption key powers AES-GCM, providing both confidentiality and integrity. Comprehensive benchmarking against established symmetric ciphers confirms that CryptoChaos attains near-maximal Shannon entropy (approximately 8 bits per byte) and exhibits negligible adjacent-byte correlations, while robust performance on the NIST SP 800-22 test suite underscores its statistical rigor. Moreover, quantum simulations demonstrate that the additional complexity inherent in chaotic key generation dramatically elevates the resource requirements for Grover-based quantum attacks, with an estimated T gate count of approximately 2.1 x 10^9. The modular and interoperable design of CryptoChaos positions it as a promising candidate for high-assurance applications, ranging from secure communications and financial transactions to IoT systems, paving the way for next-generation post-quantum encryption standards.
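The two statistical checks named in the abstract, Shannon entropy per byte and adjacent-byte correlation, can be sketched with the Python standard library. This is a generic illustration of those measures, not the CryptoChaos benchmark code; the function names are hypothetical, and `os.urandom` stands in for ciphertext purely as an assumption.

```python
import math
import os
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy of the byte distribution, in bits per byte.
    The maximum is 8.0, reached when all 256 byte values are equally frequent."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def adjacent_byte_correlation(data: bytes) -> float:
    """Pearson correlation between each byte and the byte that follows it.
    Values near 0 indicate no linear dependence between neighbours."""
    if len(data) < 2:
        return 0.0
    xs, ys = data[:-1], data[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant data; report 0
    return cov / math.sqrt(var_x * var_y)


# Stand-in for ciphertext (assumption): bytes from the OS random source.
sample = os.urandom(1 << 16)
print(shannon_entropy_bits_per_byte(sample))
print(adjacent_byte_correlation(sample))
```

High entropy and low correlation are necessary but not sufficient signs of a sound cipher: any output that merely looks uniform scores well on both.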

Title: Efficient Mixture of Geographical Species for On Device Wildlife Monitoring

Authors: Emmanuel Azuh Mensah, Joban Mand, Yueheng Ou, Min Jang, Kurtis Heimerl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08620
Pdf URL: https://arxiv.org/pdf/2504.08620
Copy Paste: [[2504.08620]] Efficient Mixture of Geographical Species for On Device Wildlife Monitoring(https://arxiv.org/abs/2504.08620)
Keywords: transformer
Abstract: Efficient on-device models have become attractive for near-sensor insight generation, of particular interest to the ecological conservation community. For this reason, deep learning researchers are proposing more approaches to develop lower compute models. However, since vision transformers are very new to the edge use case, there are still unexplored approaches, most notably conditional execution of subnetworks based on input data. In this work, we explore the training of a single species detector which uses conditional computation to bias structured sub networks in a geographically-aware manner. We propose a method for pruning the expert model per location and demonstrate conditional computation performance on two geographically distributed datasets: iNaturalist and iWildcam.

Title: Enterprise-Grade Security for the Model Context Protocol (MCP): Frameworks and Mitigation Strategies

Authors: Vineeth Sai Narajala, Idan Habler
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08623
Pdf URL: https://arxiv.org/pdf/2504.08623
Copy Paste: [[2504.08623]] Enterprise-Grade Security for the Model Context Protocol (MCP): Frameworks and Mitigation Strategies(https://arxiv.org/abs/2504.08623)
Keywords: secure, security, attack
Abstract: The Model Context Protocol (MCP), introduced by Anthropic, provides a standardized framework for artificial intelligence (AI) systems to interact with external data sources and tools in real-time. While MCP offers significant advantages for AI integration and capability extension, it introduces novel security challenges that demand rigorous analysis and mitigation. This paper builds upon foundational research into MCP architecture and preliminary security assessments to deliver enterprise-grade mitigation frameworks and detailed technical implementation strategies. Through systematic threat modeling and analysis of MCP implementations and potential attack vectors, including sophisticated threats like tool poisoning, we present actionable security patterns tailored for MCP implementers and adopters. The primary contribution of this research lies in translating theoretical security concerns into a practical, implementable framework with actionable controls, thereby providing essential guidance for the secure enterprise adoption and governance of integrated AI systems.

Title: Deep Learning Methods for Detecting Thermal Runaway Events in Battery Production Lines

Authors: Athanasios Athanasopoulos, Matúš Mihalák, Marcin Pietrasik
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08632
Pdf URL: https://arxiv.org/pdf/2504.08632
Copy Paste: [[2504.08632]] Deep Learning Methods for Detecting Thermal Runaway Events in Battery Production Lines(https://arxiv.org/abs/2504.08632)
Keywords: explainability, transformer
Abstract: One of the key safety considerations of battery manufacturing is thermal runaway, the uncontrolled increase in temperature which can lead to fires, explosions, and emissions of toxic gasses. As such, development of automated systems capable of detecting such events is of considerable importance in both academic and industrial contexts. In this work, we investigate the use of deep learning for detecting thermal runaway in the battery production line of VDL Nedcar, a Dutch automobile manufacturer. Specifically, we collect data from the production line to represent both baseline (non thermal runaway) and thermal runaway conditions. Thermal runaway was simulated through the use of external heat and smoke sources. The data consisted of both optical and thermal images which were then preprocessed and fused before serving as input to our models. In this regard, we evaluated three deep-learning models widely used in computer vision including shallow convolutional neural networks, residual neural networks, and vision transformers on two performance metrics. Furthermore, we evaluated these models using explainability methods to gain insight into their ability to capture the relevant feature information from their inputs. The obtained results indicate that the use of deep learning is a viable approach to thermal runaway detection in battery production lines.

Title: Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging

Authors: Gabriele Lozupone, Alessandro Bria, Francesco Fontanella, Frederick J.A. Meijer, Claudio De Stefano, Henkjan Huisman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08635
Pdf URL: https://arxiv.org/pdf/2504.08635
Copy Paste: [[2504.08635]] Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging(https://arxiv.org/abs/2504.08635)
Keywords: robust, diffusion
Abstract: This study presents Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer disease (AD) using brain MR from the ADNI database as a case study. Unlike conventional diffusion autoencoders operating in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable. To validate the proposed approach, we explore two key hypotheses: (i) LDAE effectively captures meaningful semantic representations on 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient. Experimental results support both hypotheses: (i) linear-probe evaluations demonstrate promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years); (ii) the learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications; (iii) semantic interpolation experiments show strong reconstruction of missing scans, with SSIM of 0.969 (MSE: 0.0019) for a 6-month gap. Even for longer gaps (24 months), the model maintains robust performance (SSIM > 0.93, MSE < 0.004), indicating an ability to capture temporal progression trends; (iv) compared to conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also enhancing reconstruction quality. These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis. Code available at this https URL

Title: Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Authors: Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.08641
Pdf URL: https://arxiv.org/pdf/2504.08641
Copy Paste: [[2504.08641]] Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization(https://arxiv.org/abs/2504.08641)
Keywords: diffusion, segmentation
Abstract: Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

Title: Title block detection and information extraction for enhanced building drawings search

Authors: Alessio Lombardi (1), Li Duan (2), Ahmed Elnagar (1), Ahmed Zaalouk (2), Khalid Ismail (2), Edlira Vakaj (2) ((1) Buro Happold, London (UK), (2) Birmingham City University (UK))
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08645
Pdf URL: https://arxiv.org/pdf/2504.08645
Copy Paste: [[2504.08645]] Title block detection and information extraction for enhanced building drawings search(https://arxiv.org/abs/2504.08645)
Keywords: extraction
Abstract: The architecture, engineering, and construction (AEC) industry still heavily relies on information stored in drawings for building construction, maintenance, compliance and error checks. However, information extraction (IE) from building drawings is often time-consuming and costly, especially when dealing with historical buildings. Drawing search can be simplified by leveraging the information stored in the title block portion of the drawing, which can be seen as drawing metadata. However, title block IE can be complex, especially when dealing with historical drawings which do not follow existing standards for uniformity. This work performs a comparison of existing methods for this kind of IE task, and then proposes a novel title block detection and IE pipeline which outperforms existing methods, in particular when dealing with complex, noisy historical drawings. The pipeline combines a lightweight Convolutional Neural Network and GPT-4o: it detects building engineering title blocks with high accuracy and then extracts structured drawing metadata from the title blocks, which can be used for drawing search, filtering and grouping. The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings. A user interface (UI) that leverages the extracted metadata for drawing search is established and deployed on real projects, which demonstrates significant time savings. Additionally, an extensible domain-expert-annotated dataset for title block detection is developed, via an efficient AEC-friendly annotation workflow that lays the foundation for future work.

Title: MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction

Authors: Ian Noronha, Advait Prasad Jawaji, Juan Camilo Soto, Jiajun An, Yan Gu, Upinder Kaur
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.08646
Pdf URL: https://arxiv.org/pdf/2504.08646
Copy Paste: [[2504.08646]] MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction(https://arxiv.org/abs/2504.08646)
Keywords: robust
Abstract: Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at this https URL, inviting further exploration and development in this critical area.

Title: The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation

Authors: Masashi Hatano, Zhifan Zhu, Hideo Saito, Dima Damen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08654
Pdf URL: https://arxiv.org/pdf/2504.08654
Copy Paste: [[2504.08654]] The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation(https://arxiv.org/abs/2504.08654)
Keywords: diffusion, transformer
Abstract: Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when they are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes as input the observation sequence and camera poses, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints along with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in-view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences. EgoH4 improves the performance by 3.4cm and 5.1cm over the baseline in terms of ADE for hand trajectory forecasting and MPJPE for hand pose forecasting, respectively. Project page: this https URL
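Note: the abstract reports ADE and MPJPE without defining them. The sketch below uses the conventional definitions (mean Euclidean distance over time steps for ADE, and over frames and joints for MPJPE). The function names and input layout are illustrative assumptions; the paper's exact evaluation protocol (units, alignment, handling of out-of-view hands) may differ.

```python
import math

def ade(pred, gt):
    """Average displacement error: mean Euclidean distance between
    predicted and ground-truth 3D positions, averaged over time steps.
    `pred` and `gt` are equal-length sequences of (x, y, z) tuples."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance between
    predicted and ground-truth joints, averaged over frames and joints.
    `pred` and `gt` are sequences of frames; each frame is a list of
    (x, y, z) joint positions."""
    errors = [
        math.dist(pj, gj)
        for pred_frame, gt_frame in zip(pred, gt)
        for pj, gj in zip(pred_frame, gt_frame)
    ]
    return sum(errors) / len(errors)
```

Both functions return a value in the same unit as the inputs, so positions given in centimetres yield an error in centimetres.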

Title: Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08672
Pdf URL: https://arxiv.org/pdf/2504.08672
Copy Paste: [[2504.08672]] Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning(https://arxiv.org/abs/2504.08672)
Keywords: robust
Abstract: Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at this https URL.

Title: Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08685
Pdf URL: https://arxiv.org/pdf/2504.08685
Copy Paste: [[2504.08685]] Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model(https://arxiv.org/abs/2504.08685)
Keywords: diffusion
Abstract: This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, that of larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at this https URL

Title: Fast-Slow-Thinking: Complex Task Solving with Large Language Models

Authors: Yiliu Sun, Yanfang Zhang, Zicheng Zhao, Sheng Wan, Dacheng Tao, Chen Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08690
Pdf URL: https://arxiv.org/pdf/2504.08690
Copy Paste: [[2504.08690]] Fast-Slow-Thinking: Complex Task Solving with Large Language Models(https://arxiv.org/abs/2504.08690)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly employed to solve complex tasks. Task decomposition has become an effective approach to this challenge: it divides a complex task into multiple simpler subtasks and solves them separately, reducing the difficulty of the original task. However, the performance of existing task decomposition methods can be suboptimal when the task contains overly complex logic and constraints. In this situation, the solution generated by LLMs may deviate from the original purpose of the task, or contain redundant or even erroneous content. Therefore, inspired by the fact that humans possess two thinking systems, fast thinking and slow thinking, this paper introduces a new task decomposition method termed "Fast-Slow-Thinking" (FST), which stimulates LLMs to solve tasks through the cooperation of Fast Thinking (FT) and Slow Thinking (ST) steps. Here FT focuses more on the general and concise aspect of the task, and ST focuses more on the details of the task. In FT, LLMs are prompted to remove the constraints of the original task, thereby simplifying it to a general and concise one. In ST, we recall the constraints removed in FT, so that LLMs can improve the answer generated in FT to meet the requirements of the original task. Our FST method thus enables LLMs to consider a complex problem via a human-like cognition process from coarse to fine, the effectiveness of which has been well demonstrated by experiments on three types of tasks.

Title: TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

Authors: Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, Hao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08694
Pdf URL: https://arxiv.org/pdf/2504.08694
Copy Paste: [[2504.08694]] TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning(https://arxiv.org/abs/2504.08694)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability. This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grained annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves spatial efficiency and POI rationality of the travel plan, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose EvoRAG, an evolutionary framework that potently synergizes diverse retrieved trajectories with LLMs' intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violations compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.

Title: Large Language Models as Span Annotators

Authors: Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08697
Pdf URL: https://arxiv.org/pdf/2504.08697
Copy Paste: [[2504.08697]] Large Language Models as Span Annotators(https://arxiv.org/abs/2504.08697)
Keywords: large language model
Abstract: For high-quality texts, single-score metrics seldom provide actionable feedback. In contrast, span annotation - pointing out issues in the text by annotating their spans - can guide improvements and provide insights. Until recently, span annotation was limited to human annotators or fine-tuned encoder models. In this study, we automate span annotation with large language models (LLMs). We compare expert or skilled crowdworker annotators with open and proprietary LLMs on three tasks: data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts. In our experiments, we show that LLMs as span annotators are straightforward to implement and notably more cost-efficient than human annotators. The LLMs achieve moderate agreement with skilled human annotators, in some scenarios comparable to the average agreement among the annotators themselves. Qualitative analysis shows that reasoning models outperform their instruction-tuned counterparts and provide more valid explanations for annotations. We release the dataset of more than 40k model and human annotations for further research.

Title: Hypergraph Vision Transformers: Images are More than Nodes, More than Edges

Authors: Joshua Fixelle
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08710
Pdf URL: https://arxiv.org/pdf/2504.08710
Copy Paste: [[2504.08710]] Hypergraph Vision Transformers: Images are More than Nodes, More than Edges(https://arxiv.org/abs/2504.08710)
Keywords: extraction, transformer
Abstract: Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves strong performance on image classification and retrieval, positioning it as an efficient framework for semantic-based vision tasks.

Title: Beyond Black-Box Predictions: Identifying Marginal Feature Effects in Tabular Transformer Networks

Authors: Anton Thielmann, Arik Reuter, Benjamin Saefken
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08712
Pdf URL: https://arxiv.org/pdf/2504.08712
Copy Paste: [[2504.08712]] Beyond Black-Box Predictions: Identifying Marginal Feature Effects in Tabular Transformer Networks(https://arxiv.org/abs/2504.08712)
Keywords: transformer
Abstract: In recent years, deep neural networks have showcased their predictive power across a variety of tasks. Beyond natural language processing, the transformer architecture has proven efficient in addressing tabular data problems and challenges the previously dominant gradient-based decision trees in these areas. However, this predictive power comes at the cost of intelligibility: Marginal feature effects are almost completely lost in the black-box nature of deep tabular transformer networks. Alternative architectures that use the additivity constraints of classical statistical regression models can maintain intelligible marginal feature effects, but often fall short in predictive power compared to their more complex counterparts. To bridge the gap between intelligibility and performance, we propose an adaptation of tabular transformer networks designed to identify marginal feature effects. We provide theoretical justifications that marginal feature effects can be accurately identified, and our ablation study demonstrates that the proposed model efficiently detects these effects, even amidst complex feature interactions. To demonstrate the model's predictive capabilities, we compare it to several interpretable as well as black-box models and find that it can match black-box performances while maintaining intelligibility. The source code is available at this https URL.

Title: Generating Fine Details of Entity Interactions

Authors: Xinyi Gu, Jiayuan Mao
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08714
Pdf URL: https://arxiv.org/pdf/2504.08714
Copy Paste: [[2504.08714]] Generating Fine Details of Entity Interactions(https://arxiv.org/abs/2504.08714)
Keywords: diffusion
Abstract: Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful and high-fidelity images involving multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions. This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions. To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process during refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Our dataset and code are available at this https URL to facilitate future exploration of interaction-rich image generation.

Title: ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08716
Pdf URL: https://arxiv.org/pdf/2504.08716
Copy Paste: [[2504.08716]] ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance(https://arxiv.org/abs/2504.08716)
Keywords: transformer
Abstract: Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being faster training and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

Title: EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage

Authors: Haohang Jian, Jinlu Zhang, Junyi Wu, Zhigang Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.08718
Pdf URL: https://arxiv.org/pdf/2504.08718
Copy Paste: [[2504.08718]] EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage(https://arxiv.org/abs/2504.08718)
Keywords: transformer
Abstract: Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from quadratic complexity in self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time compared to state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.

Title: SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

Authors: Krishna C. Puvvada, Faisal Ladhak, Santiago Akle Serrano, Cheng-Ping Hsieh, Shantanu Acharya, Somshubra Majumdar, Fei Jia, Samuel Kriman, Simeng Sun, Dima Rekesh, Boris Ginsburg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08719
Pdf URL: https://arxiv.org/pdf/2504.08719
Copy Paste: [[2504.08719]] SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling(https://arxiv.org/abs/2504.08719)
Keywords: robust, transformer
Abstract: We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer than the training length without the need for additional long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by a straightforward dynamic scaling of attention scores during inference. In addition, SWAN-GPT is more computationally efficient than standard GPT architectures, resulting in cheaper training and higher throughput. Further, we demonstrate that existing pre-trained decoder-only models can be efficiently converted to the SWAN architecture with minimal continued training, enabling longer contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
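Note: as an illustration of the sliding-window attention layers mentioned in the abstract, the sketch below builds a causal sliding-window mask in plain Python. The function name, the boolean-matrix representation, and the convention that the window includes the current token are assumptions for illustration; the abstract does not specify the window size or how the NoPE and SWA-RoPE layers are interleaved.

```python
def sliding_window_causal_mask(seq_len, window):
    """Return a seq_len x seq_len boolean matrix where entry [i][j] is True
    if query position i may attend to key position j.

    A position attends only to itself and the (window - 1) positions
    immediately before it, so attention is both causal and local.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Example: 4 tokens, window of 2 (the current token plus one predecessor).
mask = sliding_window_causal_mask(4, 2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With such a mask each token attends to at most `window` keys, so the attention cost of that layer grows linearly rather than quadratically with sequence length.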

Title: Steering CLIP's vision transformer with sparse autoencoders

Authors: Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, Blake Aaron Richards
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08729
Pdf URL: https://arxiv.org/pdf/2504.08729
Copy Paste: [[2504.08729]] Steering CLIP's vision transformer with sparse autoencoders(https://arxiv.org/abs/2504.08729)
Keywords: defense, attack, transformer
Abstract: While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis on the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.