2025-05-09

Title: How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks

Authors: Yusen Wu, Junwu Xiong, Xiaotie Deng
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.04628
Pdf URL: https://arxiv.org/pdf/2505.04628
Copy Paste: [[2505.04628]] How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks(https://arxiv.org/abs/2505.04628)
Keywords: large language model
Abstract: Expanding the application of large language models (LLMs) to societal life, instead of primary function only as auxiliary assistants to communicate with only one person at a time, necessitates LLMs' capabilities to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, currently the capability has not been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (we call it HSII below), designed to assess LLM's social capabilities in comprehensive social agents tasks and benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs within realistic social interaction scenarios dataset, HSII-Dataset. The dataset is derived step by step from news dataset. We perform an ablation study by doing clustering to the dataset. Additionally, we investigate the impact of chain of thought (COT) method on enhancing LLMs' social performance. Since COT cost more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COTs for specific social tasks and strike a better trade-off between measurement of correctness and efficiency. Various results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.

Title: MatMMFuse: Multi-Modal Fusion model for Material Property Prediction

Authors: Abhiroop Bhattacharya, Sylvain G. Cloutier
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2505.04634
Pdf URL: https://arxiv.org/pdf/2505.04634
Copy Paste: [[2505.04634]] MatMMFuse: Multi-Modal Fusion model for Material Property Prediction(https://arxiv.org/abs/2505.04634)
Keywords: large language model
Abstract: The recent progress of using graph based encoding of crystal structures for high throughput material property prediction has been quite successful. However, using a single modality model prevents us from exploiting the advantages of an enhanced features space by combining different representations. Specifically, pre-trained Large language models(LLMs) can encode a large amount of knowledge which is beneficial for training of models. Moreover, the graph encoder is able to learn the local features while the text encoder is able to learn global information such as space group and crystal symmetry. In this work, we propose Material Multi-Modal Fusion(MatMMFuse), a fusion based model which uses a multi-head attention mechanism for the combination of structure aware embedding from the Crystal Graph Convolution Network (CGCNN) and text embeddings from the SciBERT model. We train our model in an end-to-end framework using data from the Materials Project Dataset. We show that our proposed model shows an improvement compared to the vanilla CGCNN and SciBERT model for all four key properties: formation energy, band gap, energy above hull and fermi energy. Specifically, we observe an improvement of 40% compared to the vanilla CGCNN model and 68% compared to the SciBERT model for predicting the formation energy per atom. Importantly, we demonstrate the zero shot performance of the trained model on small curated datasets of Perovskites, Chalcogenides and the Jarvis Dataset. The results show that the proposed model exhibits better zero shot performance than the individual plain vanilla CGCNN and SciBERT model. This enables researchers to deploy the model for specialized industrial applications where collection of training data is prohibitively expensive.

Title: Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs

Authors: Dongxing Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04637
Pdf URL: https://arxiv.org/pdf/2505.04637
Copy Paste: [[2505.04637]] Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs(https://arxiv.org/abs/2505.04637)
Keywords: large language model
Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.

Title: Language translation, and change of accent for speech-to-speech task using diffusion model

Authors: Abhishek Mishra, Ritesh Sur Chowdhury, Vartul Bahuguna, Isha Pandey, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04639
Pdf URL: https://arxiv.org/pdf/2505.04639
Copy Paste: [[2505.04639]] Language translation, and change of accent for speech-to-speech task using diffusion model(https://arxiv.org/abs/2505.04639)
Keywords: diffusion, generative
Abstract: Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously - translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.

Title: Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation

Authors: Hannes Waldetoft, Jakob Torgander, Måns Magnusson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04643
Pdf URL: https://arxiv.org/pdf/2505.04643
Copy Paste: [[2505.04643]] Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation(https://arxiv.org/abs/2505.04643)
Keywords: transformer
Abstract: Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.

Title: ChatGPT for automated grading of short answer questions in mechanical ventilation

Authors: Tejas Jade, Alex Yartsev
Subjects: cs.CL, cs.LG, stat.CO
Abstract URL: https://arxiv.org/abs/2505.04645
Pdf URL: https://arxiv.org/pdf/2505.04645
Copy Paste: [[2505.04645]] ChatGPT for automated grading of short answer questions in mechanical ventilation(https://arxiv.org/abs/2505.04645)
Keywords: large language model
Abstract: Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses in ways aligning with applying SAQ grading rubrics, making them attractive for automated grading. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020--2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W, and Bland--Altman statistics. ChatGPT awarded systematically lower marks than human graders with a mean difference (bias) of -1.34 on a 10-point scale. ICC values indicated poor individual-level agreement (ICC1 = 0.086), and Cohen's kappa (-0.0786) suggested no meaningful agreement. Variance component analysis showed minimal variability among the five ChatGPT sessions (G-value = 0.87), indicating internal consistency but divergence from the human grader. The poorest agreement was observed for evaluative and analytic items, whereas checklist and prescriptive rubric items had less disagreement. We caution against the use of LLMs in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than acceptable boundaries for high-stakes assessments.

Title: FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

Authors: Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, Zhanpeng Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04649
Pdf URL: https://arxiv.org/pdf/2505.04649
Copy Paste: [[2505.04649]] FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights(https://arxiv.org/abs/2505.04649)
Keywords: robust, large language model
Abstract: The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME's effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrated our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.

Title: Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions

Authors: Adithya Kulkarni, Fatimah Alotaibi, Xinyue Zeng, Longfeng Wu, Tong Zeng, Barry Menglong Yao, Minqian Liu, Shuaicheng Zhang, Lifu Huang, Dawei Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04651
Pdf URL: https://arxiv.org/pdf/2505.04651
Copy Paste: [[2505.04651]] Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions(https://arxiv.org/abs/2505.04651)
Keywords: interpretability, generative, large language model
Abstract: Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of LLM-driven approaches, including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures. We examine techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning, highlighting trade-offs in interpretability, novelty, and domain alignment. We contrast early symbolic discovery systems (e.g., BACON, KEKADA) with modern LLM pipelines that leverage in-context learning and domain adaptation via fine-tuning, retrieval, and symbolic grounding. For validation, we review simulation, human-AI collaboration, causal modeling, and uncertainty quantification, emphasizing iterative assessment in open-world contexts. The survey maps datasets across biomedicine, materials science, environmental science, and social science, introducing new resources like AHTech and CSKG-600. Finally, we outline a roadmap emphasizing novelty-aware generation, multimodal-symbolic integration, human-in-the-loop systems, and ethical safeguards, positioning LLMs as agents for principled, scalable scientific discovery.

Title: Advancing Conversational Diagnostic AI with Multimodal Reasoning

Authors: Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G.T. Barrett, David Stutz, Nenad Tomasev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, Abdullah Ahmed, Elahe Vedadi, Kimberly Kanada, Cian Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Sara Mahdavi, James Manyika, Katherine Chou, Yossi Matias, Avinatan Hassidim, Dale R. Webster, Pushmeet Kohli, S.M. Ali Eslami, Joëlle Barral, Adam Rodman, Vivek Natarajan, Mike Schaekermann, Tao Tu, Alan Karthikesalingam, Ryutaro Tanno
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04653
Pdf URL: https://arxiv.org/pdf/2505.04653
Copy Paste: [[2505.04653]] Advancing Conversational Diagnostic AI with Multimodal Reasoning(https://arxiv.org/abs/2505.04653)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.

Title: A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient

Authors: Yehor Tereshchenko, Mika Hämäläinen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04654
Pdf URL: https://arxiv.org/pdf/2505.04654
Copy Paste: [[2505.04654]] A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient(https://arxiv.org/abs/2505.04654)
Keywords: robust, large language model
Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand new DeepSeek-V3(R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini) and Gemini (1.5 flash, 2.0 flash and 2.0 flash exp) and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).

Title: Integration of Large Language Models and Traditional Deep Learning for Social Determinants of Health Prediction

Authors: Paul Landes, Jimeng Sun, Adam Cross
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04655
Pdf URL: https://arxiv.org/pdf/2505.04655
Copy Paste: [[2505.04655]] Integration of Large Language Models and Traditional Deep Learning for Social Determinants of Health Prediction(https://arxiv.org/abs/2505.04655)
Keywords: large language model
Abstract: Social Determinants of Health (SDoH) are economic, social and personal circumstances that affect or influence an individual's health status. SDoHs have shown to be correlated to wellness outcomes, and therefore, are useful to physicians in diagnosing diseases and in decision-making. In this work, we automatically extract SDoHs from clinical text using traditional deep learning and Large Language Models (LLMs) to find the advantages and disadvantages of each on an existing publicly available dataset. Our models outperform a previous reference point on a multilabel SDoH classification by 10 points, and we present a method and model to drastically speed up classification (12X execution time) by eliminating expensive LLM processing. The method we present combines a more nimble and efficient solution that leverages the power of the LLM for precision and traditional deep learning methods for efficiency. We also show highly performant results on a dataset supplemented with synthetic data and several traditional deep learning models that outperform LLMs. Our models and methods offer the next iteration of automatic prediction of SDoHs that impact at-risk patients.

Title: AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection

Authors: Sana Alamgeer, Yasine Souissi, Anne H. H. Ngu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.04660
Pdf URL: https://arxiv.org/pdf/2505.04660
Copy Paste: [[2505.04660]] AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection(https://arxiv.org/abs/2505.04660)
Keywords: diffusion, large language model
Abstract: Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. To address this, we explore the potential of Large Language Models (LLMs) for generating synthetic fall data. This study evaluates text-to-motion (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini) in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance using a Long Short-Term Memory (LSTM) model. Additionally, we compare LLM-generated synthetic data with a diffusion-based method to evaluate their alignment with real accelerometer distributions. Results indicate that dataset characteristics significantly influence the effectiveness of synthetic data, with LLM-generated data performing best in low-frequency settings (e.g., 20Hz) while showing instability in high-frequency datasets (e.g., 200Hz). While text-to-motion models produce more realistic biomechanical data than text-to-text models, their impact on fall detection varies. Diffusion-based synthetic data demonstrates the closest alignment to real data but does not consistently enhance model performance. An ablation study further confirms that the effectiveness of synthetic data depends on sensor placement and fall representation. These findings provide insights into optimizing synthetic data generation for fall detection models.

Title: Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising

Authors: Haoyang Feng, Yanjun Dai, Yuan Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04665
Pdf URL: https://arxiv.org/pdf/2505.04665
Copy Paste: [[2505.04665]] Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising(https://arxiv.org/abs/2505.04665)
Keywords: security, privacy, protect, transformer, large language model
Abstract: Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, in actual operations, how advertising recommendation systems can be combined with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of Large Language Model (LLM), especially the self-attention mechanism based on the Transformer architecture, and how to enable the model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user portraits. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on a large language model of BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.

Title: Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes

Authors: Mohammad Aqib, Mohd Hamza, Qipei Mei, Ying Hei Chui
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04666
Pdf URL: https://arxiv.org/pdf/2505.04666
Copy Paste: [[2505.04666]] Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes(https://arxiv.org/abs/2505.04666)
Keywords: protect, robust, large language model
Abstract: Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. They are often extensive, complex, and subject to frequent updates, making manual querying challenging and time-consuming. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance. RAG consists of two components: a retriever and a language model. This study focuses on identifying a suitable retriever method for building codes and optimizing the generational capability of the language model using fine-tuning techniques. We conducted a detailed evaluation of various retrieval methods by performing the retrieval on the National Building Code of Canada (NBCC) and explored the impact of domain-specific fine-tuning on several language models using the dataset derived from NBCC. Our analysis included a comparative assessment of different retrievers and the performance of both pre-trained and fine-tuned models to determine the efficacy and domain-specific adaptation of language models using fine-tuning on the NBCC dataset. Experimental results showed that Elasticsearch proved to be the most robust retriever among all. The findings also indicate that fine-tuning language models on an NBCC-specific dataset can enhance their ability to generate contextually relevant responses. When combined with context retrieved by a powerful retriever like Elasticsearch, this improvement in LLM performance can optimize the RAG system, enabling it to better navigate the complexities of the NBCC.

Title: Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards

Authors: Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, Guoliang Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04671
Pdf URL: https://arxiv.org/pdf/2505.04671
Copy Paste: [[2505.04671]] Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards(https://arxiv.org/abs/2505.04671)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have significantly improved performance on the Text-to-SQL task by leveraging their powerful reasoning capabilities. To enhance accuracy during the reasoning process, external Process Reward Models (PRMs) can be introduced during training and inference to provide fine-grained supervision. However, if misused, PRMs may distort the reasoning trajectory and lead to suboptimal or incorrect SQL this http URL address this challenge, we propose Reward-SQL, a framework that systematically explores how to incorporate PRMs into the Text-to-SQL reasoning process effectively. Our approach follows a "cold start, then PRM supervision" paradigm. Specifically, we first train the model to decompose SQL queries into structured stepwise reasoning chains using common table expressions (Chain-of-CTEs), establishing a strong and interpretable reasoning baseline. Then, we investigate four strategies for integrating PRMs, and find that combining PRM as an online training signal (GRPO) with PRM-guided inference (e.g., best-of-N sampling) yields the best results. Empirically, on the BIRD benchmark, Reward-SQL enables models supervised by a 7B PRM to achieve a 13.1% performance gain across various guidance strategies. Notably, our GRPO-aligned policy model based on Qwen2.5-Coder-7B-Instruct achieves 68.9% accuracy on the BIRD development set, outperforming all baseline methods under the same model size. These results demonstrate the effectiveness of Reward-SQL in leveraging reward-based supervision for Text-to-SQL reasoning. Our code is publicly available.

Title: Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

Authors: Lucas Sancéré, Carina Lorenz, Doris Helbig, Oana-Diana Persa, Sonja Dengler, Alexander Kreuter, Martim Laimer, Anne Fröhlich, Jennifer Landsberg, Johannes Brägelmann, Katarzyna Bozek
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.04672
Pdf URL: https://arxiv.org/pdf/2505.04672
Copy Paste: [[2505.04672]] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma(https://arxiv.org/abs/2505.04672)
Keywords: extraction, transformer, segmentation
Abstract: Recent advancements in digital pathology have enabled comprehensive analysis of Whole-Slide Images (WSI) from tissue samples, leveraging high-resolution microscopy and computational capabilities. Despite this progress, there is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue. Here we propose Histo-Miner, a deep learning-based pipeline for analysis of skin WSIs and generate two datasets with labeled nuclei and tumor regions. We develop our pipeline for the analysis of patient samples of cutaneous squamous cell carcinoma (cSCC), a frequent non-melanoma skin cancer. Utilizing the two datasets, comprising 47,392 annotated cell nuclei and 144 tumor-segmented WSIs respectively, both from cSCC patients, Histo-Miner employs convolutional neural networks and vision transformers for nucleus segmentation and classification as well as tumor region segmentation. Performance of trained models positively compares to state of the art with multi-class Panoptic Quality (mPQ) of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification and mean Intersection over Union (mIoU) of 0.884 for tumor region segmentation. From these predictions we generate a compact feature vector summarizing tissue morphology and cellular interactions, which can be used for various downstream tasks. Here, we use Histo-Miner to predict cSCC patient response to immunotherapy based on pre-treatment WSIs from 45 patients. Histo-Miner identifies percentages of lymphocytes, the granulocyte to lymphocyte ratio in tumor vicinity and the distances between granulocytes and plasma cells in tumors as predictive features for therapy response. This highlights the applicability of Histo-Miner to clinically relevant scenarios, providing direct interpretation of the classification and insights into the underlying biology.

Title: REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM

Authors: Madhur Jindal, Saurabh Deshpande
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04673
Pdf URL: https://arxiv.org/pdf/2505.04673
Copy Paste: [[2505.04673]] REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM(https://arxiv.org/abs/2505.04673)
Keywords: defense, attack, large language model
Abstract: Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI) followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest MT defect rate ($16.55 \%$) while Qwen2-VL showed the highest MT refusal rate ($19.1 \%$).

Title: Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Authors: Divyansh Srivastava, Xiang Zhang, He Wen, Chenru Wen, Zhuowen Tu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04718
Pdf URL: https://arxiv.org/pdf/2505.04718
Copy Paste: [[2505.04718]] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers(https://arxiv.org/abs/2505.04718)
Keywords: diffusion, transformer, large language model
Abstract: We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

Title: False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

Authors: Evangelia Christodoulou, Annika Reinke, Pascaline Andrè, Patrick Godau, Piotr Kalinowski, Rola Houhou, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, Veronika Cheplygina, Charles Heitz, Michal Kozubek, Michela Antonelli, Nicola Rieke, Antoine Gilson, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Annette Kopp-Schneider, Gaël Varoquaux, Olivier Colliot, Lena Maier-Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04720
Pdf URL: https://arxiv.org/pdf/2505.04720
Copy Paste: [[2505.04720]] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims(https://arxiv.org/abs/2505.04720)
Keywords: segmentation
Abstract: Performance comparisons are fundamental in medical imaging Artificial Intelligence (AI) research, often driving claims of superiority based on relative improvements in common performance metrics. However, such claims frequently rely solely on empirical mean performance. In this paper, we investigate whether newly proposed methods genuinely outperform the state of the art by analyzing a representative cohort of medical imaging papers. We quantify the probability of false claims based on a Bayesian approach that leverages reported results alongside empirically estimated model congruence to estimate whether the relative ranking of methods is likely to have occurred by chance. According to our results, the majority (>80%) of papers claims outperformance when introducing a new method. Our analysis further revealed a high probability (>5%) of false outperformance claims in 86% of classification papers and 53% of segmentation papers. These findings highlight a critical flaw in current benchmarking practices: claims of outperformance in medical imaging AI are frequently unsubstantiated, posing a risk of misdirecting future research efforts.

Title: SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

Authors: Jingyang Deng, Ran Chen, Jo-Ku Cheng, Jinwen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04723
Pdf URL: https://arxiv.org/pdf/2505.04723
Copy Paste: [[2505.04723]] SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding(https://arxiv.org/abs/2505.04723)
Keywords: large language model
Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving 1.39-1.52$\times$ speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08$\times$ improvement in Rouge-1 score and a 1.17$\times$ enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving 1.02$\times$ improvement in Rouge-1 and 1.06$\times$ in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.

Title: Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting

Authors: Shai Feldman, Stephen Bates, Yaniv Romano
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04733
Pdf URL: https://arxiv.org/pdf/2505.04733
Copy Paste: [[2505.04733]] Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting(https://arxiv.org/abs/2505.04733)
Keywords: robust
Abstract: We introduce a framework for robust uncertainty quantification in situations where labeled training data are corrupted, through noisy or missing labels. We build on conformal prediction, a statistical tool for generating prediction sets that cover the test label with a pre-specified probability. The validity of conformal prediction, however, holds under the i.i.d assumption, which does not hold in our setting due to the corruptions in the data. To account for this distribution shift, the privileged conformal prediction (PCP) method proposed leveraging privileged information (PI) -- additional features available only during training -- to re-weight the data distribution, yielding valid prediction sets under the assumption that the weights are accurate. In this work, we analyze the robustness of PCP to inaccuracies in the weights. Our analysis indicates that PCP can still yield valid uncertainty estimates even when the weights are poorly estimated. Furthermore, we introduce uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Instead, we impute corrupted labels in a way that preserves their uncertainty. Our approach is supported by theoretical guarantees and validated empirically on both synthetic and real benchmarks. Finally, we show that these techniques can be integrated into a triply robust framework, ensuring statistically valid predictions as long as at least one underlying method is valid.

Title: SetONet: A Deep Set-based Operator Network for Solving PDEs with permutation invariant variable input sampling

Authors: Stepan Tretiakov, Xingjian Li, Krishna Kumar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04738
Pdf URL: https://arxiv.org/pdf/2505.04738
Copy Paste: [[2505.04738]] SetONet: A Deep Set-based Operator Network for Solving PDEs with permutation invariant variable input sampling(https://arxiv.org/abs/2505.04738)
Keywords: robust
Abstract: Neural operators, particularly the Deep Operator Network (DeepONet), have shown promise in learning mappings between function spaces for solving differential equations. However, standard DeepONet requires input functions to be sampled at fixed locations, limiting its applicability in scenarios with variable sensor configurations, missing data, or irregular grids. We introduce the Set Operator Network (SetONet), a novel architecture that integrates Deep Sets principles into the DeepONet framework to address this limitation. The core innovation lies in the SetONet branch network, which processes the input function as an unordered \emph{set} of location-value pairs. This design ensures permutation invariance with respect to the input points, making SetONet inherently robust to variations in the number and locations of sensors. SetONet learns richer, spatially-aware input representations by explicitly processing spatial coordinates and function values. We demonstrate SetONet's effectiveness on several benchmark problems, including derivative/anti-derivative operators, 1D Darcy flow, and 2D elasticity. Results show that SetONet successfully learns operators under variable input sampling conditions where standard DeepONet fails. Furthermore, SetONet is architecturally robust to sensor drop-off; unlike standard DeepONet, which requires methods like interpolation to function with missing data. Notably, SetONet can achieve comparable or improved accuracy over DeepONet on fixed grids, particularly for nonlinear problems, likely due to its enhanced input representation. SetONet provides a flexible and robust extension to the neural operator toolkit, significantly broadening the applicability of operator learning to problems with variable or incomplete input data.

Title: Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer

Authors: Sainath Dey, Mitul Goswami, Jashika Sethi, Prasant Kumar Pattnaik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04740
Pdf URL: https://arxiv.org/pdf/2505.04740
Copy Paste: [[2505.04740]] Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer(https://arxiv.org/abs/2505.04740)
Keywords: extraction, transformer, segmentation
Abstract: This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions, prior work has failed to focus on the prebuilt modularity of the ViT architecture and integration of edge detection capabilities of Wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions and Wavelet-KAN (Wav-KAN), leveraging orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated in ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (Image Recognition), COCO (Object Detection and Instance Segmentation), and ADE20K (Semantic Segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.

Title: When Bad Data Leads to Good Models

Authors: Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04741
Pdf URL: https://arxiv.org/pdf/2505.04741
Copy Paste: [[2505.04741]] When Bad Data Leads to Good Models(https://arxiv.org/abs/2505.04741)
Keywords: large language model
Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.

Title: A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models

Authors: Pedro Pinacho-Davidson, Fernando Gutierrez, Pablo Zapata, Rodolfo Vergara, Pablo Aqueveque
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.04784
Pdf URL: https://arxiv.org/pdf/2505.04784
Copy Paste: [[2505.04784]] A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models(https://arxiv.org/abs/2505.04784)
Keywords: secure, security, attack, generative, large language model
Abstract: The emergence of Generative AI (Gen AI) and Large Language Models (LLMs) has enabled more advanced chatbots capable of human-like interactions. However, these conversational agents introduce a broader set of operational risks that extend beyond traditional cybersecurity considerations. In this work, we propose a novel, instrumented risk-assessment metric that simultaneously evaluates potential threats to three key stakeholders: the service-providing organization, end users, and third parties. Our approach incorporates the technical complexity required to induce erroneous behaviors in the chatbot--ranging from non-induced failures to advanced prompt-injection attacks--as well as contextual factors such as the target industry, user age range, and vulnerability severity. To validate our metric, we leverage Garak, an open-source framework for LLM vulnerability testing. We further enhance Garak to capture a variety of threat vectors (e.g., misinformation, code hallucinations, social engineering, and malicious code generation). Our methodology is demonstrated in a scenario involving chatbots that employ retrieval-augmented generation (RAG), showing how the aggregated risk scores guide both short-term mitigation and longer-term improvements in model design and deployment. The results underscore the importance of multi-dimensional risk assessments in operationalizing secure, reliable AI-driven conversational systems.

Title: Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay

Authors: Sriram Mandalika, Harsha Vardhan, Athira Nambiar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04787
Pdf URL: https://arxiv.org/pdf/2505.04787
Copy Paste: [[2505.04787]] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay(https://arxiv.org/abs/2505.04787)
Keywords: generative
Abstract: Continual Learning entails progressively acquiring knowledge from new data while retaining previously acquired knowledge, thereby mitigating ``Catastrophic Forgetting'' in neural networks. Our work presents a novel uncertainty-driven Unsupervised Continual Learning framework using Generative Replay, namely ``Replay to Remember (R2R)''. The proposed R2R architecture efficiently uses unlabelled and synthetic labelled data in a balanced proportion using a cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module. Unlike traditional memory-buffer methods that depend on pretrained models and pseudo-labels, our R2R framework operates without any prior training. It leverages visual features from unlabeled data and adapts continuously using clustering-based uncertainty estimation coupled with dynamic thresholding. Concurrently, a generative replay mechanism along with DeepSeek-R1 powered CLIP VLM produces labelled synthetic data representative of past experiences, resembling biological visual thinking that replays memory to remember and act in new, unseen tasks. Extensive experimental analyses are carried out in CIFAR-10, CIFAR-100, CINIC-10, SVHN and TinyImageNet datasets. Our proposed R2R approach improves knowledge retention, achieving a state-of-the-art performance of 98.13%, 73.06%, 93.41%, 95.18%, 59.74%, respectively, surpassing state-of-the-art performance by over 4.36%.

Title: Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

Authors: Bangyan Liao, Zhenjun Zhao, Haoang Li, Yi Zhou, Yingping Zeng, Hao Li, Peidong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04788
Pdf URL: https://arxiv.org/pdf/2505.04788
Copy Paste: [[2505.04788]] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World(https://arxiv.org/abs/2505.04788)
Keywords: robust
Abstract: Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a ``soft'' association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called \textbf{GlobustVP}), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that \textbf{GlobustVP} achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at this https URL.

Title: DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition

Authors: Kailash A. Hambarde, Nzakiese Mbongo, Pavan Kumar MP, Satish Mekewad, Carolina Fernandes, Gökhan Silahtaroğlu, Alice Nithya, Pawan Wasnik, MD. Rashidunnabi, Pranita Samale, Hugo Proença
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04793
Pdf URL: https://arxiv.org/pdf/2505.04793
Copy Paste: [[2505.04793]] DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition(https://arxiv.org/abs/2505.04793)
Keywords: biometric
Abstract: Person reidentification (ReID) technology has been considered to perform relatively well under controlled, ground-level conditions, but it breaks down when deployed in challenging real-world settings. Evidently, this is due to extreme data variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts from clothing or session drifts. Moreover, the publicly available data sets do not realistically incorporate such kinds and magnitudes of variability, which limits the progress of this technology. This paper introduces DetReIDX, a large-scale aerial-ground person dataset, that was explicitly designed as a stress test to ReID under real-world conditions. DetReIDX is a multi-session set that includes over 13 million bounding boxes from 509 identities, collected in seven university campuses from three continents, with drone altitudes between 5.8 and 120 meters. More important, as a key novelty, DetReIDX subjects were recorded in (at least) two sessions on different days, with changes in clothing, daylight and location, making it suitable to actually evaluate long-term person ReID. Plus, data were annotated from 16 soft biometric attributes and multitask labels for detection, tracking, ReID, and action recognition. In order to provide empirical evidence of DetReIDX usefulness, we considered the specific tasks of human detection and ReID, where SOTA methods catastrophically degrade performance (up to 80% in detection accuracy and over 70% in Rank-1 ReID) when exposed to DetReIDXs conditions. The dataset, annotations, and official evaluation protocols are publicly available at this https URL

Title: Robust ML Auditing using Prior Knowledge

Authors: Jade Garcia Bourrée, Augustin Godinot, Martijn De Vos, Milos Vujasinovic, Sayan Biswas, Gilles Tredan, Erwan Le Merrer, Anne-Marie Kermarrec
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04796
Pdf URL: https://arxiv.org/pdf/2505.04796
Copy Paste: [[2505.04796]] Robust ML Auditing using Prior Knowledge(https://arxiv.org/abs/2505.04796)
Keywords: robust, fair
Abstract: The rapid adoption of ML decision-making systems across products and services has led to a set of regulations on how such systems should behave and be built. Among all the technical challenges to enforcing these regulations, one crucial, yet under-explored problem is the risk of manipulation while these systems are being audited for fairness. This manipulation occurs when a platform deliberately alters its answers to a regulator to pass an audit without modifying its answers to other users. In this paper, we introduce a novel approach to manipulation-proof auditing by taking into account the auditor's prior knowledge of the task solved by the platform. We first demonstrate that regulators must not rely on public priors (e.g. a public dataset), as platforms could easily fool the auditor in such cases. We then formally establish the conditions under which an auditor can prevent audit manipulations using prior knowledge about the ground truth. Finally, our experiments with two standard datasets exemplify the maximum level of unfairness a platform can hide before being detected as malicious. Our formalization and generalization of manipulation-proof auditing with a prior opens up new research directions for more robust fairness audits.

Title: Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems

Authors: Jian Cui, Zichuan Li, Luyi Xing, Xiaojing Liao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.04799
Pdf URL: https://arxiv.org/pdf/2505.04799
Copy Paste: [[2505.04799]] Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems(https://arxiv.org/abs/2505.04799)
Keywords: secure, privacy, attack, large language model
Abstract: Multi-agent collaboration systems (MACS), powered by large language models (LLMs), solve complex problems efficiently by leveraging each agent's specialization and communication between agents. However, the inherent exchange of information between agents and their interaction with external environments, such as LLM, tools, and users, inevitably introduces significant risks of sensitive data leakage, including vulnerabilities to attacks like prompt injection and reconnaissance. Existing MACS fail to enable privacy controls, making it challenging to manage sensitive information securely. In this paper, we take the first step to address the MACS's data leakage threat at the system development level through a privacy-enhanced development paradigm, Maris. Maris enables rigorous message flow control within MACS by embedding reference monitors into key multi-agent conversation components. We implemented Maris as an integral part of AutoGen, a widely adopted open-source multi-agent development framework. Then, we evaluate Maris for its effectiveness and performance overhead on privacy-critical MACS use cases, including healthcare, supply chain optimization, and personalized recommendation system. The result shows that Maris achieves satisfactory effectiveness, performance overhead and practicability for adoption.

Title: ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling

Authors: Xiao Wang, Jong-Youl Choi, Takuya Kurihaya, Isaac Lyngaas, Hong-Jun Yoon, Ming Fan, Nasik Muhammad Nafi, Aristeidis Tsaris, Ashwin M. Aji, Maliha Hossain, Mohamed Wahib, Dali Wang, Peter Thornton, Prasanna Balaprakash, Moetasim Ashfaq, Dan Lu
Subjects: cs.LG, astro-ph.EP, cs.AI, cs.DC, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2505.04802
Pdf URL: https://arxiv.org/pdf/2505.04802
Copy Paste: [[2505.04802]] ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling(https://arxiv.org/abs/2505.04802)
Keywords: robust, transformer
Abstract: Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92-98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. On 7 km resolution benchmarks, ORBIT-2 achieves high accuracy with R^2 scores in the range of 0.98 to 0.99 against observation data.

Title: Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs

Authors: Chetan Pathade
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04806
Pdf URL: https://arxiv.org/pdf/2505.04806
Copy Paste: [[2505.04806]] Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs(https://arxiv.org/abs/2505.04806)
Keywords: security, attack, robust, large language model
Abstract: Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security.

Title: Guide your favorite protein sequence generative model

Authors: Junhao Xiong, Hunter Nisonoff, Ishan Gaur, Jennifer Listgarten
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.04823
Pdf URL: https://arxiv.org/pdf/2505.04823
Copy Paste: [[2505.04823]] Guide your favorite protein sequence generative model(https://arxiv.org/abs/2505.04823)
Keywords: diffusion, generative
Abstract: Generative machine learning models have begun to transform protein engineering, yet no principled framework for conditioning on auxiliary information in a plug-and-play manner exists; one may want to iteratively incorporate experimental feedback, or make use of an existing classifier -- such as for predicting enzyme commission number -- in order to guide the sampling of the generative model to generate sequences with desired properties. Herein, we present ProteinGuide, a rigorous and general framework to achieve just that: through unifying a broad class of protein generative models that includes masked language, (order-agnostic) autoregressive, diffusion and flow-matching models, we provide an approach to statistically condition pre-trained protein generative models. We demonstrate applicability of our approach by guiding each of two commonly used protein generative models, ProteinMPNN and ESM3, to generate amino acid and structure token sequences conditioned on several user-specified properties, namely, enhanced stability and CATH-labeled fold generation.

Title: Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?

Authors: Shashank Agnihotri, David Schader, Nico Sharei, Mehmet Ege Kaçar, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04835
Pdf URL: https://arxiv.org/pdf/2505.04835
Copy Paste: [[2505.04835]] Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?(https://arxiv.org/abs/2505.04835)
Keywords: robust, segmentation
Abstract: Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: this https URL

Title: Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04842
Pdf URL: https://arxiv.org/pdf/2505.04842
Copy Paste: [[2505.04842]] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers(https://arxiv.org/abs/2505.04842)
Keywords: generative
Abstract: Prevalent reinforcement learning~(RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. In this work, we propose RL$^V$ that augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20\% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2-1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.

Title: Osiris: A Lightweight Open-Source Hallucination Detection System

Authors: Alex Shan, John Bauer, Christopher D. Manning
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04844
Pdf URL: https://arxiv.org/pdf/2505.04844
Copy Paste: [[2505.04844]] Osiris: A Lightweight Open-Source Hallucination Detection System(https://arxiv.org/abs/2505.04844)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) systems have gained widespread adoption by application builders because they leverage sources of truth to enable Large Language Models (LLMs) to generate more factually sound responses. However, hallucinations, instances of LLM responses that are unfaithful to the provided context, often prevent these systems from being deployed in production environments. Current hallucination detection methods typically involve human evaluation or the use of closed-source models to review RAG system outputs for hallucinations. Both human evaluators and closed-source models suffer from scaling issues due to their high costs and slow inference speeds. In this work, we introduce a perturbed multi-hop QA dataset with induced hallucinations. Via supervised fine-tuning on our dataset, we achieve better recall with a 7B model than GPT-4o on the RAGTruth hallucination detection benchmark and offer competitive performance on precision and accuracy, all while using a fraction of the parameters. Code is released at our repository.

Title: Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model

Authors: Navin Ranjan, Andreas Savakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04861
Pdf URL: https://arxiv.org/pdf/2505.04861
Copy Paste: [[2505.04861]] Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model(https://arxiv.org/abs/2505.04861)
Keywords: segmentation
Abstract: The Segment Anything Model (SAM) is a popular vision foundation model; however, its high computational and memory demands make deployment on resource-constrained devices challenging. While Post-Training Quantization (PTQ) is a practical approach for reducing computational overhead, existing PTQ methods rely on fixed bit-width quantization, leading to suboptimal accuracy and efficiency. To address this limitation, we propose Mix-QSAM, a mixed-precision PTQ framework for SAM. First, we introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer's contribution to the model's output. Second, we introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers. This ensures that highly interdependent layers maintain similar bit-widths, preventing abrupt precision mismatches that degrade feature propagation and numerical stability. Using these metrics, we formulate an Integer Quadratic Programming (IQP) problem to determine optimal bit-width allocation under model size and bit-operation constraints, assigning higher precision to critical layers while minimizing bit-width in less influential layers. Experimental results demonstrate that Mix-QSAM consistently outperforms existing PTQ methods on instance segmentation and object detection tasks, achieving up to 20% higher average precision under 6-bit and 4-bit mixed-precision settings, while maintaining computational efficiency.

Title: Auto-regressive transformation for image alignment

Authors: Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04864
Pdf URL: https://arxiv.org/pdf/2505.04864
Copy Paste: [[2505.04864]] Auto-regressive transformation for image alignment(https://arxiv.org/abs/2505.04864)
Keywords: robust
Abstract: Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges improves through iterative refinement of the transformation field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations within an auto-regressive framework. Leveraging hierarchical multi-scale features, our network refines the transformations using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments across diverse datasets demonstrate that ART significantly outperforms state-of-the-art methods, establishing it as a powerful new method for precise image alignment with broad applicability.

Title: Federated Learning for Cyber Physical Systems: A Comprehensive Survey

Authors: Minh K. Quan, Pubudu N. Pathirana, Mayuri Wijayasundara, Sujeeva Setunge, Dinh C. Nguyen, Christopher G. Brinton, David J. Love, H. Vincent Poor
Subjects: cs.LG, cs.AI, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2505.04873
Pdf URL: https://arxiv.org/pdf/2505.04873
Copy Paste: [[2505.04873]] Federated Learning for Cyber Physical Systems: A Comprehensive Survey(https://arxiv.org/abs/2505.04873)
Keywords: security, privacy, federate
Abstract: The integration of machine learning (ML) in cyber physical systems (CPS) is a complex task due to the challenges that arise in terms of real-time decision making, safety, reliability, device heterogeneity, and data privacy. There are also open research questions that must be addressed in order to fully realize the potential of ML in CPS. Federated learning (FL), a distributed approach to ML, has become increasingly popular in recent years. It allows models to be trained using data from decentralized sources. This approach has been gaining popularity in the CPS field, as it integrates computer, communication, and physical processes. Therefore, the purpose of this work is to provide a comprehensive analysis of the most recent developments of FL-CPS, including the numerous application areas, system topologies, and algorithms developed in recent years. The paper starts by discussing recent advances in both FL and CPS, followed by their integration. Then, the paper compares the application of FL in CPS with its applications in the internet of things (IoT) in further depth to show their connections and distinctions. Furthermore, the article scrutinizes how FL is utilized in critical CPS applications, e.g., intelligent transportation systems, cybersecurity services, smart cities, and smart healthcare solutions. The study also includes critical insights and lessons learned from various FL-CPS implementations. The paper's concluding section delves into significant concerns and suggests avenues for further research in this fast-paced and dynamic era.

Title: Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection

Authors: Tharindu Fernando, Clinton Fookes, Sridha Sridharan, Simon Denman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04888
Pdf URL: https://arxiv.org/pdf/2505.04888
Copy Paste: [[2505.04888]] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection(https://arxiv.org/abs/2505.04888)
Keywords: generative
Abstract: Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.

Title: FedRE: Robust and Effective Federated Learning with Privacy Preference

Authors: Tianzhe Xiao, Yichen Li, Yu Zhou, Yining Qi, Yi Liu, Wei Wang, Haozhao Wang, Yi Wang, Ruixuan Li
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.04889
Pdf URL: https://arxiv.org/pdf/2505.04889
Copy Paste: [[2505.04889]] FedRE: Robust and Effective Federated Learning with Privacy Preference(https://arxiv.org/abs/2505.04889)
Keywords: privacy, protect, robust, federate
Abstract: Despite Federated Learning (FL) employing gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of uploaded gradients from clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing methods fail to take practical issues into account by merely perturbing each sample with the same mechanism while each client may have their own privacy preferences on privacy-sensitive information (PSI), which is not uniformly distributed across the raw data. In such a case, excessive privacy protection from private-insensitive information can additionally introduce unnecessary noise, which may degrade the model performance. In this work, we study the PSI within data and develop FedRE, that can simultaneously achieve robustness and effectiveness benefits with LDP protection. More specifically, we first define PSI with regard to the privacy preferences of each client. Then, we optimize the LDP by allocating less privacy budget to gradients with higher PSI in a layer-wise manner, thus providing a stricter privacy guarantee for PSI. Furthermore, to mitigate the performance degradation caused by LDP, we design a parameter aggregation mechanism based on the distribution of the perturbed information. We conducted experiments with text tamper detection on T-SROIE and DocTamper datasets, and FedRE achieves competitive performance compared to state-of-the-art methods.

Title: Clustering with Communication: A Variational Framework for Single Cell Representation Learning

Authors: Cong Qi, Yeqing Chen, Jie Zhang, Wei Zhi
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.04891
Pdf URL: https://arxiv.org/pdf/2505.04891
Copy Paste: [[2505.04891]] Clustering with Communication: A Variational Framework for Single Cell Representation Learning(https://arxiv.org/abs/2505.04891)
Keywords: generative
Abstract: Single-cell RNA sequencing (scRNA-seq) has revealed complex cellular heterogeneity, but recent studies emphasize that understanding biological function also requires modeling cell-cell communication (CCC), the signaling interactions mediated by ligand-receptor pairs that coordinate cellular behavior. Tools like CellChat have demonstrated that CCC plays a critical role in processes such as cell differentiation, tissue regeneration, and immune response, and that transcriptomic data inherently encodes rich information about intercellular signaling. We propose CCCVAE, a novel variational autoencoder framework that incorporates CCC signals into single-cell representation learning. By leveraging a communication-aware kernel derived from ligand-receptor interactions and a sparse Gaussian process, CCCVAE encodes biologically informed priors into the latent space. Unlike conventional VAEs that treat each cell independently, CCCVAE encourages latent embeddings to reflect both transcriptional similarity and intercellular signaling context. Empirical results across four scRNA-seq datasets show that CCCVAE improves clustering performance, achieving higher evaluation scores than standard VAE baselines. This work demonstrates the value of embedding biological priors into deep generative models for unsupervised single-cell analysis.

Title: Memory Under Siege: A Comprehensive Survey of Side-Channel Attacks on Memory

Authors: MD Mahady Hassan, Shanto Roy, Reza Rahaeimehr
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.04896
Pdf URL: https://arxiv.org/pdf/2505.04896
Copy Paste: [[2505.04896]] Memory Under Siege: A Comprehensive Survey of Side-Channel Attacks on Memory(https://arxiv.org/abs/2505.04896)
Keywords: security, defense, attack
Abstract: Side-channel attacks on memory (SCAM) exploit unintended data leaks from memory subsystems to infer sensitive information, posing significant threats to system security. These attacks exploit vulnerabilities in memory access patterns, cache behaviors, and other microarchitectural features to bypass traditional security measures. The purpose of this research is to examine SCAM, classify various attack techniques, and evaluate existing defense mechanisms. It guides researchers and industry professionals in improving memory security and mitigating emerging threats. We begin by identifying the major vulnerabilities in the memory system that are frequently exploited in SCAM, such as cache timing, speculative execution, \textit{Rowhammer}, and other sophisticated approaches. Next, we outline a comprehensive taxonomy that systematically classifies these attacks based on their types, target systems, attack vectors, and adversarial capabilities required to execute them. In addition, we review the current landscape of mitigation strategies, emphasizing their strengths and limitations. This work aims to provide a comprehensive overview of memory-based side-channel attacks with the goal of providing significant insights for researchers and practitioners to better understand, detect, and mitigate SCAM risks.

Title: OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

Authors: Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen, Zhiliang Lyu, Dufan Wu, Ning Guo, Xiang Li, Quanzheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04899
Pdf URL: https://arxiv.org/pdf/2505.04899
Copy Paste: [[2505.04899]] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging(https://arxiv.org/abs/2505.04899)
Keywords: interpretability, segmentation
Abstract: Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.

Title: Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Authors: Xi Yang, Songsong Duan, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04905
Pdf URL: https://arxiv.org/pdf/2505.04905
Copy Paste: [[2505.04905]] Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization(https://arxiv.org/abs/2505.04905)
Keywords: transformer, segmentation
Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNN and the self-attention map of transformer to identify the region of objects. However, both CAM and self-attention maps can not learn pixel-level fine-grained information on the foreground objects, which hinders the further advance of WSOL. To address this problem, we initiatively leverage the capability of zero-shot generalization and fine-grained segmentation in Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue accrued in single point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask, which avoids the lack of objects caused by a single point/box prompt. Finally, we propose a pixel-level similarity metric to come true the mask matching from mask prompt to SAM, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03\% and 66.85\% Top-1 Loc, respectively.

Title: VaCDA: Variational Contrastive Alignment-based Scalable Human Activity Recognition

Authors: Soham Khisa, Avijoy Chakma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04907
Pdf URL: https://arxiv.org/pdf/2505.04907
Copy Paste: [[2505.04907]] VaCDA: Variational Contrastive Alignment-based Scalable Human Activity Recognition(https://arxiv.org/abs/2505.04907)
Keywords: robust
Abstract: Technological advancements have led to the rise of wearable devices with sensors that continuously monitor user activities, generating vast amounts of unlabeled data. This data is challenging to interpret, and manual annotation is labor-intensive and error-prone. Additionally, data distribution is often heterogeneous due to device placement, type, and user behavior variations. As a result, traditional transfer learning methods perform suboptimally, making it difficult to recognize daily activities. To address these challenges, we use a variational autoencoder (VAE) to learn a shared, low-dimensional latent space from available sensor data. This space generalizes data across diverse sensors, mitigating heterogeneity and aiding robust adaptation to the target domain. We integrate contrastive learning to enhance feature representation by aligning instances of the same class across domains while separating different classes. We propose Variational Contrastive Domain Adaptation (VaCDA), a multi-source domain adaptation framework combining VAEs and contrastive learning to improve feature representation and reduce heterogeneity between source and target domains. We evaluate VaCDA on multiple publicly available datasets across three heterogeneity scenarios: cross-person, cross-position, and cross-device. VaCDA outperforms the baselines in cross-position and cross-device scenarios.

Title: SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

Authors: Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, Hiroyuki Sakai
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04911
Pdf URL: https://arxiv.org/pdf/2505.04911
Copy Paste: [[2505.04911]] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models(https://arxiv.org/abs/2505.04911)
Keywords: large language model
Abstract: This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.

Title: GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Authors: Tong Wang, Ting Liu, Xiaochao Qu, Chengjing Wu, Luoqi Liu, Xiaolin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04915
Pdf URL: https://arxiv.org/pdf/2505.04915
Copy Paste: [[2505.04915]] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing(https://arxiv.org/abs/2505.04915)
Keywords: extraction, diffusion
Abstract: Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02\% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28\%.

Title: An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

Authors: Ramteja Sajja, Yusuf Sermet, Ibrahim Demir
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.04916
Pdf URL: https://arxiv.org/pdf/2505.04916
Copy Paste: [[2505.04916]] An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education(https://arxiv.org/abs/2505.04916)
Keywords: large language model
Abstract: Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.

Title: A Simple Detector with Frame Dynamics is a Strong Tracker

Authors: Chenxu Peng, Chenxu Wang, Minrui Zou, Danyang Li, Zhengpeng Yang, Yimian Dai, Ming-Ming Cheng, Xiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04917
Pdf URL: https://arxiv.org/pdf/2505.04917
Copy Paste: [[2505.04917]] A Simple Detector with Frame Dynamics is a Strong Tracker(https://arxiv.org/abs/2505.04917)
Keywords: robust
Abstract: Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle (Anti-UAV) applications. Existing trackers often depend on cropped template regions and have limited motion modeling capabilities, which pose challenges when dealing with tiny targets. To address this, we propose a simple yet effective infrared tiny-object tracker that enhances tracking performance by integrating global detection and motion-aware learning with temporal priors. Our method is based on object detection and achieves significant improvements through two key innovations. First, we introduce frame dynamics, leveraging frame difference and optical flow to encode both prior target features and motion characteristics at the input level, enabling the model to better distinguish the target from background clutter. Second, we propose a trajectory constraint filtering strategy in the post-processing stage, utilizing spatio-temporal priors to suppress false positives and enhance tracking robustness. Extensive experiments show that our method consistently outperforms existing approaches across multiple metrics in challenging infrared UAV tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.

Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04921
Pdf URL: https://arxiv.org/pdf/2505.04921
Copy Paste: [[2505.04921]] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models(https://arxiv.org/abs/2505.04921)
Keywords: robust
Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.

Title: Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training

Authors: Xingzeng Lan, Xing Duan, Chen Chen, Weiyu Lin, Bo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04922
Pdf URL: https://arxiv.org/pdf/2505.04922
Copy Paste: [[2505.04922]] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training(https://arxiv.org/abs/2505.04922)
Keywords: secure, privacy, biometric
Abstract: Palmprint recognition is a secure and privacy-friendly method of biometric identification. One of the major challenges to improve palmprint recognition accuracy is the scarcity of palmprint data. Recently, a popular line of research revolves around the synthesis of virtual palmprints for large-scale pre-training purposes. In this paper, we propose a novel synthesis method named Canny2Palm that extracts palm textures with Canny edge detector and uses them to condition a Pix2Pix network for realistic palmprint generation. By re-assembling palmprint textures from different identities, we are able to create new identities by seeding the generator with new assemblies. Canny2Palm not only synthesizes realistic data following the distribution of real palmprints but also enables controllable diversity to generate large-scale new identities. On open-set palmprint recognition benchmarks, models pre-trained with Canny2Palm synthetic data outperform the state-of-the-art with up to 7.2% higher identification accuracy. Moreover, the performance of models pre-trained with Canny2Palm continues to improve given 10,000 synthetic IDs while those with existing methods already saturate, demonstrating the potential of our method for large-scale pre-training.

Title: Fair Uncertainty Quantification for Depression Prediction

Authors: Yonghong Li, Xiuzhuang Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04931
Pdf URL: https://arxiv.org/pdf/2505.04931
Copy Paste: [[2505.04931]] Fair Uncertainty Quantification for Depression Prediction(https://arxiv.org/abs/2505.04931)
Keywords: fair
Abstract: Trustworthy depression prediction based on deep learning, incorporating both predictive reliability and algorithmic fairness across diverse demographic groups, is crucial for clinical application. Recently, achieving reliable depression predictions through uncertainty quantification has attracted increasing attention. However, few studies have focused on the fairness of uncertainty quantification (UQ) in depression prediction. In this work, we investigate the algorithmic fairness of UQ, namely Equal Opportunity Coverage (EOC) fairness, and propose Fair Uncertainty Quantification (FUQ) for depression prediction. FUQ pursues reliable and fair depression predictions through group-based analysis. Specifically, we first group all the participants by different sensitive attributes and leverage conformal prediction to quantify uncertainty within each demographic group, which provides a theoretically guaranteed and valid way to quantify uncertainty for depression prediction and facilitates the investigation of fairness across different demographic groups. Furthermore, we propose a fairness-aware optimization strategy that formulates fairness as a constrained optimization problem under EOC constraints. This enables the model to preserve predictive reliability while adapting to the heterogeneous uncertainty levels across demographic groups, thereby achieving optimal fairness. Through extensive evaluations on several visual and audio depression datasets, our approach demonstrates its effectiveness.

Title: FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration

Authors: Ying Zhang, Shuai Guo, Chenxi Sun, Yuchen Zhu, Jinhai Xiang
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2505.04938
Pdf URL: https://arxiv.org/pdf/2505.04938
Copy Paste: [[2505.04938]] FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration(https://arxiv.org/abs/2505.04938)
Keywords: extraction
Abstract: In recent years, deformable medical image registration techniques have made significant progress. However, existing models still lack efficiency in parallel extraction of coarse and fine-grained features. To address this, we construct a new pyramid registration network based on feature and deformation field (FF-PNet). For coarse-grained feature extraction, we design a Residual Feature Fusion Module (RFFM), for fine-grained image deformation, we propose a Residual Deformation Field Fusion Module (RDFFM). Through the parallel operation of these two modules, the model can effectively handle complex image deformations. It is worth emphasizing that the encoding stage of FF-PNet only employs traditional convolutional neural networks without any attention mechanisms or multilayer perceptrons, yet it still achieves remarkable improvements in registration accuracy, fully demonstrating the superior feature decoding capabilities of RFFM and RDFFM. We conducted extensive experiments on the LPBA and OASIS datasets. The results show our network consistently outperforms popular methods in metrics like the Dice Similarity Coefficient.

Title: Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping

Authors: Jiepan Li, He Huang, Yu Sheng, Yujun Guo, Wei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04941
Pdf URL: https://arxiv.org/pdf/2505.04941
Copy Paste: [[2505.04941]] Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping(https://arxiv.org/abs/2505.04941)
Keywords: secure, extraction, segmentation
Abstract: Accurate building damage assessment using bi-temporal multi-modal remote sensing images is essential for effective disaster response and recovery planning. This study proposes a novel Building-Guided Pseudo-Label Learning Framework to address the challenges of mapping building damage from pre-disaster optical and post-disaster SAR images. First, we train a series of building extraction models using pre-disaster optical images and building labels. To enhance building segmentation, we employ multi-model fusion and test-time augmentation strategies to generate pseudo-probabilities, followed by a low-uncertainty pseudo-label training method for further refinement. Next, a change detection model is trained on bi-temporal cross-modal images and damaged building labels. To improve damage classification accuracy, we introduce a building-guided low-uncertainty pseudo-label refinement strategy, which leverages building priors from the previous step to guide pseudo-label generation for damaged buildings, reducing uncertainty and enhancing reliability. Experimental results on the 2025 IEEE GRSS Data Fusion Contest dataset demonstrate the effectiveness of our approach, which achieved the highest mIoU score (54.28%) and secured first place in the competition.

Title: Chain-of-Thought Tokens are Computer Program Variables

Authors: Fangwei Zhu, Peiyi Wang, Zhifang Sui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04955
Pdf URL: https://arxiv.org/pdf/2505.04955
Copy Paste: [[2505.04955]] Chain-of-Thought Tokens are Computer Program Variables(https://arxiv.org/abs/2505.04955)
Keywords: large language model
Abstract: Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective to help LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at this https URL.

Title: Graffe: Graph Representation Learning via Diffusion Probabilistic Models

Authors: Dingshuo Chen, Shuchen Xue, Liuji Chen, Yingheng Wang, Qiang Liu, Shu Wu, Zhi-Ming Ma, Liang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04956
Pdf URL: https://arxiv.org/pdf/2505.04956
Copy Paste: [[2505.04956]] Graffe: Graph Representation Learning via Diffusion Probabilistic Models(https://arxiv.org/abs/2505.04956)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models (DPMs), widely recognized for their potential to generate high-quality samples, tend to go unnoticed in representation learning. While recent progress has highlighted their potential for capturing visual semantics, adapting DPMs to graph representation learning remains in its infancy. In this paper, we introduce Graffe, a self-supervised diffusion model proposed for graph representation learning. It features a graph encoder that distills a source graph into a compact representation, which, in turn, serves as the condition to guide the denoising process of the diffusion decoder. To evaluate the effectiveness of our model, we first explore the theoretical foundations of applying diffusion models to representation learning, proving that the denoising objective implicitly maximizes the conditional mutual information between data and its representation. Specifically, we prove that the negative logarithm of the denoising score matching loss is a tractable lower bound for the conditional mutual information. Empirically, we conduct a series of case studies to validate our theoretical insights. In addition, Graffe delivers competitive results under the linear probing setting on node and graph classification tasks, achieving state-of-the-art performance on 9 of the 11 real-world datasets. These findings indicate that powerful generative models, especially diffusion models, serve as an effective tool for graph representation learning.

Title: ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

Authors: Onkar Susladkar, Gayatri Deshmukh, Yalcin Tur, Ulas Bagci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04963
Pdf URL: https://arxiv.org/pdf/2505.04963
Copy Paste: [[2505.04963]] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis(https://arxiv.org/abs/2505.04963)
Keywords: diffusion, segmentation
Abstract: Synthesizing medical images remains challenging due to limited annotated pathological data, modality domain gaps, and the complexity of representing diffuse pathologies such as liver cirrhosis. Existing methods often struggle to maintain anatomical fidelity while accurately modeling pathological features, frequently relying on priors derived from natural images or inefficient multi-step sampling. In this work, we introduce ViCTr (Vital Consistency Transfer), a novel two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity, pathology-aware image synthesis. First, we pretrain ViCTr on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to preserve critical anatomical structures. We then fine-tune the model adversarially with Low-Rank Adaptation (LoRA) modules for precise control over pathology severity. By reformulating Tweedie's formula within a linear trajectory framework, ViCTr supports one-step sampling, reducing inference from 50 steps to just 4, without sacrificing anatomical realism. We evaluate ViCTr on BTCV (CT), AMOS (MRI), and CirrMRI600+ (cirrhosis) datasets. Results demonstrate state-of-the-art performance, achieving a Medical Frechet Inception Distance (MFID) of 17.01 for cirrhosis synthesis 28% lower than existing approaches and improving nnUNet segmentation by +3.8% mDSC when used for data augmentation. Radiologist reviews indicate that ViCTr-generated liver cirrhosis MRIs are clinically indistinguishable from real scans. To our knowledge, ViCTr is the first method to provide fine-grained, pathology-aware MRI synthesis with graded severity control, closing a critical gap in AI-driven medical imaging research.

Title: DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, Zhongchao Shi, Gao Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04965
Pdf URL: https://arxiv.org/pdf/2505.04965
Copy Paste: [[2505.04965]] DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding(https://arxiv.org/abs/2505.04965)
Keywords: robust, large language model
Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.

Title: ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Authors: Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04974
Pdf URL: https://arxiv.org/pdf/2505.04974
Copy Paste: [[2505.04974]] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment(https://arxiv.org/abs/2505.04974)
Keywords: diffusion
Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: this https URL.

Title: ChainMarks: Securing DNN Watermark with Cryptographic Chain

Authors: Brian Choi, Shu Wang, Isabelle Choi, Kun Sun
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04977
Pdf URL: https://arxiv.org/pdf/2505.04977
Copy Paste: [[2505.04977]] ChainMarks: Securing DNN Watermark with Cryptographic Chain(https://arxiv.org/abs/2505.04977)
Keywords: secure, security, protect, attack, robust, watermark
Abstract: With the widespread deployment of deep neural network (DNN) models, dynamic watermarking techniques are being used to protect the intellectual property of model owners. However, recent studies have shown that existing watermarking schemes are vulnerable to watermark removal and ambiguity attacks. Besides, the vague criteria for determining watermark presence further increase the likelihood of such attacks. In this paper, we propose a secure DNN watermarking scheme named ChainMarks, which generates secure and robust watermarks by introducing a cryptographic chain into the trigger inputs and utilizes a two-phase Monte Carlo method for determining watermark presence. First, ChainMarks generates trigger inputs as a watermark dataset by repeatedly applying a hash function over a secret key, where the target labels associated with trigger inputs are generated from the digital signature of model owner. Then, the watermarked model is produced by training a DNN over both the original and watermark datasets. To verify watermarks, we compare the predicted labels of trigger inputs with the target labels and determine ownership with a more accurate decision threshold that considers the classification probability of specific models. Experimental results show that ChainMarks exhibits higher levels of robustness and security compared to state-of-the-art watermarking schemes. With a better marginal utility, ChainMarks provides a higher probability guarantee of watermark presence in DNN models with the same level of watermark accuracy.

Title: Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization

Authors: Zhuang Qi, Sijin Zhou, Lei Meng, Han Hu, Han Yu, Xiangxu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04979
Pdf URL: https://arxiv.org/pdf/2505.04979
Copy Paste: [[2505.04979]] Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization(https://arxiv.org/abs/2505.04979)
Keywords: federate
Abstract: Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting degraded performance. Existing methods either use data augmentation for increasing sample diversity or knowledge distillation for learning invariant representations to address this problem. However, they lack a comprehensive analysis of the inference paths, and the interference from confounding factors limits their performance. To address these limitations, we propose the \underline{Fed}erated \underline{D}econfounding and \underline{D}ebiasing \underline{L}earning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process, and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module to construct causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on 2 benchmarking datasets demonstrate that \methodname{} significantly enhances the model capability to focus on main objects in unseen data, leading to 4.5\% higher Top-1 Accuracy on average over 9 state-of-the-art existing methods.

Title: Graph Neural Network Aided Deep Reinforcement Learning for Resource Allocation in Dynamic Terahertz UAV Networks

Authors: Zhifeng Hu, Chong Han
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04981
Pdf URL: https://arxiv.org/pdf/2505.04981
Copy Paste: [[2505.04981]] Graph Neural Network Aided Deep Reinforcement Learning for Resource Allocation in Dynamic Terahertz UAV Networks(https://arxiv.org/abs/2505.04981)
Keywords: security, robust
Abstract: Terahertz (THz) unmanned aerial vehicle (UAV) networks with flexible topologies and ultra-high data rates are expected to empower numerous applications in security surveillance, disaster response, and environmental monitoring, among others. However, the dynamic topologies hinder the efficient long-term joint power and antenna array resource allocation for THz links among UAVs. Furthermore, the continuous nature of power and the discrete nature of antennas cause this joint resource allocation problem to be a mixed-integer nonlinear programming (MINLP) problem with non-convexity and NP-hardness. Inspired by recent rapid advancements in deep reinforcement learning (DRL), a graph neural network (GNN) aided DRL algorithm for resource allocation in the dynamic THz UAV network with an emphasis on self-node features (GLOVE) is proposed in this paper, with the aim of resource efficiency (RE) maximization. When training the allocation policy for each UAV, GLOVE learns the relationship between this UAV and its neighboring UAVs via GNN, while also emphasizing the important self-node features of this UAV. In addition, a multi-task structure is leveraged by GLOVE to cooperatively train resource allocation decisions for the power and sub-arrays of all UAVs. Experimental results illustrate that GLOVE outperforms benchmark schemes in terms of the highest RE and the lowest latency. Moreover, unlike the benchmark methods with severe packet loss, GLOVE maintains zero packet loss during the entire training process, demonstrating its better robustness under the highly dynamic THz UAV network.

Title: Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04993
Pdf URL: https://arxiv.org/pdf/2505.04993
Copy Paste: [[2505.04993]] Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes(https://arxiv.org/abs/2505.04993)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for the responsible deployment of powerful LLMs.

Title: Rethinking Invariance in In-context Learning

Authors: Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04994
Pdf URL: https://arxiv.org/pdf/2505.04994
Copy Paste: [[2505.04994]] Rethinking Invariance in In-context Learning(https://arxiv.org/abs/2505.04994)
Keywords: large language model
Abstract: In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at this https URL.

Title: StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps

Authors: Lang Nie, Chunyu Lin, Kang Liao, Yun Zhang, Shuaicheng Liu, Yao Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05001
Pdf URL: https://arxiv.org/pdf/2505.05001
Copy Paste: [[2505.05001]] StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps(https://arxiv.org/abs/2505.05001)
Keywords: robust
Abstract: We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.

Title: Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort

Authors: Hendrik Möller, Hanna Schön, Alina Dima, Benjamin Keinert-Weth, Robert Graf, Matan Atad, Johannes Paetzold, Friederike Jungmann, Rickmer Braren, Florian Kofler, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05004
Pdf URL: https://arxiv.org/pdf/2505.05004
Copy Paste: [[2505.05004]] Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort(https://arxiv.org/abs/2505.05004)
Keywords: segmentation
Abstract: Thoracolumbar stump ribs are one of the essential indicators of thoracolumbar transitional vertebrae or enumeration anomalies. While some studies manually assess these anomalies and describe the ribs qualitatively, this study aims to automate thoracolumbar stump rib detection and analyze their morphology quantitatively. To this end, we train a high-resolution deep-learning model for rib segmentation and show significant improvements compared to existing models (Dice score 0.997 vs. 0.779, p-value < 0.01). In addition, we use an iterative algorithm and piece-wise linear interpolation to assess the length of the ribs, showing a success rate of 98.2%. When analyzing morphological features, we show that stump ribs articulate more posteriorly at the vertebrae (-19.2 +- 3.8 vs -13.8 +- 2.5, p-value < 0.01), are thinner (260.6 +- 103.4 vs. 563.6 +- 127.1, p-value < 0.01), and are oriented more downwards and sideways within the first centimeters in contrast to full-length ribs. We show that with partially visible ribs, these features can achieve an F1-score of 0.84 in differentiating stump ribs from regular ones. We publish the model weights and masks for public use.

Title: Adaptive Contextual Embedding for Robust Far-View Borehole Detection

Authors: Xuesong Liu, Tianyu Hao, Emmett J. Ientilucci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05008
Pdf URL: https://arxiv.org/pdf/2505.05008
Copy Paste: [[2505.05008]] Adaptive Contextual Embedding for Robust Far-View Borehole Detection(https://arxiv.org/abs/2505.05008)
Keywords: robust, extraction
Abstract: In controlled blasting operations, accurately detecting densely distributed tiny boreholes from far-view imagery is critical for operational safety and efficiency. However, existing detection methods often struggle due to small object scales, highly dense arrangements, and limited distinctive visual features of boreholes. To address these challenges, we propose an adaptive detection approach that builds upon existing architectures (e.g., YOLO) by explicitly leveraging consistent embedding representations derived through exponential moving average (EMA)-based statistical updates. Our method introduces three synergistic components: (1) adaptive augmentation utilizing dynamically updated image statistics to robustly handle illumination and texture variations; (2) embedding stabilization to ensure consistent and reliable feature extraction; and (3) contextual refinement leveraging spatial context for improved detection accuracy. The pervasive use of EMA in our method is particularly advantageous given the limited visual complexity and small scale of boreholes, allowing stable and robust representation learning even under challenging visual conditions. Experiments on a challenging proprietary quarry-site dataset demonstrate substantial improvements over baseline YOLO-based architectures, highlighting our method's effectiveness in realistic and complex industrial scenarios.

Title: An Agent-Based Modeling Approach to Free-Text Keyboard Dynamics for Continuous Authentication

Authors: Roberto Dillon, Arushi
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.05015
Pdf URL: https://arxiv.org/pdf/2505.05015
Copy Paste: [[2505.05015]] An Agent-Based Modeling Approach to Free-Text Keyboard Dynamics for Continuous Authentication(https://arxiv.org/abs/2505.05015)
Keywords: security, robust, biometric
Abstract: Continuous authentication systems leveraging free-text keyboard dynamics offer a promising additional layer of security in a multifactor authentication setup that can be used in a transparent way with no impact on user experience. This study investigates the efficacy of behavioral biometrics by employing an Agent-Based Model (ABM) to simulate diverse typing profiles across mechanical and membrane keyboards. Specifically, we generated synthetic keystroke data from five unique agents, capturing features related to dwell time, flight time, and error rates within sliding 5-second windows updated every second. Two machine learning approaches, One-Class Support Vector Machine (OC-SVM) and Random Forest (RF), were evaluated for user verification. Results revealed a stark contrast in performance: while One-Class SVM failed to differentiate individual users within each group, Random Forest achieved robust intra-keyboard user recognition (Accuracy > 0.7) but struggled to generalize across keyboards for the same user, highlighting the significant impact of keyboard hardware on typing behavior. These findings suggest that: (1) keyboard-specific user profiles may be necessary for reliable authentication, and (2) ensemble methods like RF outperform One-Class SVM in capturing fine-grained user-specific patterns.

Title: The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations

Authors: Cedric Waterschoot, Nava Tintarev, Francesco Barile
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.05016
Pdf URL: https://arxiv.org/pdf/2505.05016
Copy Paste: [[2505.05016]] The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations(https://arxiv.org/abs/2505.05016)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including In-Context learning or generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase the performance at higher degrees of group complexity, while adding other prompt modifications, specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.

Title: Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05017
Pdf URL: https://arxiv.org/pdf/2505.05017
Copy Paste: [[2505.05017]] Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization(https://arxiv.org/abs/2505.05017)
Keywords: large language model
Abstract: Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute ``multi-stage'' influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at this https URL.

Title: Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints

Authors: Waldemar Hahn, Jan-Niklas Eckardt, Christoph Röllig, Martin Sedlmayr, Jan Moritz Middeke, Markus Wolfien
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05019
Pdf URL: https://arxiv.org/pdf/2505.05019
Copy Paste: [[2505.05019]] Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints(https://arxiv.org/abs/2505.05019)
Keywords: privacy, robust, generative
Abstract: The generation of synthetic clinical trial data offers a promising approach to mitigating privacy concerns and data accessibility limitations in medical research. However, ensuring that synthetic datasets maintain high fidelity, utility, and adherence to domain-specific constraints remains a key challenge. While hyperparameter optimization (HPO) has been shown to improve generative model performance, the effectiveness of different optimization strategies for synthetic clinical data remains unclear. This study systematically evaluates four HPO strategies across eight generative models, comparing single-metric optimization against compound metric optimization approaches. Our results demonstrate that HPO consistently improves synthetic data quality, with TVAE, CTGAN, and CTAB-GAN+ achieving improvements of up to 60%, 39%, and 38%, respectively. Compound metric optimization outperformed single-metric strategies, producing more balanced and generalizable synthetic datasets. Interestingly, HPO alone is insufficient to ensure clinically valid synthetic data, as all models exhibited violations of fundamental survival constraints. Preprocessing and postprocessing played a crucial role in reducing these violations, as models lacking robust processing steps produced invalid data in up to 61% of cases. These findings underscore the necessity of integrating explicit domain knowledge alongside HPO to create high quality synthetic datasets. Our study provides actionable recommendations for improving synthetic data generation, with future research needed to refine metric selection and validate these findings on larger datasets to enhance clinical applicability.

Title: Generative Models for Long Time Series: Approximately Equivariant Recurrent Network Structures for an Adjusted Training Scheme

Authors: Ruwen Fulek, Markus Lange-Hegermann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05020
Pdf URL: https://arxiv.org/pdf/2505.05020
Copy Paste: [[2505.05020]] Generative Models for Long Time Series: Approximately Equivariant Recurrent Network Structures for an Adjusted Training Scheme(https://arxiv.org/abs/2505.05020)
Keywords: generative
Abstract: We present a simple yet effective generative model for time series data based on a Variational Autoencoder (VAE) with recurrent layers, referred to as the Recurrent Variational Autoencoder with Subsequent Training (RVAE-ST). Our method introduces an adapted training scheme that progressively increases the sequence length, addressing the challenge recurrent layers typically face when modeling long sequences. By leveraging the recurrent architecture, the model maintains a constant number of parameters regardless of sequence length. This design encourages approximate time-shift equivariance and enables efficient modeling of long-range temporal dependencies. Rather than introducing a fundamentally new architecture, we show that a carefully composed combination of known components can match or outperform state-of-the-art generative models on several benchmark datasets. Our model performs particularly well on time series that exhibit quasi-periodic structure,while remaining competitive on datasets with more irregular or partially non-stationary behavior. We evaluate its performance using ELBO, Fréchet Distance, discriminative scores, and visualizations of the learned embeddings.

Title: SOAP: Style-Omniscient Animatable Portraits

Authors: Tingting Liao, Yujian Zheng, Adilbek Karmanov, Liwen Hu, Leyang Jin, Yuliang Xiu, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05022
Pdf URL: https://arxiv.org/pdf/2505.05022
Copy Paste: [[2505.05022]] SOAP: Style-Omniscient Animatable Portraits(https://arxiv.org/abs/2505.05022)
Keywords: diffusion
Abstract: Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at this https URL.

Title: Split Matching for Inductive Zero-shot Semantic Segmentation

Authors: Jialei Chen, Xu Zheng, Dongyue Li, Chong Yi, Seigo Ito, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05023
Pdf URL: https://arxiv.org/pdf/2505.05023
Copy Paste: [[2505.05023]] Split Matching for Inductive Zero-shot Semantic Segmentation(https://arxiv.org/abs/2505.05023)
Keywords: segmentation
Abstract: Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great latent in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the setting of ZSS. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.

Title: G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness

Authors: Jaehyun Jeon, Janghan Yoon, Minsoo Kim, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05026
Pdf URL: https://arxiv.org/pdf/2505.05026
Copy Paste: [[2505.05026]] G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness(https://arxiv.org/abs/2505.05026)
Keywords: robust
Abstract: Evaluating user interface (UI) design effectiveness extends beyond aesthetics to influencing user behavior, a principle central to Design Persuasiveness. A/B testing is the predominant method for determining which UI variations drive higher user engagement, but it is costly and time-consuming. While recent Vision-Language Models (VLMs) can process automated UI analysis, current approaches focus on isolated design attributes rather than comparative persuasiveness-the key factor in optimizing user interactions. To address this, we introduce WiserUI-Bench, a benchmark designed for Pairwise UI Design Persuasiveness Assessment task, featuring 300 real-world UI image pairs labeled with A/B test results and expert rationales. Additionally, we propose G-FOCUS, a novel inference-time reasoning strategy that enhances VLM-based persuasiveness assessment by reducing position bias and improving evaluation accuracy. Experimental results show that G-FOCUS surpasses existing inference strategies in consistency and accuracy for pairwise UI evaluation. Through promoting VLM-driven evaluation of UI persuasiveness, our work offers an approach to complement A/B testing, propelling progress in scalable UI preference modeling and design optimization. Code and data will be released publicly.

Title: Dequantified Diffusion Schrödinger Bridge for Density Ratio Estimation

Authors: Wei Chen, Shigui Li, Jiacheng Li, Junmei Yang, John Paisley, Delu Zeng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.05034
Pdf URL: https://arxiv.org/pdf/2505.05034
Copy Paste: [[2505.05034]] Dequantified Diffusion Schrödinger Bridge for Density Ratio Estimation(https://arxiv.org/abs/2505.05034)
Keywords: robust, diffusion
Abstract: Density ratio estimation is fundamental to tasks involving $f$-divergences, yet existing methods often fail under significantly different distributions or inadequately overlap supports, suffering from the \textit{density-chasm} and the \textit{support-chasm} problems. Additionally, prior approaches yield divergent time scores near boundaries, leading to instability. We propose $\text{D}^3\text{RE}$, a unified framework for robust and efficient density ratio estimation. It introduces the Dequantified Diffusion-Bridge Interpolant (DDBI), which expands support coverage and stabilizes time scores via diffusion bridges and Gaussian dequantization. Building on DDBI, the Dequantified Schrödinger-Bridge Interpolant (DSBI) incorporates optimal transport to solve the Schrödinger bridge problem, enhancing accuracy and efficiency. Our method offers uniform approximation and bounded time scores in theory, and outperforms baselines empirically in mutual information and density estimation tasks.

Title: xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition

Authors: Mani Kumar Tellamekala, Shashank Jaiswal, Thomas Smith, Timur Alamev, Gary McKeown, Anthony Brown, Michel Valstar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05043
Pdf URL: https://arxiv.org/pdf/2505.05043
Copy Paste: [[2505.05043]] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition(https://arxiv.org/abs/2505.05043)
Keywords: robust
Abstract: Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant advancements in recent years, it still remains a challenge to build a robust and reliable system for naturalistic and in-the-wild facial expressive behaviour analysis in real time. This paper addresses two key challenges in building such a system: (1). The paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2). The difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. Toward addressing these challenges, we introduce xTrace, a robust tool for facial expressive behaviour analysis and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), our affect recognition model is trained on the largest facial affect video data set, containing ~450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable, but can also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild validation set composed of 50k videos, xTrace achieves 0.86 mean CCC and 0.13 mean absolute error values. We present a detailed error analysis of affect predictions from xTrace, illustrating (a). its ability to recognise emotions with high accuracy across most bins in the 2D emotion space, (b). its robustness to non-frontal head pose angles, and (c). a strong correlation between its uncertainty estimates and its accuracy.

Title: UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model

Authors: Timo Kaiser, Thomas Norrenbrock, Bodo Rosenhahn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05049
Pdf URL: https://arxiv.org/pdf/2505.05049
Copy Paste: [[2505.05049]] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model(https://arxiv.org/abs/2505.05049)
Keywords: segmentation
Abstract: The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.

Title: CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Authors: Manik Sheokand, Parth Sawant
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05063
Pdf URL: https://arxiv.org/pdf/2505.05063
Copy Paste: [[2505.05063]] CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts(https://arxiv.org/abs/2505.05063)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts consistently degrade Pass@1 performance compared to their English-only counterparts, with performance drops increasing under higher CMD levels for smaller models. CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.

Title: WaterDrum: Watermarking for Data-centric Unlearning Metric

Authors: Xinyang Lu, Xinyuan Niu, Gregory Kang Ruey Lau, Bui Thi Cam Nhung, Rachael Hwee Ling Sim, Fanyu Wen, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05064
Pdf URL: https://arxiv.org/pdf/2505.05064
Copy Paste: [[2505.05064]] WaterDrum: Watermarking for Data-centric Unlearning Metric(https://arxiv.org/abs/2505.05064)
Keywords: robust, watermark, large language model
Abstract: Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain set have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at this https URL and our new benchmark datasets are released at this https URL.

Title: Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization

Authors: Ajwad Abrar, Farzana Tabassum, Sabbir Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05070
Pdf URL: https://arxiv.org/pdf/2505.05070
Copy Paste: [[2505.05070]] Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization(https://arxiv.org/abs/2505.05070)
Keywords: large language model
Abstract: Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.

Title: Visual Affordances: Enabling Robots to Understand Object Functionality

Authors: Tommaso Apicella, Alessio Xompero, Andrea Cavallaro
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.05074
Pdf URL: https://arxiv.org/pdf/2505.05074
Copy Paste: [[2505.05074]] Visual Affordances: Enabling Robots to Understand Object Functionality(https://arxiv.org/abs/2505.05074)
Keywords: fair, segmentation
Abstract: Human-robot interaction for assistive technologies relies on the prediction of affordances, which are the potential actions a robot can perform on objects. Predicting object affordances from visual perception is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand-object interaction synthesis. In this work, we highlight the reproducibility issue in these redefinitions, making comparative benchmarks unfair and unreliable. To address this problem, we propose a unified formulation for visual affordance prediction, provide a comprehensive and systematic review of previous works highlighting strengths and limitations of methods and datasets, and analyse what challenges reproducibility. To favour transparency, we introduce the Affordance Sheet, a document to detail the proposed solution, the datasets, and the validation. As the physical properties of an object influence the interaction with the robot, we present a generic framework that links visual affordance prediction to the physical world. Using the weight of an object as an example for this framework, we discuss how estimating object mass can affect the affordance prediction. Our approach bridges the gap between affordance perception and robot actuation, and accounts for the complete information about objects of interest and how the robot interacts with them to accomplish its task.

Title: PIDiff: Image Customization for Personalized Identities with Diffusion Models

Authors: Jinyu Gu, Haipeng Liu, Meng Wang, Yang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05081
Pdf URL: https://arxiv.org/pdf/2505.05081
Copy Paste: [[2505.05081]] PIDiff: Image Customization for Personalized Identities with Diffusion Models(https://arxiv.org/abs/2505.05081)
Keywords: extraction, diffusion, generative
Abstract: Text-to-image generation for personalized identities aims at incorporating the specific identity into images using a text prompt and an identity image. Based on the powerful generative capabilities of DDPMs, many previous works adopt additional prompts, such as text embeddings and CLIP image embeddings, to represent the identity information, while they fail to disentangle the identity information and background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, previous works have combined the W+ space from StyleGAN with diffusion models, leveraging this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, the entanglement of identity and background information in in-the-wild images during training prevents accurate identity localization, resulting in severe semantic interference between identity and background. In this paper, we propose a novel fine-tuning-based diffusion model for personalized identities text-to-image generation, named PIDiff, which leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieves accurate feature extraction and localization. Style editing can also be achieved by PIDiff through preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. Through the combination of the proposed cross-attention block and parameter optimization strategy, PIDiff preserves the identity information and maintains the generation capability for in-the-wild images of the pre-trained model during inference. Our experimental results validate the effectiveness of our method in this task.

Title: ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model

Authors: Sagnik Bhattacharya, Abhiram R. Gorle, Ahmed Mohsin, Ahsan Bilal, Connor Ding, Amit Kumar Singh Yadav, Tsachy Weissman
Subjects: cs.LG, cs.IT, math.PR
Abstract URL: https://arxiv.org/abs/2505.05082
Pdf URL: https://arxiv.org/pdf/2505.05082
Copy Paste: [[2505.05082]] ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model(https://arxiv.org/abs/2505.05082)
Keywords: diffusion, generative
Abstract: Existing methods for generative modeling of discrete data, such as symbolic music tokens, face two primary challenges: (1) they either embed discrete inputs into continuous state-spaces or (2) rely on variational losses that only approximate the true negative log-likelihood. Previous efforts have individually targeted these limitations. While information-theoretic Gaussian diffusion models alleviate the suboptimality of variational losses, they still perform modeling in continuous domains. In this work, we introduce the Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), which simultaneously addresses both limitations by directly operating in a discrete state-space via a Poisson diffusion process inspired by photon arrival processes in camera sensors. We introduce a novel Poisson Reconstruction Loss (PRL) and derive an exact relationship between PRL and the true negative log-likelihood, thereby eliminating the need for approximate evidence lower bounds. Experiments conducted on the Lakh MIDI symbolic music dataset and the CIFAR-10 image benchmark demonstrate that ItDPDM delivers significant improvements, reducing test NLL by up to 80% compared to prior baselines, while also achieving faster convergence.

Title: Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction

Authors: Xiaowei Zhu, Yubing Ren, Yanan Cao, Xixun Lin, Fang Fang, Yangxi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05084
Pdf URL: https://arxiv.org/pdf/2505.05084
Copy Paste: [[2505.05084]] Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction(https://arxiv.org/abs/2505.05084)
Keywords: attack, robust, large language model
Abstract: The rapid advancement of large language models has raised significant concerns regarding their potential misuse by malicious actors. As a result, developing effective detectors to mitigate these risks has become a critical priority. However, most existing detection methods focus excessively on detection accuracy, often neglecting the societal risks posed by high false positive rates (FPRs). This paper addresses this issue by leveraging Conformal Prediction (CP), which effectively constrains the upper bound of FPRs. While directly applying CP constrains FPRs, it also leads to a significant reduction in detection performance. To overcome this trade-off, this paper proposes a Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction (MCP), which both enforces the FPR constraint and improves detection performance. This paper also introduces RealDet, a high-quality dataset that spans a wide range of domains, ensuring realistic calibration and enabling superior detection performance when combined with MCP. Empirical evaluations demonstrate that MCP effectively constrains FPRs, significantly enhances detection performance, and increases robustness against adversarial attacks across multiple detectors and datasets.

Title: Beyond Low-rank Decomposition: A Shortcut Approach for Efficient On-Device Learning

Authors: Le-Trung Nguyen, Ael Quelennec, Van-Tam Nguyen, Enzo Tartaglione
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05086
Pdf URL: https://arxiv.org/pdf/2505.05086
Copy Paste: [[2505.05086]] Beyond Low-rank Decomposition: A Shortcut Approach for Efficient On-Device Learning(https://arxiv.org/abs/2505.05086)
Keywords: privacy
Abstract: On-device learning has emerged as a promising direction for AI development, particularly because of its potential to reduce latency issues and mitigate privacy risks associated with device-server communication, while improving energy efficiency. Despite these advantages, significant memory and computational constraints still represent major challenges for its deployment. Drawing on previous studies on low-rank decomposition methods that address activation memory bottlenecks in backpropagation, we propose a novel shortcut approach as an alternative. Our analysis and experiments demonstrate that our method can reduce activation memory usage, even up to $120.09\times$ compared to vanilla training, while also reducing overall training FLOPs up to $1.86\times$ when evaluated on traditional benchmarks.

Title: DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions

Authors: Shashank Agnihotri, Amaan Ansari, Annika Dackermann, Fabian Rösch, Margret Keuper
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05091
Pdf URL: https://arxiv.org/pdf/2505.05091
Copy Paste: [[2505.05091]] DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions(https://arxiv.org/abs/2505.05091)
Keywords: attack, robust
Abstract: Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: this https URL

Title: Balancing Client Participation in Federated Learning Using AoI

Authors: Alireza Javani, Zhiying Wang
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2505.05099
Pdf URL: https://arxiv.org/pdf/2505.05099
Copy Paste: [[2505.05099]] Balancing Client Participation in Federated Learning Using AoI(https://arxiv.org/abs/2505.05099)
Keywords: privacy, federate, fair
Abstract: Federated Learning (FL) offers a decentralized framework that preserves data privacy while enabling collaborative model training across distributed clients. However, FL faces significant challenges due to limited communication resources, statistical heterogeneity, and the need for balanced client participation. This paper proposes an Age of Information (AoI)-based client selection policy that addresses these challenges by minimizing load imbalance through controlled selection intervals. Our method employs a decentralized Markov scheduling policy, allowing clients to independently manage participation based on age-dependent selection probabilities, which balances client updates across training rounds with minimal central oversight. We provide a convergence proof for our method, demonstrating that it ensures stable and efficient model convergence. Specifically, we derive optimal parameters for the Markov selection model to achieve balanced and consistent client participation, highlighting the benefits of AoI in enhancing convergence stability. Through extensive simulations, we demonstrate that our AoI-based method, particularly the optimal Markov variant, improves convergence over the FedAvg selection approach across both IID and non-IID data settings by $7.5\%$ and up to $20\%$. Our findings underscore the effectiveness of AoI-based scheduling for scalable, fair, and efficient FL systems across diverse learning environments.

Title: MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models

Authors: Hongyang Zhu, Haipeng Liu, Bo Fu, Yang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05101
Pdf URL: https://arxiv.org/pdf/2505.05101
Copy Paste: [[2505.05101]] MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models(https://arxiv.org/abs/2505.05101)
Keywords: robust, diffusion, segmentation
Abstract: Multi-object editing aims to modify multiple objects or regions in complex scenes while preserving structural coherence. This task faces significant challenges in scenarios involving overlapping or interacting objects: (1) Inaccurate localization of target objects due to attention misalignment, leading to incomplete or misplaced edits; (2) Attribute-object mismatch, where color or texture changes fail to align with intended regions due to cross-attention leakage, creating semantic conflicts (\textit{e.g.}, color bleeding into non-target areas). Existing methods struggle with these challenges: approaches relying on global cross-attention mechanisms suffer from attention dilution and spatial interference between objects, while mask-based methods fail to bind attributes to geometrically accurate regions due to feature entanglement in multi-object scenarios. To address these limitations, we propose a training-free, inference-stage optimization approach that enables precise localized image manipulation in complex multi-object scenes, named MDE-Edit. MDE-Edit optimizes the noise latent feature in diffusion models via two key losses: Object Alignment Loss (OAL) aligns multi-layer cross-attention with segmentation masks for precise object positioning, and Color Consistency Loss (CCL) amplifies target attribute attention within masks while suppressing leakage to adjacent regions. This dual-loss design ensures localized and coherent multi-object edits. Extensive experiments demonstrate that MDE-Edit outperforms state-of-the-art methods in editing accuracy and visual quality, offering a robust solution for complex multi-object image manipulation tasks.

Title: A Weighted Byzantine Fault Tolerance Consensus Driven Trusted Multiple Large Language Models Network

Authors: Haoxiang Luo, Gang Sun, Yinqiu Liu, Dongcheng Zhao, Dusit Niyato, Hongfang Yu, Schahram Dustdar
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2505.05103
Pdf URL: https://arxiv.org/pdf/2505.05103
Copy Paste: [[2505.05103]] A Weighted Byzantine Fault Tolerance Consensus Driven Trusted Multiple Large Language Models Network(https://arxiv.org/abs/2505.05103)
Keywords: security, robust, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of applications. However, individual LLMs often produce inconsistent, biased, or hallucinated outputs due to limitations in their training corpora and model architectures. Recently, collaborative frameworks such as the Multi-LLM Network (MultiLLMN) have been introduced, enabling multiple LLMs to interact and jointly respond to user queries. Nevertheless, MultiLLMN architectures raise critical concerns regarding the reliability and security of the generated content, particularly in open environments where malicious or compromised LLMs may be present. Moreover, reliance on centralized coordination undermines system efficiency and introduces single points of failure. In this paper, we propose a novel Trusted MultiLLMN framework, driven by a Weighted Byzantine Fault Tolerance (WBFT) blockchain consensus mechanism, to ensure the reliability, security, and efficiency of multi-LLM collaboration. In WBFT, voting weights are adaptively assigned to each LLM based on its response quality and trustworthiness, incentivizing reliable behavior, and reducing the impact of malicious nodes. Extensive simulations demonstrate that WBFT significantly improves both consensus security and efficiency compared to classical and modern consensus mechanisms, particularly under wireless network conditions. Furthermore, our evaluations reveal that Trusted MultiLLMN supported by WBFT can deliver higher-quality and more credible responses than both single LLMs and conventional MultiLLMNs, thereby providing a promising path toward building robust, decentralized AI collaboration networks.

Title: Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

Authors: Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05111
Pdf URL: https://arxiv.org/pdf/2505.05111
Copy Paste: [[2505.05111]] Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders(https://arxiv.org/abs/2505.05111)
Keywords: large language model
Abstract: The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features only significantly reduces abilities in one language of LLMs, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.

Title: Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach

Authors: Xuyang Chen, Keyu Yan, Lin Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05126
Pdf URL: https://arxiv.org/pdf/2505.05126
Copy Paste: [[2505.05126]] Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach(https://arxiv.org/abs/2505.05126)
Keywords: diffusion
Abstract: Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions, providing a practical solution where online data collection is expensive or risky. However, offline RL often suffers from distribution shift, resulting in inaccurate evaluation and substantial overestimation on out-of-distribution (OOD) actions. To address this, existing approaches incorporate conservatism by indiscriminately discouraging all OOD actions, thereby hindering the agent's ability to generalize and exploit beneficial ones. In this paper, we propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions using the batch-optimal value function. Based on this evaluation, ADAC defines an advantage function to modulate the Q-function update, enabling more precise assessment of OOD action quality. We design a custom PointMaze environment and collect datasets to visually reveal that advantage modulation can effectively identify and select superior OOD actions. Extensive experiments show that ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark, with particularly clear margins on the more challenging tasks.

Title: Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation

Authors: Clara Tomasini, Javier Rodriguez-Puigvert, Dinora Polanco, Manuel Viñuales, Luis Riazuelo, Ana Cristina Murillo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05136
Pdf URL: https://arxiv.org/pdf/2505.05136
Copy Paste: [[2505.05136]] Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation(https://arxiv.org/abs/2505.05136)
Keywords: robust
Abstract: Purpose: Subglottic stenosis refers to the narrowing of the subglottis, the airway between the vocal cords and the trachea. Its severity is typically evaluated by estimating the percentage of obstructed airway. This estimation can be obtained from CT data or through visual inspection by experts exploring the region. However, visual inspections are inherently subjective, leading to less consistent and robust diagnoses. No public methods or datasets are currently available for automated evaluation of this condition from bronchoscopy video. Methods: We propose a pipeline for automated subglottic stenosis severity estimation during the bronchoscopy exploration, without requiring the physician to traverse the stenosed region. Our approach exploits the physical effect of illumination decline in endoscopy to segment and track the lumen and obtain a 3D model of the airway. This 3D model is obtained from a single frame and is used to measure the airway narrowing. Results: Our pipeline is the first to enable automated and robust subglottic stenosis severity measurement using bronchoscopy images. The results show consistency with ground-truth estimations from CT scans and expert estimations, and reliable repeatability across multiple estimations on the same patient. Our evaluation is performed on our new Subglottic Stenosis Dataset of real bronchoscopy procedures data. Conclusion: We demonstrate how to automate evaluation of subglottic stenosis severity using only bronchoscopy. Our approach can assist with and shorten diagnosis and monitoring procedures, with automated and repeatable estimations and less exploration time, and save radiation exposure to patients as no CT is required. Additionally, we release the first public benchmark for subglottic stenosis severity assessment.

Title: Research on Anomaly Detection Methods Based on Diffusion Models

Authors: Yi Chen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.05137
Pdf URL: https://arxiv.org/pdf/2505.05137
Copy Paste: [[2505.05137]] Research on Anomaly Detection Methods Based on Diffusion Models(https://arxiv.org/abs/2505.05137)
Keywords: security, robust, extraction, diffusion
Abstract: Anomaly detection is a fundamental task in machine learning and data mining, with significant applications in cybersecurity, industrial fault diagnosis, and clinical disease monitoring. Traditional methods, such as statistical modeling and machine learning-based approaches, often face challenges in handling complex, high-dimensional data distributions. In this study, we explore the potential of diffusion models for anomaly detection, proposing a novel framework that leverages the strengths of diffusion probabilistic models (DPMs) to effectively identify anomalies in both image and audio data. The proposed method models the distribution of normal data through a diffusion process and reconstructs input data via reverse diffusion, using a combination of reconstruction errors and semantic discrepancies as anomaly indicators. To enhance the framework's performance, we introduce multi-scale feature extraction, attention mechanisms, and wavelet-domain representations, enabling the model to capture fine-grained structures and global dependencies in the data. Extensive experiments on benchmark datasets, including MVTec AD and UrbanSound8K, demonstrate that our method outperforms state-of-the-art anomaly detection techniques, achieving superior accuracy and robustness across diverse data modalities. This research highlights the effectiveness of diffusion models in anomaly detection and provides a robust and efficient solution for real-world applications.

Title: Understanding In-context Learning of Addition via Activation Subspaces

Authors: Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05145
Pdf URL: https://arxiv.org/pdf/2505.05145
Copy Paste: [[2505.05145]] Understanding In-context Learning of Addition via Activation Subspaces(https://arxiv.org/abs/2505.05145)
Keywords: transformer
Abstract: To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We find that Llama-3-8B attains high accuracy on this task for a range of $k$, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.

Title: A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition

Authors: Hussain Ahmad, Qingyang Zeng, Jing Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05148
Pdf URL: https://arxiv.org/pdf/2505.05148
Copy Paste: [[2505.05148]] A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition(https://arxiv.org/abs/2505.05148)
Keywords: extraction
Abstract: The emergence of multimodal content, particularly text and images on social media, has positioned Multimodal Named Entity Recognition (MNER) as an increasingly important area of research within Natural Language Processing. Despite progress in high-resource languages such as English, MNER remains underexplored for low-resource languages like Urdu. The primary challenges include the scarcity of annotated multimodal datasets and the lack of standardized baselines. To address these challenges, we introduce the U-MNER framework and release the Twitter2015-Urdu dataset, a pioneering resource for Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. We establish benchmark baselines by evaluating both text-based and multimodal models on this dataset, providing comparative analyses to support future research on Urdu MNER. The U-MNER framework integrates textual and visual context using Urdu-BERT for text embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion Module to align and fuse information. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset, laying the groundwork for further MNER research in low-resource languages.

Title: FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning

Authors: Zhihao Zeng, Ziquan Fang, Wei Shao, Lu Chen, Yunjun Gao
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.05155
Pdf URL: https://arxiv.org/pdf/2505.05155
Copy Paste: [[2505.05155]] FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning(https://arxiv.org/abs/2505.05155)
Keywords: secure, privacy, protect, federate, large language model
Abstract: Trajectory data, which capture the movement patterns of people and vehicles over time and space, are crucial for applications like traffic optimization and urban planning. However, issues such as noise and incompleteness often compromise data quality, leading to inaccurate trajectory analyses and limiting the potential of these applications. While Trajectory Data Preparation (TDP) can enhance data quality, existing methods suffer from two key limitations: (i) they do not address data privacy concerns, particularly in federated settings where trajectory data sharing is prohibited, and (ii) they typically design task-specific models that lack generalizability across diverse TDP scenarios. To overcome these challenges, we propose FedTDP, a privacy-preserving and unified framework that leverages the capabilities of Large Language Models (LLMs) for TDP in federated environments. Specifically, we: (i) design a trajectory privacy autoencoder to secure data transmission and protect privacy, (ii) introduce a trajectory knowledge enhancer to improve model learning of TDP-related knowledge, enabling the development of TDP-oriented LLMs, and (iii) propose federated parallel optimization to enhance training efficiency by reducing data transmission and enabling parallel model training. Experiments on 6 real datasets and 10 mainstream TDP tasks demonstrate that FedTDP consistently outperforms 13 state-of-the-art baselines.

Title: Bandit Max-Min Fair Allocation

Authors: Tsubasa Harada, Shinji Ito, Hanna Sumita
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05169
Pdf URL: https://arxiv.org/pdf/2505.05169
Copy Paste: [[2505.05169]] Bandit Max-Min Fair Allocation(https://arxiv.org/abs/2505.05169)
Keywords: fair
Abstract: In this paper, we study a new decision-making problem called the bandit max-min fair allocation (BMMFA) problem. The goal of this problem is to maximize the minimum utility among agents with additive valuations by repeatedly assigning indivisible goods to them. One key feature of this problem is that each agent's valuation for each item can only be observed through the semi-bandit feedback, while existing work supposes that the item values are provided at the beginning of each round. Another key feature is that the algorithm's reward function is not additive with respect to rounds, unlike most bandit-setting problems. Our first contribution is to propose an algorithm that has an asymptotic regret bound of $O(m\sqrt{T}\ln T/n + m\sqrt{T \ln(mnT)})$, where $n$ is the number of agents, $m$ is the number of items, and $T$ is the time horizon. This is based on a novel combination of bandit techniques and a resource allocation algorithm studied in the literature on competitive analysis. Our second contribution is to provide the regret lower bound of $\Omega(m\sqrt{T}/n)$. When $T$ is sufficiently larger than $n$, the gap between the upper and lower bounds is a logarithmic factor of $T$.

Title: Stochastic Variational Propagation: Local, Scalable and Efficient Alternative to Backpropagation

Authors: Bojian Yin, Federico Corradi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05181
Pdf URL: https://arxiv.org/pdf/2505.05181
Copy Paste: [[2505.05181]] Stochastic Variational Propagation: Local, Scalable and Efficient Alternative to Backpropagation(https://arxiv.org/abs/2505.05181)
Keywords: transformer
Abstract: Backpropagation (BP) is the cornerstone of deep learning, but its reliance on global gradient synchronization limits scalability and imposes significant memory overhead. We propose Stochastic Variational Propagation (SVP), a scalable alternative that reframes training as hierarchical variational inference. SVP treats layer activations as latent variables and optimizes local Evidence Lower Bounds (ELBOs), enabling independent, local updates while preserving global coherence. However, directly applying KL divergence in layer-wise ELBOs risks inter-layer's representation collapse due to excessive compression. To prevent this, SVP projects activations into low-dimensional spaces via fixed random matrices, ensuring information preservation and representational diversity. Combined with a feature alignment loss for inter-layer consistency, SVP achieves competitive accuracy with BP across diverse architectures (MLPs, CNNs, Transformers) and datasets (MNIST to ImageNet), reduces memory usage by up to 4x, and significantly improves scalability. More broadly, SVP introduces a probabilistic perspective to deep representation learning, opening pathways toward more modular and interpretable neural network design.

Title: PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting

Authors: Elad Feldman, Jacob Shams, Dudi Biton, Alfred Chen, Shaoyuan Xie, Satoru Koda, Yisroel Mirsky, Asaf Shabtai, Yuval Elovici, Ben Nassi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05183
Pdf URL: https://arxiv.org/pdf/2505.05183
Copy Paste: [[2505.05183]] PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting(https://arxiv.org/abs/2505.05183)
Keywords: security, robust
Abstract: The safety of autonomous cars has come under scrutiny in recent years, especially after 16 documented incidents involving Teslas (with autopilot engaged) crashing into parked emergency vehicles (police cars, ambulances, and firetrucks). While previous studies have revealed that strong light sources often introduce flare artifacts in the captured image, which degrade the image quality, the impact of flare on object detection performance remains unclear. In this research, we unveil PaniCar, a digital phenomenon that causes an object detector's confidence score to fluctuate below detection thresholds when exposed to activated emergency vehicle lighting. This vulnerability poses a significant safety risk, and can cause autonomous vehicles to fail to detect objects near emergency vehicles. In addition, this vulnerability could be exploited by adversaries to compromise the security of advanced driving assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3, "manufacturer C", HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors (YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle lighting to understand the influence of various technical and environmental factors. We also evaluate four SOTA flare removal methods and show that their performance and latency are insufficient for real-time driving constraints. To mitigate this risk, we propose Caracetamol, a robust framework designed to enhance the resilience of object detectors against the effects of activated emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster RCNN, Caracetamol improves the models' average confidence of car detection by 0.20, the lower confidence bound by 0.33, and reduces the fluctuation range by 0.33. In addition, Caracetamol is capable of processing frames at a rate of between 30-50 FPS, enabling real-time ADAS car detection.

Title: Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

Authors: Wei Peng, Kang Liu, Jianchen Hu, Meng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05189
Pdf URL: https://arxiv.org/pdf/2505.05189
Copy Paste: [[2505.05189]] Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models(https://arxiv.org/abs/2505.05189)
Keywords: large language model
Abstract: Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to the biomedical image classification tasks in few shot scenarios. However, most of the current prompt learning methods only used the text prompts and ignored the particular structures (such as the complex anatomical structures and subtle pathological features) in the biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including the template-driven clinical prompts and the large language model (LLM)-driven domain-adapted prompts, then extracts the clinical knowledge from the domain-adapted prompts through the knowledge distillation technique. In designing the vision prompt, Biomed-DPT introduces the zero vector as a soft prompt to leverage attention re-weighting so that the focus on non-diagnostic regions and the recognition of non-critical pathological features are avoided. Biomed-DPT achieves an average classification accuracy of 66.14\% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06\% in base classes and 75.97\% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20\%, 3.78\%, and 8.04\%, respectively. Our code are available at \underline{this https URL}.

Title: Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Authors: Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2505.05190
Pdf URL: https://arxiv.org/pdf/2505.05190
Copy Paste: [[2505.05190]] Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks(https://arxiv.org/abs/2505.05190)
Keywords: attack, robust, watermark, large language model
Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages the vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods with only 0.88 USD per million tokens cost. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.

Title: Concept-Based Unsupervised Domain Adaptation

Authors: Xinyue Xu, Yueying Hu, Hui Tang, Yi Qin, Lu Mi, Hao Wang, Xiaomeng Li
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.05195
Pdf URL: https://arxiv.org/pdf/2505.05195
Copy Paste: [[2505.05195]] Concept-Based Unsupervised Domain Adaptation(https://arxiv.org/abs/2505.05195)
Keywords: robust, interpretability
Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.

Title: EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

Authors: Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05209
Pdf URL: https://arxiv.org/pdf/2505.05209
Copy Paste: [[2505.05209]] EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution(https://arxiv.org/abs/2505.05209)
Keywords: robust, diffusion, transformer
Abstract: Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

Title: HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

Authors: Xiaotong Yu, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05212
Pdf URL: https://arxiv.org/pdf/2505.05212
Copy Paste: [[2505.05212]] HQC-NBV: A Hybrid Quantum-Classical View Planning Approach(https://arxiv.org/abs/2505.05212)
Keywords: robust
Abstract: Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves up to 49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.

Title: Diffusion Model Quantization: A Review

Authors: Qian Zeng, Chenggong Hu, Mingli Song, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05215
Pdf URL: https://arxiv.org/pdf/2505.05215
Copy Paste: [[2505.05215]] Diffusion Model Quantization: A Review(https://arxiv.org/abs/2505.05215)
Keywords: diffusion, transformer, generative
Abstract: Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage this https URL.

Title: GFlowNets for Active Learning Based Resource Allocation in Next Generation Wireless Networks

Authors: Charbel Bou Chaaya, Mehdi Bennis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05224
Pdf URL: https://arxiv.org/pdf/2505.05224
Copy Paste: [[2505.05224]] GFlowNets for Active Learning Based Resource Allocation in Next Generation Wireless Networks(https://arxiv.org/abs/2505.05224)
Keywords: generative
Abstract: In this work, we consider the radio resource allocation problem in a wireless system with various integrated functionalities, such as communication, sensing and computing. We design suitable resource management techniques that can simultaneously cater to those heterogeneous requirements, and scale appropriately with the high-dimensional and discrete nature of the problem. We propose a novel active learning framework where resource allocation patterns are drawn sequentially, evaluated in the environment, and then used to iteratively update a surrogate model of the environment. Our method leverages a generative flow network (GFlowNet) to sample favorable solutions, as such models are trained to generate compositional objects proportionally to their training reward, hence providing an appropriate coverage of its modes. As such, GFlowNet generates diverse and high return resource management designs that update the surrogate model and swiftly discover suitable solutions. We provide simulation results showing that our method can allocate radio resources achieving 20% performance gains against benchmarks, while requiring less than half of the number of acquisition rounds.

Title: QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Authors: Mengze Hong, Wailing Ng, Di Jiang, Chen Jason Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05225
Pdf URL: https://arxiv.org/pdf/2505.05225
Copy Paste: [[2505.05225]] QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation(https://arxiv.org/abs/2505.05225)
Keywords: federate, large language model
Abstract: The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals the current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration with crowdsourcing mechanisms and suggest the opportunities for multi-domain RAG knowledge enhancement and vertical domain LLM training with Federated Learning.

Title: Does CLIP perceive art the same way we do?

Authors: Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.05229
Pdf URL: https://arxiv.org/pdf/2505.05229
Copy Paste: [[2505.05229]] Does CLIP perceive art the same way we do?(https://arxiv.org/abs/2505.05229)
Keywords: interpretability, generative
Abstract: CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it "see" the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.

Title: Latte: Transfering LLMs` Latent-level Knowledge for Few-shot Tabular Learning

Authors: Ruxue Shi, Hengrui Gu, Hangting Ye, Yiwei Dai, Xu Shen, Xin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05237
Pdf URL: https://arxiv.org/pdf/2505.05237
Copy Paste: [[2505.05237]] Latte: Transfering LLMs` Latent-level Knowledge for Few-shot Tabular Learning(https://arxiv.org/abs/2505.05237)
Keywords: extraction, large language model
Abstract: Few-shot tabular learning, in which machine learning models are trained with a limited amount of labeled data, provides a cost-effective approach to addressing real-world challenges. The advent of Large Language Models (LLMs) has sparked interest in leveraging their pre-trained knowledge for few-shot tabular learning. Despite promising results, existing approaches either rely on test-time knowledge extraction, which introduces undesirable latency, or text-level knowledge, which leads to unreliable feature engineering. To overcome these limitations, we propose Latte, a training-time knowledge extraction framework that transfers the latent prior knowledge within LLMs to optimize a more generalized downstream model. Latte enables general knowledge-guided downstream tabular learning, facilitating the weighted fusion of information across different feature values while reducing the risk of overfitting to limited labeled data. Furthermore, Latte is compatible with existing unsupervised pre-training paradigms and effectively utilizes available unlabeled samples to overcome the performance limitations imposed by an extremely small labeled dataset. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of Latte, establishing it as a state-of-the-art approach in this domain

Title: PADriver: Towards Personalized Autonomous Driving

Authors: Genghua Kou, Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Ziheng Zhang, Osamu Yoshie, Tiancai Wang, Ying Li, Xiangyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05240
Pdf URL: https://arxiv.org/pdf/2505.05240
Copy Paste: [[2505.05240]] PADriver: Towards Personalized Autonomous Driving(https://arxiv.org/abs/2505.05240)
Keywords: large language model
Abstract: In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoaggressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours videos with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.

Title: T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction

Authors: Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05271
Pdf URL: https://arxiv.org/pdf/2505.05271
Copy Paste: [[2505.05271]] T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction(https://arxiv.org/abs/2505.05271)
Keywords: extraction, fair, transformer
Abstract: Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.

Title: MTL-UE: Learning to Learn Nothing for Multi-Task Learning

Authors: Yi Yu, Song Xia, Siyuan Yang, Chenqi Kong, Wenhan Yang, Shijian Lu, Yap-Peng Tan, Alex C. Kot
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2505.05279
Pdf URL: https://arxiv.org/pdf/2505.05279
Copy Paste: [[2505.05279]] MTL-UE: Learning to Learn Nothing for Multi-Task Learning(https://arxiv.org/abs/2505.05279)
Keywords: attack, robust
Abstract: Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance which enhances the attack robustness greatly. Furthermore, MTL-UE is versatile with good supports for dense prediction tasks in MTL. It is also plug-and-play allowing integrating existing surrogate-dependent unlearnable methods with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.

Title: QUIC-Exfil: Exploiting QUIC's Server Preferred Address Feature to Perform Data Exfiltration Attacks

Authors: Thomas Grübl, Weijie Niu, Jan von der Assen, Burkhard Stiller
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.05292
Pdf URL: https://arxiv.org/pdf/2505.05292
Copy Paste: [[2505.05292]] QUIC-Exfil: Exploiting QUIC's Server Preferred Address Feature to Perform Data Exfiltration Attacks(https://arxiv.org/abs/2505.05292)
Keywords: secure, attack
Abstract: The QUIC protocol is now widely adopted by major tech companies and accounts for a significant fraction of today's Internet traffic. QUIC's multiplexing capabilities, encrypted headers, dynamic IP address changes, and encrypted parameter negotiations make the protocol not only more efficient, secure, and censorship-resistant, but also practically unmanageable by firewalls. This opens doors for attackers who may exploit certain traits of the QUIC protocol to perform targeted attacks, such as data exfiltration attacks. Whereas existing data exfiltration techniques, such as TLS and DNS-based exfiltration, can be detected on a firewall level, QUIC-based data exfiltration is more difficult to detect, since changes in IP addresses and ports are inherent to the protocol's normal behavior. To show the feasibility of a QUIC-based data exfiltration attack, we introduce a novel method leveraging the server preferred address feature of the QUIC protocol and, thus, allows an attacker to exfiltrate sensitive data from an infected machine to a malicious server, disguised as a server-side connection migration. The attack is implemented as a proof of concept tool in Rust. We evaluated the performance of five anomaly detection classifiers - Random Forest, Multi-Layer Perceptron, Support Vector Machine, Autoencoder, and Isolation Forest - trained on datasets collected from three network traffic scenarios. The classifiers were trained on over 700K benign and malicious QUIC packets and 786 connection migration events, but were unable to detect the data exfiltration attempts. Furthermore, post-analysis of the traffic captures did not reveal any identifiable fingerprint. As part of our evaluation, we also interviewed five leading firewall vendors and found that, as of today, no major firewall vendor implements functionality capable of distinguishing between benign and malicious QUIC connection migrations.

Title: Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design

Authors: Elena Musi, Nadin Kokciyan, Khalid Al-Khatib, Davide Ceolin, Emmanuelle Dietz, Klara Gutekunst, Annette Hautli-Janisz, Cristian Manuel Santibañez Yañez, Jodi Schneider, Jonas Scholz, Cor Steging, Jacky Visser, Henning Wachsmuth
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2505.05298
Pdf URL: https://arxiv.org/pdf/2505.05298
Copy Paste: [[2505.05298]] Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design(https://arxiv.org/abs/2505.05298)
Keywords: large language model
Abstract: In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking rather than replacing them. We introduce the concept of 'reasonable parrots' that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

Title: Scalable Chain of Thoughts via Elastic Reasoning

Authors: Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05315
Pdf URL: https://arxiv.org/pdf/2505.05315
Copy Paste: [[2505.05315]] Scalable Chain of Thoughts via Elastic Reasoning(https://arxiv.org/abs/2505.05315)
Keywords: robust
Abstract: Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases--thinking and solution--with independently allocated budgets. At test time, Elastic Reasoning prioritize that completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale.

Title: Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

Authors: Agnese Chiatti, Sara Bernardini, Lara Shibelski Godoy Piccolo, Viola Schiaffonati, Matteo Matteucci
Subjects: cs.CV, cs.AI, cs.CY, cs.HC, cs.RO
Abstract URL: https://arxiv.org/abs/2505.05318
Pdf URL: https://arxiv.org/pdf/2505.05318
Copy Paste: [[2505.05318]] Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects(https://arxiv.org/abs/2505.05318)
Keywords: protect
Abstract: The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting and informing users about when to trust these systems. This survey reviews studies on trust dynamics in user-VLM interactions, through a multi-disciplinary taxonomy encompassing different cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future VLM trust studies.

Title: Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery

Authors: Chintan B. Maniyar, Minakshi Kumar, Gengchen Mai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05321
Pdf URL: https://arxiv.org/pdf/2505.05321
Copy Paste: [[2505.05321]] Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery(https://arxiv.org/abs/2505.05321)
Keywords: robust, segmentation
Abstract: Accurate building segmentation from high-resolution RGB imagery remains challenging due to spectral similarity with non-building features, shadows, and irregular building geometries. In this study, we present a comprehensive deep learning framework for multiscale building segmentation using RGB aerial and satellite imagery with spatial resolutions ranging from 0.4m to 2.7m. We curate a diverse, multi-sensor dataset and introduce feature-augmented inputs by deriving secondary representations including Principal Component Analysis (PCA), Visible Difference Vegetation Index (VDVI), Morphological Building Index (MBI), and Sobel edge filters from RGB channels. These features guide a Res-U-Net architecture in learning complex spatial patterns more effectively. We also propose training policies incorporating layer freezing, cyclical learning rates, and SuperConvergence to reduce training time and resource usage. Evaluated on a held-out WorldView-3 image, our model achieves an overall accuracy of 96.5%, an F1-score of 0.86, and an Intersection over Union (IoU) of 0.80, outperforming existing RGB-based benchmarks. This study demonstrates the effectiveness of combining multi-resolution imagery, feature augmentation, and optimized training strategies for robust building segmentation in remote sensing applications.

Title: ICon: In-Context Contribution for Automatic Data Selection

Authors: Yixin Yang, Qingxiu Dong, Linli Yao, Fangwei Zhu, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05327
Pdf URL: https://arxiv.org/pdf/2505.05327
Copy Paste: [[2505.05327]] ICon: In-Context Contribution for Automatic Data Selection(https://arxiv.org/abs/2505.05327)
Keywords: large language model
Abstract: Data selection for instruction tuning is essential for improving the performance of Large Language Models (LLMs) and reducing training cost. However, existing automated selection methods either depend on computationally expensive gradient-based measures or manually designed heuristics, which may fail to fully exploit the intrinsic attributes of data. In this paper, we propose In-context Learning for Contribution Measurement (ICon), a novel gradient-free method that takes advantage of the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual indicators engineering. ICon offers a computationally efficient alternative to gradient-based methods and reduces human inductive bias inherent in heuristic-based approaches. ICon comprises three components and identifies high-contribution data by assessing performance shifts under implicit learning through ICL. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data outperform full datasets by 5.42% points and exceed the best performance of widely used selection methods by 2.06% points. We further analyze high-contribution samples selected by ICon, which show both diverse tasks and appropriate difficulty levels, rather than just the hardest ones.

Title: SUUM: Timestamp-based Nakamoto-style Blockchains are Vulnerable

Authors: Junjie Hu, Na Ruan
Subjects: cs.CR, cs.GT
Abstract URL: https://arxiv.org/abs/2505.05328
Pdf URL: https://arxiv.org/pdf/2505.05328
Copy Paste: [[2505.05328]] SUUM: Timestamp-based Nakamoto-style Blockchains are Vulnerable(https://arxiv.org/abs/2505.05328)
Keywords: security, attack, fair
Abstract: We introduce two advanced attack strategies, the Unrestricted Uncle Maker (UUM) Attack and the Staircase-Unrestricted Uncle Maker (SUUM) Attack, which fundamentally threaten the security of timestamp-based Nakamoto-style blockchains by inflicting permanent systemic harm. Unlike prior work that merely enhances adversarial rewards, these attacks exploit vulnerabilities in timestamp manipulation and fork selection rules to irreversibly destabilize blockchain fairness and incentive mechanisms. Specifically, the SUUM attack enables adversaries to persistently launch attacks at zero cost, eliminating constraints on block withholding and risk-free conditions, while systematically maximizing rewards through coordinated timestamp adjustments and strategic block release. Our analysis demonstrates that SUUM adversaries achieve disproportionate reward advantages over both UUM and the original Riskless Uncle Maker (RUM) Attack [CCS '23], with all three strategies surpassing honest mining. Crucially, SUUM's cost-free persistence allows adversaries to indefinitely drain rewards from honest participants by maintaining minimal difficulty risks through precise timestamp manipulation. This creates a self-reinforcing cycle: adversaries amplify their profits while suppressing honest returns, thereby permanently eroding the protocol's security assumptions. Through rigorous theoretical modeling and simulations, we validate how SUUM's combination of timestamp tampering, block withholding, and difficulty risk control enables unmitigated exploitation of consensus mechanisms. This work underscores the existential risks posed by timestamp-based Nakamoto-style protocols and advocates urgent countermeasures to ensure long-term stability.

Title: Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

Authors: Zunjie Zhu, Yan Zhao, Yihan Hu, Guoxiang Wang, Hai Qiu, Bolun Zheng, Chenggang Yan, Feng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05336
Pdf URL: https://arxiv.org/pdf/2505.05336
Copy Paste: [[2505.05336]] Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors(https://arxiv.org/abs/2505.05336)
Keywords: transformer
Abstract: The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.

Title: Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt

Authors: Jie Deng, Danfeng Hong, Chenyu Li, Naoto Yokoya
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.05367
Pdf URL: https://arxiv.org/pdf/2505.05367
Copy Paste: [[2505.05367]] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt(https://arxiv.org/abs/2505.05367)
Keywords: robust, segmentation
Abstract: We propose a novel joint framework by integrating super-resolution and segmentation, called JointSeg, which enables the generation of 1-meter ISA maps directly from freely available Sentinel-2 imagery. JointSeg was trained on multimodal cross-resolution inputs, offering a scalable and affordable alternative to traditional approaches. This synergistic design enables gradual resolution enhancement from 10m to 1m while preserving fine-grained spatial textures, and ensures high classification fidelity through effective cross-scale feature fusion. This method has been successfully applied to the Yangtze River Economic Belt (YREB), a region characterized by complex urban-rural patterns and diverse topography. As a result, a comprehensive ISA mapping product for 2021, referred to as ISA-1, was generated, covering an area of over 2.2 million square kilometers. Quantitative comparisons against the 10m ESA WorldCover and other benchmark products reveal that ISA-1 achieves an F1-score of 85.71%, outperforming bilinear-interpolation-based segmentation by 9.5%, and surpassing other ISA datasets by 21.43%-61.07%. In densely urbanized areas (e.g., Suzhou, Nanjing), ISA-1 reduces ISA overestimation through improved discrimination of green spaces and water bodies. Conversely, in mountainous regions (e.g., Ganzi, Zhaotong), it identifies significantly more ISA due to its enhanced ability to detect fragmented anthropogenic features such as rural roads and sparse settlements, demonstrating its robustness across diverse landscapes. Moreover, we present biennial ISA maps from 2017 to 2023, capturing spatiotemporal urbanization dynamics across representative cities. The results highlight distinct regional growth patterns: rapid expansion in upstream cities, moderate growth in midstream regions, and saturation in downstream metropolitan areas.

Title: Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks

Authors: Kejie Zhao, Wenjia Hua, Aiersi Tuerhong, Luziwei Leng, Yuxin Ma, Qinghua Guo
Subjects: cs.CV, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2505.05375
Pdf URL: https://arxiv.org/pdf/2505.05375
Copy Paste: [[2505.05375]] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks(https://arxiv.org/abs/2505.05375)
Keywords: robust
Abstract: Recently, spiking neural networks (SNNs), deployed on neuromorphic chips, provide highly efficient solutions on edge devices in different scenarios. However, their ability to adapt to distribution shifts after deployment has become a crucial challenge. Online test-time adaptation (OTTA) offers a promising solution by enabling models to dynamically adjust to new data distributions without requiring source data or labeled target samples. Nevertheless, existing OTTA methods are largely designed for traditional artificial neural networks and are not well-suited for SNNs. To address this gap, we propose a low-power, neuromorphic chip-friendly online test-time adaptation framework, aiming to enhance model generalization under distribution shifts. The proposed approach is called Threshold Modulation (TM), which dynamically adjusts the firing threshold through neuronal dynamics-inspired normalization, being more compatible with neuromorphic hardware. Experimental results on benchmark datasets demonstrate the effectiveness of this method in improving the robustness of SNNs against distribution shifts while maintaining low computational cost. The proposed method offers a practical solution for online test-time adaptation of SNNs, providing inspiration for the design of future neuromorphic chips. The demo code is available at this http URL.

Title: GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

Authors: Rachmadio Noval Lazuardi, Artem Sevastopolsky, Egor Zakharov, Matthias Niessner, Vanessa Sklyarova
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05376
Pdf URL: https://arxiv.org/pdf/2505.05376
Copy Paste: [[2505.05376]] GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans(https://arxiv.org/abs/2505.05376)
Keywords: extraction, diffusion
Abstract: We propose a novel method that reconstructs hair strands directly from colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair strand reconstruction is a fundamental problem in computer vision and graphics that can be used for high-fidelity digital avatar synthesis, animation, and AR/VR applications. However, accurately recovering hair strands from raw scan data remains challenging due to human hair's complex and fine-grained structure. Existing methods typically rely on RGB captures, which can be sensitive to the environment and can be a challenging domain for extracting the orientation of guiding strands, especially in the case of challenging hairstyles. To reconstruct the hair purely from the observed geometry, our method finds sharp surface features directly on the scan and estimates strand orientation through a neural 2D line detector applied to the renderings of scan shading. Additionally, we incorporate a diffusion prior trained on a diverse set of synthetic hair scans, refined with an improved noise schedule, and adapted to the reconstructed contents via a scan-specific text prompt. We demonstrate that this combination of supervision signals enables accurate reconstruction of both simple and intricate hairstyles without relying on color information. To facilitate further research, we introduce Strands400, the largest publicly available dataset of hair strands with detailed surface geometry extracted from real-world data, which contains reconstructed hair strands from the scans of 400 subjects.

Title: Denoising Diffusion Probabilistic Models for Coastal Inundation Forecasting

Authors: Kazi Ashik Islam, Zakaria Mehrab, Mahantesh Halappanavar, Henning Mortveit, Sridhar Katragadda, Jon Derek Loftis, Madhav Marathe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05381
Pdf URL: https://arxiv.org/pdf/2505.05381
Copy Paste: [[2505.05381]] Denoising Diffusion Probabilistic Models for Coastal Inundation Forecasting(https://arxiv.org/abs/2505.05381)
Keywords: diffusion
Abstract: Coastal flooding poses significant risks to communities, necessitating fast and accurate forecasting methods to mitigate potential damage. To approach this problem, we present DIFF-FLOOD, a probabilistic spatiotemporal forecasting method designed based on denoising diffusion models. DIFF-FLOOD predicts inundation level at a location by taking both spatial and temporal context into account. It utilizes inundation levels at neighboring locations and digital elevation data as spatial context. Inundation history from a context time window, together with additional co-variates are used as temporal context. Convolutional neural networks and cross-attention mechanism are then employed to capture the spatiotemporal dynamics in the data. We trained and tested DIFF-FLOOD on coastal inundation data from the Eastern Shore of Virginia, a region highly impacted by coastal flooding. Our results show that, DIFF-FLOOD outperforms existing forecasting methods in terms of prediction performance (6% to 64% improvement in terms of two performance metrics) and scalability.

Title: EDmamba: A Simple yet Effective Event Denoising Method with State Space Model

Authors: Ciyu Ruan, Zihang Gong, Ruishan Guo, Jingao Xu, Xinlei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05391
Pdf URL: https://arxiv.org/pdf/2505.05391
Copy Paste: [[2505.05391]] EDmamba: A Simple yet Effective Event Denoising Method with State Space Model(https://arxiv.org/abs/2505.05391)
Keywords: robust, extraction, transformer
Abstract: Event cameras excel in high-speed vision due to their high temporal resolution, high dynamic range, and low power consumption. However, as dynamic vision sensors, their output is inherently noisy, making efficient denoising essential to preserve their ultra-low latency and real-time processing capabilities. Existing event denoising methods struggle with a critical dilemma: computationally intensive approaches compromise the sensor's high-speed advantage, while lightweight methods often lack robustness across varying noise levels. To address this, we propose a novel event denoising framework based on State Space Models (SSMs). Our approach represents events as 4D event clouds and includes a Coarse Feature Extraction (CFE) module that extracts embedding features from both geometric and polarity-aware subspaces. The model is further composed of two essential components: A Spatial Mamba (S-SSM) that models local geometric structures and a Temporal Mamba (T-SSM) that captures global temporal dynamics, efficiently propagating spatiotemporal features across events. Experiments demonstrate that our method achieves state-of-the-art accuracy and efficiency, with 88.89K parameters, 0.0685s per 100K events inference time, and a 0.982 accuracy score, outperforming Transformer-based methods by 2.08% in denoising accuracy and 36X faster.

Title: PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model

Authors: Zhang Zhang, Chao Sun, Chao Yue, Da Wen, Tianze Wang, Jianghao Leng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05397
Pdf URL: https://arxiv.org/pdf/2505.05397
Copy Paste: [[2505.05397]] PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model(https://arxiv.org/abs/2505.05397)
Keywords: transformer
Abstract: Serving the Intelligent Transport System (ITS) and Vehicle-to-Everything (V2X) tasks, roadside perception has received increasing attention in recent years, as it can extend the perception range of connected vehicles and improve traffic safety. However, roadside point cloud oriented 3D object detection has not been effectively explored. To some extent, the key to the performance of a point cloud detector lies in the receptive field of the network and the ability to effectively utilize the scene context. The recent emergence of Mamba, based on State Space Model (SSM), has shaken up the traditional convolution and transformers that have long been the foundational building blocks, due to its efficient global receptive field. In this work, we introduce Mamba to pillar-based roadside point cloud perception and propose a framework based on Cross-stage State-space Group (CSG), called PillarMamba. It enhances the expressiveness of the network and achieves efficient computation through cross-stage feature fusion. However, due to the limitations of scan directions, state space model faces local connection disrupted and historical relationship forgotten. To address this, we propose the Hybrid State-space Block (HSB) to obtain the local-global context of roadside point cloud. Specifically, it enhances neighborhood connections through local convolution and preserves historical memory through residual attention. The proposed method outperforms the state-of-the-art methods on the popular large scale roadside benchmark: DAIR-V2X-I. The code will be released soon.

Title: Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?

Authors: Valeria Pastorino, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05406
Pdf URL: https://arxiv.org/pdf/2505.05406
Copy Paste: [[2505.05406]] Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?(https://arxiv.org/abs/2505.05406)
Keywords: large language model
Abstract: Framing in media critically shapes public perception by selectively emphasizing some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.

Title: Crosslingual Reasoning through Test-Time Scaling

Authors: Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05408
Pdf URL: https://arxiv.org/pdf/2505.05408
Copy Paste: [[2505.05408]] Crosslingual Reasoning through Test-Time Scaling(https://arxiv.org/abs/2505.05408)
Keywords: large language model
Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLM's CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potentials, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

Title: Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It

Authors: Marvin F. da Silva, Felix Dangel, Sageev Oore
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05409
Pdf URL: https://arxiv.org/pdf/2505.05409
Copy Paste: [[2505.05409]] Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It(https://arxiv.org/abs/2505.05409)
Keywords: transformer
Abstract: The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.

Title: TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Authors: Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05422
Pdf URL: https://arxiv.org/pdf/2505.05422
Copy Paste: [[2505.05422]] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation(https://arxiv.org/abs/2505.05422)
Keywords: transformer, generative
Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at this https URL.

Title: TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering

Authors: Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05423
Pdf URL: https://arxiv.org/pdf/2505.05423
Copy Paste: [[2505.05423]] TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering(https://arxiv.org/abs/2505.05423)
Keywords: large language model
Abstract: The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation (MT) as being superior to experienced professional human translation. In the long run, this bias could result in a permanent decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based question-answering (QA) framework designed specifically for literary translation evaluation. TransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation (ACC-EQ and Kendall's tau) and surpassing the best state-of-the-art (SOTA) metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, TransProQA approaches human-level evaluation performance comparable to trained linguistic annotators. It demonstrates broad applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free literary evaluation metric and a valuable tool for evaluating texts that require local processing due to copyright or ethical considerations.

Title: Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Authors: Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, Zhiyuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05427
Pdf URL: https://arxiv.org/pdf/2505.05427
Copy Paste: [[2505.05427]] Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data(https://arxiv.org/abs/2505.05427)
Keywords: robust, large language model
Abstract: Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

Title: clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05445
Pdf URL: https://arxiv.org/pdf/2505.05445
Copy Paste: [[2505.05445]] clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations(https://arxiv.org/abs/2505.05445)
Keywords: robust, large language model
Abstract: The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation-either focusing on a single user simulator or a specific system design-limiting the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

Title: Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Authors: Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05446
Pdf URL: https://arxiv.org/pdf/2505.05446
Copy Paste: [[2505.05446]] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding(https://arxiv.org/abs/2505.05446)
Keywords: robust
Abstract: Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TiKZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-theart MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github. com/Euphoria16/DocMark.

Title: UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections

Authors: Fatima Haouari, Carolina Scarton, Nicolò Faggiani, Nikolaos Nikolaidis, Bonka Kotseva, Ibrahim Abu Farha, Jens Linge, Kalina Bontcheva
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2505.05459
Pdf URL: https://arxiv.org/pdf/2505.05459
Copy Paste: [[2505.05459]] UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections(https://arxiv.org/abs/2505.05459)
Keywords: large language model
Abstract: Misleading narratives play a crucial role in shaping public opinion during elections, as they can influence how voters perceive candidates and political parties. This entails the need to detect these narratives accurately. To address this, we introduce the first taxonomy of common misleading narratives that circulated during recent elections in Europe. Based on this taxonomy, we construct and analyse UKElectionNarratives: the first dataset of human-annotated misleading narratives which circulated during the UK General Elections in 2019 and 2024. We also benchmark Pre-trained and Large Language Models (focusing on GPT-4o), studying their effectiveness in detecting election-related misleading narratives. Finally, we discuss potential use cases and make recommendations for future research directions using the proposed codebook and dataset.

Title: Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05464
Pdf URL: https://arxiv.org/pdf/2505.05464
Copy Paste: [[2505.05464]] Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging(https://arxiv.org/abs/2505.05464)
Keywords: large language model
Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore to compose perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.

Title: ComPO: Preference Alignment via Comparison Oracles

Authors: Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05465
Pdf URL: https://arxiv.org/pdf/2505.05465
Copy Paste: [[2505.05465]] ComPO: Preference Alignment via Comparison Oracles(https://arxiv.org/abs/2505.05465)
Keywords: large language model
Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by the noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on comparison oracles and provide the convergence guarantee for its basic scheme. Second, we improve our method using some heuristics and conduct the experiments to demonstrate the flexibility and compatibility of practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margin, which complements the recent findings in \citet{Razin-2025-Unintentional}.

Title: StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

Authors: Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05467
Pdf URL: https://arxiv.org/pdf/2505.05467
Copy Paste: [[2505.05467]] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant(https://arxiv.org/abs/2505.05467)
Keywords: large language model
Abstract: We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.

Title: Generating Physically Stable and Buildable LEGO Designs from Text

Authors: Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05469
Pdf URL: https://arxiv.org/pdf/2505.05469
Copy Paste: [[2505.05469]] Generating Physically Stable and Buildable LEGO Designs from Text(https://arxiv.org/abs/2505.05469)
Keywords: large language model
Abstract: We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: this https URL.

Title: Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Authors: Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05472
Pdf URL: https://arxiv.org/pdf/2505.05472
Copy Paste: [[2505.05472]] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation(https://arxiv.org/abs/2505.05472)
Keywords: diffusion
Abstract: Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.

Title: DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Authors: Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05473
Pdf URL: https://arxiv.org/pdf/2505.05473
Copy Paste: [[2505.05473]] DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion(https://arxiv.org/abs/2505.05473)
Keywords: robust, diffusion, transformer
Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.

Title: 3D Scene Generation: A Survey

Authors: Beichen Wen, Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05474
Pdf URL: https://arxiv.org/pdf/2505.05474
Copy Paste: [[2505.05474]] 3D Scene Generation: A Survey(https://arxiv.org/abs/2505.05474)
Keywords: diffusion, generative
Abstract: 3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: this https URL.

Title: SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation

Authors: Yonwoo Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05475
Pdf URL: https://arxiv.org/pdf/2505.05475
Copy Paste: [[2505.05475]] SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation(https://arxiv.org/abs/2505.05475)
Keywords: diffusion, generative
Abstract: Creating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision due to the inherent difficulty of reconstructing complete 3D information from a single viewpoint. Current approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods produce high-quality results but require multiple views or video sequences, while video diffusion models can generate animations from single images but struggle with consistency and identity preservation. We present SVAD, a novel approach that addresses these limitations by leveraging complementary strengths of existing techniques. Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars. Comprehensive evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA) single-image methods in maintaining identity consistency and fine details across novel poses and viewpoints, while enabling real-time rendering capabilities. Through our data augmentation pipeline, we overcome the dependency on dense monocular or multi-view training data typically required by traditional 3DGS approaches. Extensive quantitative, qualitative comparisons show our method achieves superior performance across multiple metrics against baseline models. By effectively combining the generative power of diffusion models with both the high-quality results and rendering efficiency of 3DGS, our work establishes a new approach for high-fidelity avatar generation from a single image input.