2025-05-23

Title: Multilinear subspace learning for person re-identification based fusion of high order tensor features

Authors: Ammar Chouchane, Mohcene Bessaoudi, Hamza Kheddar, Abdelmalik Ouamane, Tiago Vieira, Mahmoud Hassaballah
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2505.15825
Pdf URL: https://arxiv.org/pdf/2505.15825
Copy Paste: [[2505.15825]] Multilinear subspace learning for person re-identification based fusion of high order tensor features(https://arxiv.org/abs/2505.15825)
Keywords: robust, extraction
Abstract: Video surveillance image analysis and processing is a challenging field in computer vision, with one of its most difficult tasks being Person Re-Identification (PRe-ID). PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras, using a robust description of their pedestrian images. The success of recent research in PRe-ID is largely due to effective feature extraction and representation, as well as the powerful learning of these features to reliably discriminate between pedestrian images. To this end, two powerful features, Convolutional Neural Networks (CNN) and Local Maximal Occurrence (LOMO), are modeled on multidimensional data using the proposed method, High-Dimensional Feature Fusion (HDFF). Specifically, a new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor, even though their dimensions are not identical. To enhance the system's accuracy, we employ Tensor Cross-View Quadratic Analysis (TXQDA) for multilinear subspace learning, followed by cosine similarity for matching. TXQDA efficiently facilitates learning while reducing the high dimensionality inherent in high-order tensor data. The effectiveness of our approach is verified through experiments on three widely used PRe-ID datasets: VIPeR, GRID, and PRID450S. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art methods.
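
The matching step named in the abstract, cosine similarity between a probe and a gallery of candidates, can be sketched as follows. This is a minimal illustration only: the function names, identifiers, and feature values are hypothetical, and the vectors stand in for features that would already have been projected by a subspace method such as TXQDA, which is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_gallery(probe, gallery):
    """Return gallery ids ordered from most to least similar to the probe."""
    scores = {gid: cosine_similarity(probe, feat) for gid, feat in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy features (not taken from the paper).
probe = [1.0, 0.0, 1.0]
gallery = {
    "id_a": [0.9, 0.1, 1.1],
    "id_b": [-1.0, 0.5, 0.0],
    "id_c": [0.0, 1.0, 0.0],
}
ranking = rank_gallery(probe, gallery)
```

Because cosine similarity ignores vector length, only the direction of each feature vector affects the ranking.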

Title: Adaptive Tokenization: On the Hop-Overpriority Problem in Tokenized Graph Learning Models

Authors: Zhibiao Wang, Yunlong Zhou, Ziwei Zhang, Mengmei Zhang, Shirui Pan, Chunming Hu, Xiao Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15845
Pdf URL: https://arxiv.org/pdf/2505.15845
Copy Paste: [[2505.15845]] Adaptive Tokenization: On the Hop-Overpriority Problem in Tokenized Graph Learning Models(https://arxiv.org/abs/2505.15845)
Keywords: transformer, large language model
Abstract: Graph Transformers, leveraging the global attention to capture long-range dependencies in graph structures, have significantly advanced graph machine learning, but face prohibitive computational complexity. Tokenized Graph Learning Models (TGLMs) address this issue by converting graphs into ordered token lists for scalable processing. Besides, TGLMs also empower Large Language Models (LLMs) to handle text-attributed graphs more effectively and thus are also employed in Graph LLMs. However, existing TGLMs rely on hand-designed token lists and their adaptability to diverse graph learning scenarios remains unexplored. In this paper, we first conduct extensive empirical and theoretical preliminary studies for hand-designed token lists. Surprisingly, we identify an unexplored hop-overpriority problem: the common pre-defined token lists overemphasize nearby nodes and overwhelm the ability of TGLMs to balance local and global signals. This phenomenon is especially harmful for heterophilic graphs. To address this problem, we propose the Learnable Graph Token List (LGTL), a plug-and-play module to replace hand-designed token lists in TGLMs. Specifically, LGTL adaptively adjusts the weights across hops and prioritizes informative nodes within hops through a graph attention gate module and a selection module, respectively. In this way, contextually informative nodes can be adaptively emphasized for both homophilic and heterophilic graphs. Besides, we theoretically show that LGTL can address the hop-overpriority problem. Extensive experiments on benchmarks validate the efficacy of LGTL across both Graph Transformers and Graph LLM backbones.

Title: Generative AI for Autonomous Driving: A Review

Authors: Katharina Winter, Abhishek Vivekanandan, Rupert Polley, Yinzhe Shen, Christian Schlauch, Mohamed-Khalil Bouzidi, Bojan Derajic, Natalie Grabowsky, Annajoyce Mariani, Dennis Rochau, Giovanni Lucente, Harsh Yadav, Firas Mualla, Adam Molin, Sebastian Bernhard, Christian Wirth, Ömer Şahin Taş, Nadja Klein, Fabian B. Flohr, Hanno Gottschalk
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.15863
Pdf URL: https://arxiv.org/pdf/2505.15863
Copy Paste: [[2505.15863]] Generative AI for Autonomous Driving: A Review(https://arxiv.org/abs/2505.15863)
Keywords: robust, interpretability, diffusion, transformer, generative
Abstract: Generative AI (GenAI) is rapidly advancing the field of Autonomous Driving (AD), extending beyond traditional applications in text, image, and video generation. We explore how generative models can enhance automotive tasks, such as static map creation, dynamic scenario generation, trajectory forecasting, and vehicle motion planning. By examining multiple generative approaches ranging from Variational Autoencoders (VAEs) through Generative Adversarial Networks (GANs) and Invertible Neural Networks (INNs) to Generative Transformers (GTs) and Diffusion Models (DMs), we highlight and compare their capabilities and limitations for AD-specific applications. Additionally, we discuss hybrid methods integrating conventional techniques with generative approaches, and emphasize their improved adaptability and robustness. We also identify relevant datasets and outline open research questions to guide future developments in GenAI. Finally, we discuss three core challenges: safety, interpretability, and real-time capabilities, and present recommendations for image generation, dynamic scenario generation, and planning.

Title: How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

Authors: Ingeol Baek, Hwan Chang, Sunghyun Ryu, Hwanhee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15865
Pdf URL: https://arxiv.org/pdf/2505.15865
Copy Paste: [[2505.15865]] How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads(https://arxiv.org/abs/2505.15865)
Keywords: interpretability
Abstract: Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.

Title: SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

Authors: Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15867
Pdf URL: https://arxiv.org/pdf/2505.15867
Copy Paste: [[2505.15867]] SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval(https://arxiv.org/abs/2505.15867)
Keywords: robust, transformer
Abstract: Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.

Title: Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

Authors: Can Rong, Xin Zhang, Yanxin Xi, Hongjie Sui, Jingtao Ding, Yong Li
Subjects: cs.CV, cs.CY, eess.IV
Abstract URL: https://arxiv.org/abs/2505.15870
Pdf URL: https://arxiv.org/pdf/2505.15870
Copy Paste: [[2505.15870]] Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities(https://arxiv.org/abs/2505.15870)
Keywords: privacy, extraction, diffusion
Abstract: Commuting Origin-destination (OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any city of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation together. It has been released at this https URL.

Title: Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging

Authors: Shenghe Zheng, Hongzhi Wang, Chenyu Huang, Xiaohui Wang, Tao Chen, Jiayuan Fan, Shuyue Hu, Peng Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15875
Pdf URL: https://arxiv.org/pdf/2505.15875
Copy Paste: [[2505.15875]] Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging(https://arxiv.org/abs/2505.15875)
Keywords: data-free
Abstract: With more open-source models available for diverse tasks, model merging has gained attention by combining models into one, reducing training, storage, and inference costs. Current research mainly focuses on model merging for full fine-tuning, overlooking the popular LoRA. However, our empirical analysis reveals that: a) existing merging methods designed for full fine-tuning perform poorly on LoRA; b) LoRA modules show much larger parameter magnitude variance than full fine-tuned weights; c) greater parameter magnitude variance correlates with worse merging performance. Considering that large magnitude variances cause deviations in the distribution of the merged parameters, resulting in information loss and performance degradation, we propose a Decoupled and Orthogonal merging approach (DO-Merging). By separating parameters into magnitude and direction components and merging them independently, we reduce the impact of magnitude differences on the directional alignment of the merged models, thereby preserving task information. Furthermore, we introduce a data-free, layer-wise gradient descent method with orthogonal constraints to mitigate interference during the merging of direction components. We provide theoretical guarantees for both the decoupling and orthogonal components. We validate through extensive experiments across vision, language, and multi-modal domains that our proposed DO-Merging can achieve significantly higher performance than existing merging methods at a minimal cost. Notably, each component can be flexibly integrated with existing methods, offering near free-lunch improvements across tasks.
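
The idea of merging magnitude and direction separately can be illustrated with a toy sketch. It is not the DO-Merging algorithm: plain averaging of magnitudes and of unit directions is an assumption made here for illustration, the orthogonality-constrained gradient step from the abstract is omitted, and all names and values are hypothetical.

```python
import math


def decompose(vec):
    """Split a vector into (magnitude, unit direction)."""
    mag = math.sqrt(sum(x * x for x in vec))
    if mag == 0.0:
        return 0.0, [0.0] * len(vec)
    return mag, [x / mag for x in vec]


def merge_naive(vectors):
    """Element-wise average of the raw parameter vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def merge_decoupled(vectors):
    """Average magnitudes and unit directions separately, then recombine.

    Assumption for this sketch: simple means for both components.
    """
    parts = [decompose(v) for v in vectors]
    mean_mag = sum(m for m, _ in parts) / len(parts)
    dim = len(vectors[0])
    summed_dir = [sum(d[i] for _, d in parts) for i in range(dim)]
    _, mean_dir = decompose(summed_dir)  # renormalize the combined direction
    return [mean_mag * x for x in mean_dir]


# Two hypothetical task vectors with very different magnitudes.
task_a = [10.0, 0.0]
task_b = [0.0, 1.0]
naive = merge_naive([task_a, task_b])          # dominated by task_a's direction
decoupled = merge_decoupled([task_a, task_b])  # both directions weighted equally
```

In the naive average the large-magnitude vector dictates the merged direction, whereas the decoupled version gives each direction equal weight and restores the scale afterwards.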

Title: Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Authors: Siting Li, Xiang Gao, Simon Shaolei Du
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15877
Pdf URL: https://arxiv.org/pdf/2505.15877
Copy Paste: [[2505.15877]] Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval(https://arxiv.org/abs/2505.15877)
Keywords: large language model
Abstract: While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
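
For readers unfamiliar with the Recall@5 metric quoted above, the sketch below shows one common way to compute it. The benchmark's exact definition is not given in the abstract, so treating recall as the fraction of relevant images found in the top k is an assumption; the function names and image ids are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items that appear among the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, k=5):
    """Average recall@k over a list of (ranked_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in results) / len(results)


# Hypothetical retrieval results for two queries.
results = [
    (["img3", "img7", "img1", "img9", "img2", "img5"], {"img1"}),  # hit at rank 3
    (["img4", "img8", "img6", "img0", "img2", "img5"], {"img5"}),  # hit at rank 6
]
score = mean_recall_at_k(results, k=5)
```

With a single relevant image per query, this definition coincides with the top-k hit rate.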

Title: GRIT: Teaching MLLMs to Think with Images

Authors: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15879
Pdf URL: https://arxiv.org/pdf/2505.15879
Copy Paste: [[2505.15879]] GRIT: Teaching MLLMs to Think with Images(https://arxiv.org/abs/2505.15879)
Keywords: robust
Abstract: Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.

Title: Challenger: Affordable Adversarial Driving Video Generation

Authors: Zhiyuan Xu, Bohan Li, Huan-ang Gao, Mingju Gao, Yong Chen, Ming Liu, Chenxu Yan, Hang Zhao, Shuo Feng, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15880
Pdf URL: https://arxiv.org/pdf/2505.15880
Copy Paste: [[2505.15880]] Challenger: Affordable Adversarial Driving Video Generation(https://arxiv.org/abs/2505.15880)
Keywords: diffusion
Abstract: Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios (including cut-ins, sudden lane changes, tailgating, and blind spot intrusions) and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.

Title: Is (Selective) Round-To-Nearest Quantization All You Need?

Authors: Alex Kogan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15909
Pdf URL: https://arxiv.org/pdf/2505.15909
Copy Paste: [[2505.15909]] Is (Selective) Round-To-Nearest Quantization All You Need?(https://arxiv.org/abs/2505.15909)
Keywords: large language model
Abstract: Quantization has become a necessary tool for serving ever-larger Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique and has been around since well before LLMs surged to the forefront of machine learning (ML) research. Yet, it has been largely dismissed by recent and more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but also that its token generation throughput can be better than, and its accuracy similar to, more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.
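
Round-to-nearest quantization is simple enough to sketch in a few lines. The version below is a generic symmetric, per-group variant written for illustration; it is not the Marlin-based implementation described in the abstract, and the function names, group contents, and bit widths are assumptions.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # largest positive code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() sends exact .5 ties to the nearest even integer.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def rtn_dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute reconstruction error for a given bit width."""
    codes, scale = rtn_quantize(weights, bits)
    restored = rtn_dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weight group.
weights = [0.7, -0.33, 0.12, 0.0]
codes, scale = rtn_quantize(weights, bits=4)
err_4bit = max_error(weights, bits=4)
err_8bit = max_error(weights, bits=8)  # a higher-precision format for the same group
```

Raising `bits` for selected groups mirrors, in spirit, the selective precision increase the abstract mentions: the reconstruction error is bounded by half the scale, and the scale shrinks as the bit width grows.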

Title: BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law

Authors: Juvenal Domingos Júnior, Augusto Faria, E. Seiti de Oliveira, Erick de Brito, Matheus Teotonio, Andre Assumpção, Diedre Carmo, Roberto Lotufo, Jayr Pereira
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15916
Pdf URL: https://arxiv.org/pdf/2505.15916
Copy Paste: [[2505.15916]] BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law(https://arxiv.org/abs/2505.15916)
Keywords: segmentation
Abstract: This paper presents BR-TaxQA-R, a novel dataset designed to support question answering with references in the context of Brazilian personal income tax law. The dataset contains 715 questions from the 2024 official Q&A document published by Brazil's Internal Revenue Service, enriched with statutory norms and administrative rulings from the Conselho Administrativo de Recursos Fiscais (CARF). We implement a Retrieval-Augmented Generation (RAG) pipeline using OpenAI embeddings for searching and GPT-4o-mini for answer generation. We compare different text segmentation strategies and benchmark our system against commercial tools such as ChatGPT and this http URL using RAGAS-based metrics. Results show that our custom RAG pipeline outperforms commercial systems in Response Relevancy, indicating stronger alignment with user queries, while commercial models achieve higher scores in Factual Correctness and fluency. These findings highlight a trade-off between legally grounded generation and linguistic fluency. Crucially, we argue that human expert evaluation remains essential to ensure the legal validity of AI-generated answers in high-stakes domains such as taxation. BR-TaxQA-R is publicly available at this https URL.

Title: Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Authors: Aliakbar Nafar, Kristen Brent Venable, Zijun Cui, Parisa Kordjamshidi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15918
Pdf URL: https://arxiv.org/pdf/2505.15918
Copy Paste: [[2505.15918]] Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization(https://arxiv.org/abs/2505.15918)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. This paper investigates using probabilistic knowledge inherent in LLMs to derive probability estimates for statements concerning events and their interrelationships captured via a Bayesian Network (BN). Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from minimal data, significantly reducing systematic biases. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with small amounts of real-world data. Additionally, we evaluate several prompting strategies for eliciting probabilistic knowledge from LLMs and establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.
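
One standard way to let an elicited distribution act as an expert prior over sparse data is Dirichlet pseudo-counts. The abstract does not say how the authors combine the two, so the sketch below is an assumption made for illustration; the function name, prior values, counts, and prior strength are all hypothetical.

```python
def posterior_mean(counts, prior, strength):
    """Blend observed counts with a prior distribution via Dirichlet pseudo-counts.

    `strength` is the total number of pseudo-observations granted to the prior.
    """
    total = sum(counts) + strength
    return [(c + strength * p) / total for c, p in zip(counts, prior)]


# Hypothetical LLM-elicited P(X | parents) for one conditional probability table row.
llm_prior = [0.2, 0.8]
# Hypothetical tiny data sample for the same row.
counts = [3, 1]

mle = [c / sum(counts) for c in counts]  # estimate from data alone
smoothed = posterior_mean(counts, llm_prior, strength=6.0)
```

With only four observations the data-only estimate is 0.75 for the first state, while the blended estimate is pulled toward the prior; as more data arrive the counts dominate.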

Title: Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

Authors: Dong Won Lee, Hae Won Park, Cynthia Breazeal, Louis-Philippe Morency
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15922
Pdf URL: https://arxiv.org/pdf/2505.15922
Copy Paste: [[2505.15922]] Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition(https://arxiv.org/abs/2505.15922)
Keywords: large language model
Abstract: We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.

Title: AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning

Authors: Morteza Alizadeh, Mehrdad Oveisi, Sonya Falahati, Ghazal Mousavi, Mohsen Alambardar Meybodi, Somayeh Sadat Mehrnia, Ilker Hacihaliloglu, Arman Rahmim, Mohammad R. Salmanpour
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15931
Pdf URL: https://arxiv.org/pdf/2505.15931
Copy Paste: [[2505.15931]] AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning(https://arxiv.org/abs/2505.15931)
Keywords: robust, segmentation
Abstract: Machine learning (ML) models rely heavily on consistent and accurate performance metrics to evaluate and compare their effectiveness. However, existing libraries often suffer from fragmentation, inconsistent implementations, and insufficient data validation protocols, leading to unreliable results. Existing libraries have often been developed independently and without adherence to a unified standard, particularly concerning the specific tasks they aim to support. As a result, each library tends to adopt its own conventions for metric computation, input/output formatting, error handling, and data validation protocols. This lack of standardization leads to both implementation differences (ID) and reporting differences (RD), making it difficult to compare results across frameworks or ensure reliable evaluations. To address these issues, we introduce AllMetrics, an open-source unified Python library designed to standardize metric evaluation across diverse ML tasks, including regression, classification, clustering, segmentation, and image-to-image translation. The library implements class-specific reporting for multi-class tasks through configurable parameters to cover all use cases, while incorporating task-specific parameters to resolve metric computation discrepancies across implementations. Various datasets from domains like healthcare, finance, and real estate were applied to our library and compared with Python, Matlab, and R components to identify which yield similar results. AllMetrics combines a modular Application Programming Interface (API) with robust input validation mechanisms to ensure reproducibility and reliability in model evaluation. This paper presents the design principles, architectural components, and empirical analyses demonstrating the ability to mitigate evaluation errors and to enhance the trustworthiness of ML workflows.

Title: MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Subjects: cs.LG, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.15946
Pdf URL: https://arxiv.org/pdf/2505.15946
Copy Paste: [[2505.15946]] MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding(https://arxiv.org/abs/2505.15946)
Keywords: interpretability, diffusion, generative
Abstract: Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: this https URL.

Title: Citation Parsing and Analysis with Language Models

Authors: Parth Sarin, Juan Pablo Alperin
Subjects: cs.CL, cs.DL, cs.SI
Abstract URL: https://arxiv.org/abs/2505.15948
Pdf URL: https://arxiv.org/pdf/2505.15948
Copy Paste: [[2505.15948]] Citation Parsing and Analysis with Language Models(https://arxiv.org/abs/2505.15948)
Keywords: robust
Abstract: A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.

Title: Training Step-Level Reasoning Verifiers with Formal Verification Tools

Authors: Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15960
Pdf URL: https://arxiv.org/pdf/2505.15960
Copy Paste: [[2505.15960]] Training Step-Level Reasoning Verifiers with Formal Verification Tools(https://arxiv.org/abs/2505.15960)
Keywords: large language model
Abstract: Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at this https URL.

Title: OViP: Online Vision-Language Preference Learning

Authors: Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15963
Pdf URL: https://arxiv.org/pdf/2505.15963
Copy Paste: [[2505.15963]] OViP: Online Vision-Language Preference Learning(https://arxiv.org/abs/2505.15963)
Keywords: diffusion
Abstract: Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.

Title: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Authors: Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15966
Pdf URL: https://arxiv.org/pdf/2505.15966
Copy Paste: [[2505.15966]] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning(https://arxiv.org/abs/2505.15966)
Keywords: large language model
Abstract: Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

Title: Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders

Authors: Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15970
Pdf URL: https://arxiv.org/pdf/2505.15970
Copy Paste: [[2505.15970]] Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders(https://arxiv.org/abs/2505.15970)
Keywords: large language model
Abstract: The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.

Title: Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku

Authors: Anirudh Maiya, Razan Alghamdi, Maria Leonor Pacheco, Ashutosh Trivedi, Fabio Somenzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15993
Pdf URL: https://arxiv.org/pdf/2505.15993
Copy Paste: [[2505.15993]] Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku(https://arxiv.org/abs/2505.15993)
Keywords: large language model
Abstract: The success of Large Language Models (LLMs) in human-AI collaborative decision-making hinges on their ability to provide trustworthy, gradual, and tailored explanations. Solving complex puzzles, such as Sudoku, offers a canonical example of this collaboration, where clear and customized explanations often hold greater importance than the final solution. In this study, we evaluate the performance of five LLMs in solving and explaining 6x6 Sudoku puzzles. While one LLM demonstrates limited success in solving puzzles, none can explain the solution process in a manner that reflects strategic reasoning or intuitive problem-solving. These findings underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.

Title: Domain Adaptive Skin Lesion Classification via Conformal Ensemble of Vision Transformers

Authors: Mehran Zoravar, Shadi Alijani, Homayoun Najjaran
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2505.15997
Pdf URL: https://arxiv.org/pdf/2505.15997
Copy Paste: [[2505.15997]] Domain Adaptive Skin Lesion Classification via Conformal Ensemble of Vision Transformers(https://arxiv.org/abs/2505.15997)
Keywords: robust, transformer
Abstract: Exploring the trustworthiness of deep learning models is crucial, especially in critical domains such as medical imaging decision support systems. Conformal prediction has emerged as a rigorous means of providing deep learning models with reliable uncertainty estimates and safety guarantees. However, conformal prediction results face challenges due to the backbone model's struggles in domain-shifted scenarios, such as variations in different sources. To address this challenge, this paper proposes a novel framework termed Conformal Ensemble of Vision Transformers (CE-ViTs) designed to enhance image classification performance by prioritizing domain adaptation and model robustness, while accounting for uncertainty. The proposed method leverages an ensemble of vision transformer models in the backbone, trained on diverse datasets including HAM10000, Dermofit, and Skin Cancer ISIC datasets. This ensemble learning approach, calibrated through the combined mentioned datasets, aims to enhance domain adaptation through conformal learning. Experimental results underscore that the framework achieves a high coverage rate of 90.38%, representing an improvement of 9.95% compared to the HAM10000 model. This indicates a strong likelihood that the prediction set includes the true label compared to singular models. Ensemble learning in CE-ViTs significantly improves conformal prediction performance, increasing the average prediction set size for challenging misclassified samples from 1.86 to 3.075.
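
The coverage rate and prediction-set size reported above are standard conformal prediction quantities. As a rough illustration only, the sketch below shows generic split conformal prediction for a classifier; it is not the CE-ViTs method. The calibration probabilities, the `1 - p` nonconformity score, and the function names are all assumptions made for this example.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # Rank of the corrected quantile: ceil((n + 1) * (1 - alpha)), capped at n.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(scores)[k - 1]

def prediction_set(probs, qhat):
    """Labels whose nonconformity score (1 - probability) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]

# Hypothetical calibration data: probability the model gave to the TRUE label.
cal_true_probs = [0.9, 0.8, 0.95, 0.6, 0.7, 0.85, 0.4, 0.75, 0.92]
cal_scores = [1.0 - p for p in cal_true_probs]
qhat = conformal_quantile(cal_scores, alpha=0.1)

confident_set = prediction_set([0.7, 0.2, 0.1], qhat)    # one clear winner
ambiguous_set = prediction_set([0.5, 0.45, 0.05], qhat)  # two plausible labels
print(confident_set, ambiguous_set)
```

A larger set for an ambiguous input is the intended behavior here: the set grows until it is likely to contain the true label at the chosen level.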

Title: Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

Authors: Qiang Zhu, Kuan Lu, Menghao Huo, Yuxiao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16001
Pdf URL: https://arxiv.org/pdf/2505.16001
Copy Paste: [[2505.16001]] Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning(https://arxiv.org/abs/2505.16001)
Keywords: diffusion, transformer
Abstract: Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.

Title: Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions

Authors: Sasha Boguraev, Christopher Potts, Kyle Mahowald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16002
Pdf URL: https://arxiv.org/pdf/2505.16002
Copy Paste: [[2505.16002]] Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions(https://arxiv.org/abs/2505.16002)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LLMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LLMs learn to use. Our empirical focus is a set of English filler-gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based in Distributed Interchange Interventions, we show that LLMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors -- relating to frequency, filler type, and surrounding context -- that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LLMs can push linguistic theory forward.

Title: SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Authors: Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16003
Pdf URL: https://arxiv.org/pdf/2505.16003
Copy Paste: [[2505.16003]] SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models(https://arxiv.org/abs/2505.16003)
Keywords: large language model
Abstract: The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that SOTA calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and the public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.

Title: Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

Authors: Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16004
Pdf URL: https://arxiv.org/pdf/2505.16004
Copy Paste: [[2505.16004]] Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations(https://arxiv.org/abs/2505.16004)
Keywords: robust, interpretability, large language model
Abstract: Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.

Title: LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization

Authors: Wenrui Yu, Yiyi Chen, Johannes Bjerva, Sokol Kosta, Qiongxiu Li
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.16008
Pdf URL: https://arxiv.org/pdf/2505.16008
Copy Paste: [[2505.16008]] LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization(https://arxiv.org/abs/2505.16008)
Keywords: privacy, attack, robust
Abstract: We propose LAGO - Language Similarity-Aware Graph Optimization - a novel approach for few-shot cross-lingual embedding inversion attacks, addressing critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work in embedding inversion attacks that treat languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. By integrating syntactic and lexical similarity as edge constraints, our method enables collaborative parameter learning across related languages. Theoretically, we show this formulation generalizes prior approaches, such as ALGEN, which emerges as a special case when similarity constraints are relaxed. Our framework uniquely combines Frobenius-norm regularization with linear inequality or total variation constraints, ensuring robust alignment of cross-lingual embedding spaces even with extremely limited data (as few as 10 samples per language). Extensive experiments across multiple languages and embedding models demonstrate that LAGO substantially improves the transferability of attacks, with a 10-20% increase in Rouge-L score over baselines. This work establishes language similarity as a critical factor in inversion attack transferability, urging renewed focus on language-aware privacy-preserving multilingual embeddings.

Title: Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

Authors: Yash Saxena, Anpur Padia, Mandar S Chaudhary, Kalpa Gunaratna, Srinivasan Parthasarathy, Manas Gaur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16014
Pdf URL: https://arxiv.org/pdf/2505.16014
Copy Paste: [[2505.16014]] Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains(https://arxiv.org/abs/2505.16014)
Keywords: defense, attack, robust, interpretability, explainability
Abstract: Traditional Retrieval-Augmented Generation (RAG) pipelines rely on similarity-based retrieval and re-ranking, which depend on heuristics such as top-k, and lack explainability, interpretability, and robustness against adversarial content. To address this gap, we propose a novel method, METEORA, that replaces re-ranking in RAG with a rationale-driven selection approach. METEORA operates in two stages. First, a general-purpose LLM is preference-tuned to generate rationales conditioned on the input query using direct preference optimization. These rationales guide the evidence chunk selection engine, which selects relevant chunks in three stages: pairing individual rationales with corresponding retrieved chunks for local relevance, global selection with elbow detection for adaptive cutoff, and context expansion via neighboring chunks. This process eliminates the need for top-k heuristics. The rationales are also used for a consistency check using a Verifier LLM to detect and filter poisoned or misleading content for safe generation. The framework provides explainable and interpretable evidence flow by using rationales consistently across both selection and verification. Our evaluation across six datasets spanning legal, financial, and academic research domains shows that METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods. In adversarial settings, METEORA significantly improves the F1 score from 0.10 to 0.44 over the state-of-the-art perplexity-based defense baseline, demonstrating strong resilience to poisoning attacks. Code available at: this https URL
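
The abstract above mentions "elbow detection for adaptive cutoff" as a replacement for a fixed top-k. The abstract does not say which elbow criterion is used, so the sketch below substitutes a simple largest-gap heuristic as a stand-in. The function name and the relevance scores are hypothetical.

```python
def elbow_cutoff(scores):
    """Return how many top-scoring items to keep, cutting at the largest drop.

    Assumption: the 'elbow' is taken to be the largest gap between consecutive
    scores after sorting in descending order. Other criteria are possible.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Keep everything above the largest drop (first one wins on ties).
    return gaps.index(max(gaps)) + 1

# Hypothetical relevance scores for six retrieved chunks.
chunk_scores = [0.91, 0.88, 0.86, 0.41, 0.39, 0.35]
keep = elbow_cutoff(chunk_scores)
print(keep)
```

Unlike a fixed top-k, the number of chunks kept here depends on the shape of the score distribution for each query.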

Title: NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Authors: Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, Yulan He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16022
Pdf URL: https://arxiv.org/pdf/2505.16022
Copy Paste: [[2505.16022]] NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning(https://arxiv.org/abs/2505.16022)
Keywords: large language model
Abstract: Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.

Title: Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild

Authors: Sheshera Mysore, Debarati Das, Hancheng Cao, Bahareh Sarrafzadeh
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.16023
Pdf URL: https://arxiv.org/pdf/2505.16023
Copy Paste: [[2505.16023]] Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild(https://arxiv.org/abs/2505.16023)
Keywords: large language model
Abstract: As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users' intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.

Title: Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging

Authors: Weiguo Gao, Ming Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16024
Pdf URL: https://arxiv.org/pdf/2505.16024
Copy Paste: [[2505.16024]] Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging(https://arxiv.org/abs/2505.16024)
Keywords: diffusion, generative
Abstract: Diffusion trajectory distillation methods aim to accelerate sampling in diffusion models, which produce high-quality outputs but suffer from slow sampling speeds. These methods train a student model to approximate the multi-step denoising process of a pretrained teacher model in a single step, enabling one-shot generation. However, theoretical insights into the trade-off between different distillation strategies and generative quality remain limited, complicating their optimization and selection. In this work, we take a first step toward addressing this gap. Specifically, we reinterpret trajectory distillation as an operator merging problem in the linear regime, where each step of the teacher model is represented as a linear operator acting on noisy data. These operators admit a clear geometric interpretation as projections and rescalings corresponding to the noise schedule. During merging, signal shrinkage occurs as a convex combination of operators, arising from both discretization and limited optimization time of the student model. We propose a dynamic programming algorithm to compute the optimal merging strategy that maximally preserves signal fidelity. Additionally, we demonstrate the existence of a sharp phase transition in the optimal strategy, governed by data covariance structures. Our findings enhance the theoretical understanding of diffusion trajectory distillation and offer practical insights for improving distillation strategies.

Title: CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment

Authors: Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16025
Pdf URL: https://arxiv.org/pdf/2505.16025
Copy Paste: [[2505.16025]] CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment(https://arxiv.org/abs/2505.16025)
Keywords: robust, large language model
Abstract: Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g. compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.

Title: An Approach Towards Identifying Bangladeshi Leaf Diseases through Transfer Learning and XAI

Authors: Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Tashreef Muhammad, Shifat Islam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16033
Pdf URL: https://arxiv.org/pdf/2505.16033
Copy Paste: [[2505.16033]] An Approach Towards Identifying Bangladeshi Leaf Diseases through Transfer Learning and XAI(https://arxiv.org/abs/2505.16033)
Keywords: security, protect
Abstract: Leaf diseases are harmful conditions that affect the health, appearance and productivity of plants, leading to significant plant loss and negatively impacting farmers' livelihoods. These diseases cause visible symptoms such as lesions, color changes, and texture variations, making it difficult for farmers to manage plant health, especially in large or remote farms where expert knowledge is limited. The main motivation of this study is to provide an efficient and accessible solution for identifying plant leaf diseases in Bangladesh, where agriculture plays a critical role in food security. The objective of our research is to classify 21 distinct leaf diseases across six plants using deep learning models, improving disease detection accuracy while reducing the need for expert involvement. Deep Learning (DL) techniques, including CNN and Transfer Learning (TL) models like VGG16, VGG19, MobileNetV2, InceptionV3, ResNet50V2 and Xception, are used. VGG19 and Xception achieve the highest accuracies, with 98.90% and 98.66% respectively. Additionally, Explainable AI (XAI) techniques such as GradCAM, GradCAM++, LayerCAM, ScoreCAM and FasterScoreCAM are used to enhance transparency by highlighting the regions the models focused on during disease classification. This transparency ensures that farmers can understand the model's predictions and take necessary action. This approach not only improves disease management but also supports farmers in making informed decisions, leading to better plant protection and increased agricultural productivity.

Title: Equivariant Eikonal Neural Networks: Grid-Free, Scalable Travel-Time Prediction on Homogeneous Spaces

Authors: Alejandro García-Castellanos, David R. Wessels, Nicky J. van den Berg, Remco Duits, Daniël M. Pelt, Erik J. Bekkers
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16035
Pdf URL: https://arxiv.org/pdf/2505.16035
Copy Paste: [[2505.16035]] Equivariant Eikonal Neural Networks: Grid-Free, Scalable Travel-Time Prediction on Homogeneous Spaces(https://arxiv.org/abs/2505.16035)
Keywords: robust
Abstract: We introduce Equivariant Neural Eikonal Solvers, a novel framework that integrates Equivariant Neural Fields (ENFs) with Neural Eikonal Solvers. Our approach employs a single neural field where a unified shared backbone is conditioned on signal-specific latent variables - represented as point clouds in a Lie group - to model diverse Eikonal solutions. The ENF integration ensures equivariant mapping from these latent representations to the solution field, delivering three key benefits: enhanced representation efficiency through weight-sharing, robust geometric grounding, and solution steerability. This steerability allows transformations applied to the latent point cloud to induce predictable, geometrically meaningful modifications in the resulting Eikonal solution. By coupling these steerable representations with Physics-Informed Neural Networks (PINNs), our framework accurately models Eikonal travel-time solutions while generalizing to arbitrary Riemannian manifolds with regular group actions. This includes homogeneous spaces such as Euclidean, position-orientation, spherical, and hyperbolic manifolds. We validate our approach through applications in seismic travel-time modeling of 2D and 3D benchmark datasets. Experimental results demonstrate superior performance, scalability, adaptability, and user controllability compared to existing Neural Operator-based Eikonal solver methods.

Title: OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models

Authors: Burak Erinç Çetin, Yıldırım Özen, Elif Naz Demiryılmaz, Kaan Engür, Cagri Toraman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16036
Pdf URL: https://arxiv.org/pdf/2505.16036
Copy Paste: [[2505.16036]] OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models(https://arxiv.org/abs/2505.16036)
Keywords: robust, fair, generative, large language model
Abstract: Generative large language models present significant potential but also raise critical ethical concerns. Most studies focus on narrow ethical dimensions and cover a limited diversity of languages and models. To address these gaps, we conduct a broad ethical evaluation of 29 recent open-source large language models using a novel data collection covering four ethical aspects: robustness, reliability, safety, and fairness. We analyze model behavior in both a commonly used language, English, and a low-resource language, Turkish. Our aim is to provide a comprehensive ethical assessment and guide safer model development by filling existing gaps in evaluation breadth, language coverage, and model diversity. Our experimental results, based on LLM-as-a-Judge, reveal that optimization efforts for many open-source models appear to have prioritized safety and fairness and achieved good robustness, while reliability remains a concern. We demonstrate that ethical evaluation can be effectively conducted independently of the language used. In addition, models with larger parameter counts tend to exhibit better ethical performance, with Gemma and Qwen models demonstrating the most ethical behavior among those evaluated.

Title: An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection

Authors: Shuvashis Sarker, Shamim Rahim Refat, Faika Fairuj Preotee, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16039
Pdf URL: https://arxiv.org/pdf/2505.16039
Copy Paste: [[2505.16039]] An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection(https://arxiv.org/abs/2505.16039)
Keywords: transformer, generative
Abstract: The brain is a highly complex organ that manages many important tasks, including movement, memory and thinking. Brain-related conditions, like tumors and degenerative disorders, can be hard to diagnose and treat. Magnetic Resonance Imaging (MRI) serves as a key tool for identifying these conditions, offering high-resolution images of brain structures. Despite this, interpreting MRI scans can be complicated. This study tackles this challenge by conducting a comparative analysis of Vision Transformer (ViT) and Transfer Learning (TL) models such as VGG16, VGG19, ResNet50V2, and MobileNetV2 for classifying brain diseases using MRI data from a Bangladesh-based dataset. ViTs, known for their ability to capture global relationships in images, are particularly effective for medical imaging tasks. Transfer learning helps to mitigate data constraints by fine-tuning pre-trained models. Furthermore, Explainable AI (XAI) methods such as GradCAM, GradCAM++, LayerCAM, ScoreCAM, and Faster-ScoreCAM are employed to interpret model predictions. The results demonstrate that ViT surpasses transfer learning models, achieving a classification accuracy of 94.39%. The integration of XAI methods enhances model transparency, offering crucial insights to aid medical professionals in diagnosing brain diseases with greater precision.

Title: Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16056
Pdf URL: https://arxiv.org/pdf/2505.16056
Copy Paste: [[2505.16056]] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models(https://arxiv.org/abs/2505.16056)
Keywords: large language model
Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at this https URL .
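
Illustrative sketch (not from the paper): the abstract describes SCH as the optimal segment-level cache hit rate under a cache size limit. The toy code below assumes one plausible reading, in which a single fixed set of experts is cached for a whole segment and a "hit" is any expert activation that lands in that set. The function name, the data layout (one set of expert ids per token), and this definition of a hit are all assumptions made for illustration.

```python
from collections import Counter

def segment_cache_best_hit_rate(segment, cache_size):
    """Best hit rate achievable by one fixed cache over a segment (toy definition).

    segment: list of sets, the expert ids activated by each token (hypothetical data).
    cache_size: number of experts that fit in fast memory.
    """
    # Count how often each expert is activated across the segment.
    counts = Counter(e for token_experts in segment for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # For a cache that stays fixed over the segment, keeping the most
    # frequently activated experts maximizes the number of hits.
    hits = sum(c for _, c in counts.most_common(cache_size))
    return hits / total

segment = [{0, 1}, {0, 2}, {0, 1}, {1, 3}]
rate = segment_cache_best_hit_rate(segment, cache_size=2)
print(rate)
```

Under this reading, a model whose consecutive tokens reuse the same experts scores close to 1 with a small cache, while a model that scatters activations needs a larger cache for the same rate.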

Title: Mesh-free sparse identification of nonlinear dynamics

Authors: Mars Liyao Gao, J. Nathan Kutz, Bernat Font
Subjects: cs.LG, cs.AI, physics.data-an
Abstract URL: https://arxiv.org/abs/2505.16058
Pdf URL: https://arxiv.org/pdf/2505.16058
Copy Paste: [[2505.16058]] Mesh-free sparse identification of nonlinear dynamics(https://arxiv.org/abs/2505.16058)
Keywords: robust, diffusion
Abstract: Identifying the governing equations of a dynamical system is one of the most important tasks for scientific modeling. However, this procedure often requires high-quality spatio-temporal data uniformly sampled on structured grids. In this paper, we propose mesh-free SINDy, a novel algorithm which leverages the power of neural network approximation as well as auto-differentiation to identify governing equations from arbitrary sensor placements and non-uniform temporal data sampling. We show that mesh-free SINDy is robust to high noise levels and limited data while remaining computationally efficient. In our implementation, the training procedure is straightforward and nearly free of hyperparameter tuning, making mesh-free SINDy widely applicable to many scientific and engineering problems. In the experiments, we demonstrate its effectiveness on a series of PDEs including the Burgers' equation, the heat equation, the Korteweg-de Vries equation and the 2D advection-diffusion equation. We conduct detailed numerical experiments on all datasets, varying the noise levels and number of samples, and we also compare our approach to previous state-of-the-art methods. It is noteworthy that, even in high-noise and low-data scenarios, mesh-free SINDy demonstrates robust PDE discovery, achieving successful identification with up to 75% noise for the Burgers' equation using 5,000 samples and with as few as 100 samples and 1% noise. All of this is achieved within a training time of under one minute.

Title: Few-Shot Test-Time Optimization Without Retraining for Semiconductor Recipe Generation and Beyond

Authors: Shangding Gu, Donghao Ying, Ming Jin, Yu Joe Lu, Jun Wang, Javad Lavaei, Costas Spanos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16060
Pdf URL: https://arxiv.org/pdf/2505.16060
Copy Paste: [[2505.16060]] Few-Shot Test-Time Optimization Without Retraining for Semiconductor Recipe Generation and Beyond(https://arxiv.org/abs/2505.16060)
Keywords: robust
Abstract: We introduce Model Feedback Learning (MFL), a novel test-time optimization framework for optimizing inputs to pre-trained AI models or deployed hardware systems without requiring any retraining of the models or modifications to the hardware. In contrast to existing methods that rely on adjusting model parameters, MFL leverages a lightweight reverse model to iteratively search for optimal inputs, enabling efficient adaptation to new objectives under deployment constraints. This framework is particularly advantageous in real-world settings, such as semiconductor manufacturing recipe generation, where modifying deployed systems is often infeasible or cost-prohibitive. We validate MFL on semiconductor plasma etching tasks, where it achieves target recipe generation in just five iterations, significantly outperforming both Bayesian optimization and human experts. Beyond semiconductor applications, MFL also demonstrates strong performance in chemical processes (e.g., chemical vapor deposition) and electronic systems (e.g., wire bonding), highlighting its broad applicability. Additionally, MFL incorporates stability-aware optimization, enhancing robustness to process variations and surpassing conventional supervised learning and random search methods in high-dimensional control settings. By enabling few-shot adaptation, MFL provides a scalable and efficient paradigm for deploying intelligent control in real-world environments.

Title: Internal and External Impacts of Natural Language Processing Papers

Authors: Yu Zhang
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2505.16061
Pdf URL: https://arxiv.org/pdf/2505.16061
Copy Paste: [[2505.16061]] Internal and External Impacts of Natural Language Processing Papers(https://arxiv.org/abs/2505.16061)
Keywords: fair
Abstract: We investigate the impacts of NLP research published in top-tier conferences (i.e., ACL, EMNLP, and NAACL) from 1979 to 2024. By analyzing citations from research articles and external sources such as patents, media, and policy documents, we examine how different NLP topics are consumed both within the academic community and by the broader public. Our findings reveal that language modeling has the widest internal and external influence, while linguistic foundations have lower impacts. We also observe that internal and external impacts generally align, but topics like ethics, bias, and fairness receive significant attention in policy documents while drawing far fewer academic citations. Additionally, external domains exhibit distinct preferences, with patents focusing on practical NLP applications and media and policy documents engaging more with the societal implications of NLP models.

Title: Small Language Models in the Real World: Insights from Industrial Text Classification

Authors: Lujun Li, Lama Sleem, Niccolo' Gentile, Geoffrey Nichil, Radu State
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16078
Pdf URL: https://arxiv.org/pdf/2505.16078
Copy Paste: [[2505.16078]] Small Language Models in the Real World: Insights from Industrial Text Classification(https://arxiv.org/abs/2505.16078)
Keywords: transformer
Abstract: With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they are inefficient at inference time because of token-by-token generation, and their effectiveness in text classification tasks heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. This raises the question of whether smaller language models can handle text classification tasks effectively. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.

Title: BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators

Authors: KMA Solaiman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16081
Pdf URL: https://arxiv.org/pdf/2505.16081
Copy Paste: [[2505.16081]] BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators(https://arxiv.org/abs/2505.16081)
Keywords: interpretability
Abstract: We present BiasLab, a dataset of 300 political news articles annotated for perceived ideological bias. These articles were selected from a curated 900-document pool covering diverse political events and source biases. Each article is labeled by crowdworkers along two independent scales, assessing sentiment toward the Democratic and Republican parties, and enriched with rationale indicators. The annotation pipeline incorporates targeted worker qualification and was refined through pilot-phase analysis. We quantify inter-annotator agreement, analyze misalignment with source-level outlet bias, and organize the resulting labels into interpretable subsets. Additionally, we simulate annotation using schema-constrained GPT-4o, enabling direct comparison to human labels and revealing mirrored asymmetries, especially in misclassifying subtly right-leaning content. We define two modeling tasks: perception drift prediction and rationale type classification, and report baseline performance to illustrate the challenge of explainable bias detection. BiasLab's rich rationale annotations provide actionable interpretations that facilitate explainable modeling of political bias, supporting the development of transparent, socially aware NLP systems. We release the dataset, annotation schema, and modeling code to encourage research on human-in-the-loop interpretability and the evaluation of explanation effectiveness in real-world settings.

Title: Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Authors: Gagan Bhatia, Maxime Peyrard, Wei Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16088
Pdf URL: https://arxiv.org/pdf/2505.16088
Copy Paste: [[2505.16088]] Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning(https://arxiv.org/abs/2505.16088)
Keywords: robust, large language model
Abstract: Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 $\rightarrow$ 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year $\rightarrow$ month $\rightarrow$ day).
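
Illustrative sketch (not from the paper): the abstract does not give the formula for the date fragmentation ratio, so the code below uses a hypothetical stand-in. It counts the share of date components (year, month, day in a YYYYMMDD string) that do not survive as a single whole token. The function names and this exact definition are assumptions; only the example split `20250312 -> 202, 503, 12` comes from the abstract.

```python
def token_spans(tokens):
    """Return the (start, end) character span of each token in the joined string."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans

def fragmentation_ratio(tokens, component_lengths=(4, 2, 2)):
    """Share of date components not preserved as one whole token (toy definition).

    component_lengths defaults to YYYY, MM, DD for a YYYYMMDD string.
    """
    spans = token_spans(tokens)
    start, broken = 0, 0
    for length in component_lengths:
        # A component is preserved only if some token covers exactly its span.
        if (start, start + length) not in spans:
            broken += 1
        start += length
    return broken / len(component_lengths)

# The split quoted in the abstract: the year and month are broken, the day survives.
ratio = fragmentation_ratio(["202", "503", "12"])
print(ratio)
```

With this toy definition, a tokenizer that emits `2025`, `03`, `12` scores 0, and the split quoted in the abstract scores 2/3.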

Title: A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization

Authors: Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, Kaize Ding
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16094
Pdf URL: https://arxiv.org/pdf/2505.16094
Copy Paste: [[2505.16094]] A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization(https://arxiv.org/abs/2505.16094)
Keywords: large language model
Abstract: Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language and symbolic notations, with emerging extensions to incorporate multi-modal inputs. To advance the new field of LLMs for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at this https URL.

Title: Continually Self-Improving Language Models for Bariatric Surgery Question-Answering

Authors: Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16102
Pdf URL: https://arxiv.org/pdf/2505.16102
Copy Paste: [[2505.16102]] Continually Self-Improving Language Models for Bariatric Surgery Question-Answering(https://arxiv.org/abs/2505.16102)
Keywords: large language model
Abstract: While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery-related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)-based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.

Title: MPL: Multiple Programming Languages with Large Language Models for Information Extraction

Authors: Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng, Shikun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16107
Pdf URL: https://arxiv.org/pdf/2505.16107
Copy Paste: [[2505.16107]] MPL: Multiple Programming Languages with Large Language Models for Information Extraction(https://arxiv.org/abs/2505.16107)
Keywords: extraction, large language model
Abstract: Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.

Title: Extensible Post Quantum Cryptography Based Authentication

Authors: Homer A. Riva-Cambrin, Rahul Singh, Sanju Lama, Garnette R. Sutherland
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16112
Pdf URL: https://arxiv.org/pdf/2505.16112
Copy Paste: [[2505.16112]] Extensible Post Quantum Cryptography Based Authentication(https://arxiv.org/abs/2505.16112)
Keywords: secure, security
Abstract: Cryptography underpins the security of modern digital infrastructure, from cloud services to health data. However, many widely deployed systems will become vulnerable after the advent of scalable quantum computing. Although quantum-safe cryptographic primitives have been developed, such as lattice-based digital signature algorithms (DSAs) and key encapsulation mechanisms (KEMs), their unique structural and performance characteristics make them unsuitable for existing protocols. In this work, we introduce a quantum-safe single-shot protocol for machine-to-machine authentication and authorization that is specifically designed to leverage the strengths of lattice-based DSAs and KEMs. Operating entirely over insecure channels, this protocol enables the forward-secure establishment of tokens in constrained environments. By demonstrating how new quantum-safe cryptographic primitives can be incorporated into secure systems, this study lays the groundwork for scalable, resilient, and future-proof identity infrastructures in a quantum-enabled world.

Title: Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools

Authors: Panagiotis Lymperopoulos, Vasanth Sarathy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16113
Pdf URL: https://arxiv.org/pdf/2505.16113
Copy Paste: [[2505.16113]] Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools(https://arxiv.org/abs/2505.16113)
Keywords: large language model
Abstract: Modern Large Language Models (LLMs) often require external tools, such as machine learning classifiers or knowledge retrieval systems, to provide accurate answers in domains where their pre-trained knowledge is insufficient. This integration of LLMs with external tools expands their utility but also introduces a critical challenge: determining the trustworthiness of responses generated by the combined system. In high-stakes applications, such as medical decision-making, it is essential to assess the uncertainty of both the LLM's generated text and the tool's output to ensure the reliability of the final response. However, existing uncertainty quantification methods do not account for the tool-calling scenario, where both the LLM and external tool contribute to the overall system's uncertainty. In this work, we present a novel framework for modeling tool-calling LLMs that quantifies uncertainty by jointly considering the predictive uncertainty of the LLM and the external tool. We extend previous methods for uncertainty quantification over token sequences to this setting and propose efficient approximations that make uncertainty computation practical for real-world applications. We evaluate our framework on two new synthetic QA datasets, derived from well-known machine learning datasets, which require tool-calling for accurate answers. Additionally, we apply our method to retrieval-augmented generation (RAG) systems and conduct a proof-of-concept experiment demonstrating the effectiveness of our uncertainty metrics in scenarios where external information retrieval is needed. Our results show that the framework is effective in enhancing trust in LLM-based systems, especially in cases where the LLM's internal knowledge is insufficient and external tools are required.

Title: A Generic Framework for Conformal Fairness

Authors: Aditya T. Vadlamani, Anutam Srinivasan, Pranav Maneriker, Ali Payani, Srinivasan Parthasarathy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16115
Pdf URL: https://arxiv.org/pdf/2505.16115
Copy Paste: [[2505.16115]] A Generic Framework for Conformal Fairness(https://arxiv.org/abs/2505.16115)
Keywords: fair
Abstract: Conformal Prediction (CP) is a popular method for uncertainty quantification with machine learning models. While conformal prediction provides probabilistic guarantees regarding the coverage of the true label, these guarantees are agnostic to the presence of sensitive attributes within the dataset. In this work, we formalize Conformal Fairness, a notion of fairness using conformal predictors, and provide a theoretically well-founded algorithm and associated framework to control for the gaps in coverage between different sensitive groups. Our framework leverages the exchangeability assumption (implicit to CP) rather than the typical IID assumption, allowing us to apply the notion of Conformal Fairness to data types and tasks that are not IID, such as graph data. Experiments were conducted on graph and tabular datasets to demonstrate that the algorithm can control fairness-related gaps in addition to coverage aligned with theoretical expectations.
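
Illustrative sketch (not from the paper): to make "gaps in coverage between different sensitive groups" concrete, the code below runs standard split conformal prediction on toy nonconformity scores and then measures coverage per group. It only measures the gap; it does not implement the paper's algorithm for controlling it. The scores, group labels, and function names are hypothetical.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def coverage_by_group(test_scores, groups, threshold):
    """Fraction of test points whose true-label score is within the threshold, per group."""
    covered, totals = {}, {}
    for score, group in zip(test_scores, groups):
        totals[group] = totals.get(group, 0) + 1
        covered[group] = covered.get(group, 0) + (score <= threshold)
    return {g: covered[g] / totals[g] for g in totals}

# Toy nonconformity scores of the true label (hypothetical values).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_quantile(cal_scores, alpha=0.25)

test_scores = [0.2, 0.5, 0.6, 0.9, 0.95, 0.3, 0.85, 0.7]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
coverage = coverage_by_group(test_scores, groups, threshold)
gap = max(coverage.values()) - min(coverage.values())
print(threshold, coverage, gap)
```

The marginal guarantee of conformal prediction holds over the pooled data, so per-group coverage can differ, as it does in this toy example.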

Title: Semiotic Reconstruction of Destination Expectation Constructs An LLM-Driven Computational Paradigm for Social Media Tourism Analytics

Authors: Haotian Lan, Yao Gao, Yujun Cheng, Wei Yuan, Kun Wang
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2505.16118
Pdf URL: https://arxiv.org/pdf/2505.16118
Copy Paste: [[2505.16118]] Semiotic Reconstruction of Destination Expectation Constructs An LLM-Driven Computational Paradigm for Social Media Tourism Analytics(https://arxiv.org/abs/2505.16118)
Keywords: extraction
Abstract: Social media's rise establishes user-generated content (UGC) as pivotal for travel decisions, yet analytical methods lack scalability. This study introduces a dual-method LLM framework: unsupervised expectation extraction from UGC paired with survey-informed supervised fine-tuning. Findings reveal leisure/social expectations drive engagement more than foundational natural/emotional factors. By establishing LLMs as precision tools for expectation quantification, we advance tourism analytics methodology and propose targeted strategies for experience personalization and social travel promotion. The framework's adaptability extends to consumer behavior research, demonstrating computational social science's transformative potential in marketing optimization.

Title: Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

Authors: Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16122
Pdf URL: https://arxiv.org/pdf/2505.16122
Copy Paste: [[2505.16122]] Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning(https://arxiv.org/abs/2505.16122)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the $E^3$ metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in $E^3$. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at this http URL.
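
Illustrative sketch (not from the paper): the abstract says token budgets are allocated to sub-questions based on estimated complexity, but does not specify the scheduling rule. The code below shows the simplest stand-in, a proportional split with largest-remainder rounding. The function name, the complexity scores, and the proportional rule itself are assumptions, not the paper's adaptive scheduler.

```python
def allocate_budget(complexities, total_tokens):
    """Split a token budget across sub-questions in proportion to complexity.

    complexities: positive scores, one per sub-question (hypothetical estimates).
    total_tokens: overall token budget for the query.
    """
    total = sum(complexities)
    if total <= 0:
        raise ValueError("complexities must sum to a positive value")
    raw = [total_tokens * c / total for c in complexities]
    budgets = [int(r) for r in raw]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total_tokens - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets

budgets = allocate_budget([1, 2, 5], total_tokens=400)
print(budgets)
```

Compared with a fixed per-step budget, a split like this gives the hardest sub-question the largest share while keeping the total unchanged.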

Title: KoBALT: Korean Benchmark For Advanced Linguistic Tasks

Authors: Hyopil Shin, Sangah Lee, Dongjun Jang, Wooseok Song, Jaeyoon Kim, Chaeyoung Oh, Hyemi Jo, Youngchae Ahn, Sihyun Oh, Hyohyeong Chang, Sunkyoung Kim, Jinsik Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16125
Pdf URL: https://arxiv.org/pdf/2505.16125
Copy Paste: [[2505.16125]] KoBALT: Korean Benchmark For Advanced Linguistic Tasks(https://arxiv.org/abs/2505.16125)
Keywords: robust, large language model
Abstract: We introduce KoBALT (Korean Benchmark for Advanced Linguistic Tasks), a comprehensive linguistically-motivated benchmark comprising 700 multiple-choice questions spanning 24 phenomena across five linguistic domains: syntax, semantics, pragmatics, phonetics/phonology, and morphology. KoBALT is designed to advance the evaluation of large language models (LLMs) in Korean, a morphologically rich language, by addressing the limitations of conventional benchmarks that often lack linguistic depth and typological grounding. It introduces a suite of expert-curated, linguistically motivated questions with minimal n-gram overlap with standard Korean corpora, substantially mitigating the risk of data contamination and allowing a more robust assessment of true language understanding. Our evaluation of 20 contemporary LLMs reveals significant performance disparities, with the highest-performing model achieving 61% general accuracy but showing substantial variation across linguistic domains - from stronger performance in semantics (66%) to considerable weaknesses in phonology (31%) and morphology (36%). Through human preference evaluation with 95 annotators, we demonstrate a strong correlation between KoBALT scores and human judgments, validating our benchmark's effectiveness as a discriminative measure of Korean language understanding. KoBALT addresses critical gaps in linguistic evaluation for typologically diverse languages and provides a robust framework for assessing genuine linguistic competence in Korean language models.

Title: Robust Invariant Representation Learning by Distribution Extrapolation

Authors: Kotaro Yoshida, Slavakis Konstantinos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16126
Pdf URL: https://arxiv.org/pdf/2505.16126
Copy Paste: [[2505.16126]] Robust Invariant Representation Learning by Distribution Extrapolation(https://arxiv.org/abs/2505.16126)
Keywords: robust
Abstract: Invariant risk minimization (IRM) aims to enable out-of-distribution (OOD) generalization in deep learning by learning invariant representations. As IRM poses an inherently challenging bi-level optimization problem, most existing approaches -- including IRMv1 -- adopt penalty-based single-level approximations. However, empirical studies consistently show that these methods often fail to outperform well-tuned empirical risk minimization (ERM), highlighting the need for more robust IRM implementations. This work theoretically identifies a key limitation common to many IRM variants: their penalty terms are highly sensitive to limited environment diversity and over-parameterization, resulting in performance degradation. To address this issue, a novel extrapolation-based framework is proposed that enhances environmental diversity by augmenting the IRM penalty through synthetic distributional shifts. Extensive experiments -- ranging from synthetic setups to realistic, over-parameterized scenarios -- demonstrate that the proposed method consistently outperforms state-of-the-art IRM variants, validating its effectiveness and robustness.

Title: LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods

Authors: Hyang Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16129
Pdf URL: https://arxiv.org/pdf/2505.16129
Copy Paste: [[2505.16129]] LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods(https://arxiv.org/abs/2505.16129)
Keywords: large language model
Abstract: Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
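The generate-then-compare idea in this abstract can be illustrated with a minimal sketch. It assumes a reference translation has already been produced by an LLM, and it swaps in a toy bag-of-words vector for the sentence embeddings; `embed`, `score_translation`, and the example strings are hypothetical names and data, not the paper's implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_translation(hypothesis: str, generated_reference: str) -> float:
    """Score an MT hypothesis against a reference that an LLM produced."""
    return cosine(embed(hypothesis), embed(generated_reference))


# Hypothetical LLM-generated reference, for illustration only.
reference = "the cat sits on the mat"
good = score_translation("the cat sits on the mat", reference)
bad = score_translation("stock prices fell sharply today", reference)
```

A real pipeline would replace `embed` with a learned sentence encoder; the scoring step would stay a plain cosine similarity.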

Title: Scalable Graph Generative Modeling via Substructure Sequences

Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.16130
Pdf URL: https://arxiv.org/pdf/2505.16130
Copy Paste: [[2505.16130]] Scalable Graph Generative Modeling via Substructure Sequences(https://arxiv.org/abs/2505.16130)
Keywords: transformer, generative
Abstract: Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at this https URL.

Title: Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models

Authors: Menschikov Mikhail, Alexander Kharitonov, Maiia Kotyga, Vadim Porvatov, Anna Zhukovskaya, David Kagramanyan, Egor Shvetsov, Evgeny Burnaev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16134
Pdf URL: https://arxiv.org/pdf/2505.16134
Copy Paste: [[2505.16134]] Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models(https://arxiv.org/abs/2505.16134)
Keywords: large language model
Abstract: Large language models exhibit positional bias -- systematic neglect of information at specific context positions -- yet its interplay with linguistic diversity remains poorly understood. We present a cross-linguistic study across five typologically distinct languages (English, Russian, German, Hindi, Vietnamese), examining how positional bias interacts with model uncertainty, syntax, and prompting. Key findings: (1) Positional bias is model-driven, with language-specific variations -- Qwen2.5-7B favors late positions, challenging assumptions of early-token bias; (2) Explicit positional guidance (e.g., correct context is at position X) reduces accuracy across languages, undermining prompt-engineering practices; (3) Aligning context with positional bias increases entropy, yet minimal entropy does not predict accuracy; (4) LLMs differently impose dominant word order in free-word-order languages like Hindi.

Title: Outsourcing SAT-based Verification Computations in Network Security

Authors: Qi Duan, Ehab Al-Shaer
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16137
Pdf URL: https://arxiv.org/pdf/2505.16137
Copy Paste: [[2505.16137]] Outsourcing SAT-based Verification Computations in Network Security(https://arxiv.org/abs/2505.16137)
Keywords: security, privacy
Abstract: The emergence of cloud computing has had a huge impact on large computations. Cloud computing platforms make servers with large computation power available to customers. These servers can be used efficiently to solve problems that are complex by nature, for example, satisfiability (SAT) problems. Many practical problems can be converted to SAT, for example, circuit verification and network configuration analysis. However, outsourcing SAT instances to the servers may cause data leakage that can jeopardize the system's security. Before outsourcing the SAT instance, one needs to hide the input information. One way to preserve privacy and hide information is to randomize the SAT instance before outsourcing. In this paper, we present multiple novel methods to randomize SAT instances. We present a novel method to randomize the SAT instance, a variable randomization method to randomize the solution set, and methods to randomize Mincost SAT and MAX3SAT instances. Our analysis and evaluation show the correctness and feasibility of these randomization methods. The scalability and generality of our methods make them applicable to real-world problems.
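To make the idea of randomizing a SAT instance before outsourcing concrete, here is a minimal sketch of one generic disguise: renaming variables with a secret permutation and flipping random polarities. This is an assumed textbook-style transformation, not the methods proposed in the paper, and on its own it leaves the clause structure visible. All function names are hypothetical, and the brute-force solver only stands in for the remote server.

```python
import itertools
import random


def randomize_cnf(clauses, num_vars, rng):
    """Rename variables with a random permutation and randomly flip polarities.

    Returns the disguised clauses and the secret key (perm, flip) needed to
    translate a solution of the disguised instance back to the original.
    """
    perm = list(range(1, num_vars + 1))
    rng.shuffle(perm)  # perm[v - 1] is the new name of variable v
    flip = [rng.choice([1, -1]) for _ in range(num_vars)]  # -1 negates variable v

    def map_literal(lit):
        v = abs(lit)
        sign = (1 if lit > 0 else -1) * flip[v - 1]
        return sign * perm[v - 1]

    disguised = [[map_literal(lit) for lit in clause] for clause in clauses]
    rng.shuffle(disguised)  # also hide the clause order
    return disguised, (perm, flip)


def satisfies(clauses, assignment):
    """Check a CNF formula; `assignment` maps variable index -> bool."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)


def brute_force_solve(clauses, num_vars):
    """Stand-in for the remote solver; only viable for tiny instances."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = {v: bits[v - 1] for v in range(1, num_vars + 1)}
        if satisfies(clauses, assignment):
            return assignment
    return None


def recover(solution, key):
    """Translate a solution of the disguised instance back to the original variables."""
    perm, flip = key
    return {v: solution[perm[v - 1]] == (flip[v - 1] == 1)
            for v in range(1, len(perm) + 1)}


# Tiny illustrative instance in DIMACS-style signed-integer literals.
original = [[1, -2], [2, 3], [-1, -3]]
disguised, key = randomize_cnf(original, 3, random.Random(0))
remote_solution = brute_force_solve(disguised, 3)
local_solution = recover(remote_solution, key)
```

Because the renaming and flipping form a bijection on assignments, satisfiability is preserved, and only the holder of `key` can map the returned solution back.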

Title: Multimodal Online Federated Learning with Modality Missing in Internet of Things

Authors: Heqiang Wang, Xiang Liu, Xiaoxiong Zhong, Lixing Chen, Fangming Liu, Weizhe Zhang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2505.16138
Pdf URL: https://arxiv.org/pdf/2505.16138
Copy Paste: [[2505.16138]] Multimodal Online Federated Learning with Modality Missing in Internet of Things(https://arxiv.org/abs/2505.16138)
Keywords: federate
Abstract: The Internet of Things (IoT) ecosystem generates vast amounts of multimodal data from heterogeneous sources such as sensors, cameras, and microphones. As edge intelligence continues to evolve, IoT devices have progressed from simple data collection units to nodes capable of executing complex computational tasks. This evolution necessitates the adoption of distributed learning strategies to effectively handle multimodal data in an IoT environment. Furthermore, the real-time nature of data collection and limited local storage on edge devices in IoT call for an online learning paradigm. To address these challenges, we introduce the concept of Multimodal Online Federated Learning (MMO-FL), a novel framework designed for dynamic and decentralized multimodal learning in IoT environments. Building on this framework, we further account for the inherent instability of edge devices, which frequently results in missing modalities during the learning process. We conduct a comprehensive theoretical analysis under both complete and missing modality scenarios, providing insights into the performance degradation caused by missing modalities. To mitigate the impact of modality missing, we propose the Prototypical Modality Mitigation (PMM) algorithm, which leverages prototype learning to effectively compensate for missing modalities. Experimental results on two multimodal datasets further demonstrate the superior performance of PMM compared to benchmarks.

Title: Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16142
Pdf URL: https://arxiv.org/pdf/2505.16142
Copy Paste: [[2505.16142]] Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.16142)
Keywords: generative, large language model
Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

Title: GMatch: Geometry-Constrained Feature Matching for RGB-D Object Pose Estimation

Authors: Ming Yang, Haoran Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16144
Pdf URL: https://arxiv.org/pdf/2505.16144
Copy Paste: [[2505.16144]] GMatch: Geometry-Constrained Feature Matching for RGB-D Object Pose Estimation(https://arxiv.org/abs/2505.16144)
Keywords: robust, interpretability
Abstract: We present GMatch, a learning-free feature matcher designed for robust 6DoF object pose estimation, addressing common local ambiguities in sparse feature matching. Unlike traditional methods that rely solely on descriptor similarity, GMatch performs a guided, incremental search, enforcing SE(3)-invariant geometric consistency throughout the matching process. It leverages a provably complete set of geometric features that uniquely determine 3D keypoint configurations, ensuring globally consistent correspondences without the need for training or GPU support. When combined with classical descriptors such as SIFT, GMatch-SIFT forms a general-purpose pose estimation pipeline that offers strong interpretability and generalization across diverse objects and scenes. Experiments on the HOPE dataset show that GMatch outperforms both traditional and learning-based matchers, with GMatch-SIFT achieving or surpassing the performance of instance-level pose networks. On the YCB-Video dataset, GMatch-SIFT demonstrates high accuracy and low variance on texture-rich objects. These results not only validate the effectiveness of GMatch-SIFT for object pose estimation but also highlight the broader applicability of GMatch as a general-purpose feature matcher. Code will be released upon acceptance.

Title: When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

Authors: Zirui Pang, Haosheng Tan, Yuhan Pu, Zhijie Deng, Zhouan Shen, Keyu Hu, Jiaheng Wei
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16149
Pdf URL: https://arxiv.org/pdf/2505.16149
Copy Paste: [[2505.16149]] When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification(https://arxiv.org/abs/2505.16149)
Keywords: fair
Abstract: Image classification benchmark datasets such as CIFAR, MNIST, and ImageNet serve as critical tools for model evaluation. However, despite cleaning efforts, these datasets still suffer from pervasive noisy labels and often contain missing labels due to the co-existing image pattern where multiple classes appear in an image sample. This results in misleading model comparisons and unfair evaluations. Existing label cleaning methods focus primarily on noisy labels, but the issue of missing labels remains largely overlooked. Motivated by these challenges, we present a comprehensive framework named REVEAL, integrating state-of-the-art pre-trained vision-language models (e.g., LLaVA, BLIP, Janus, Qwen) with advanced machine/human label curation methods (e.g., Docta, Cleanlab, MTurk), to systematically address both noisy labels and missing label detection in widely-used image classification test sets. REVEAL detects potential noisy labels and omissions, aggregates predictions from various methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering. Additionally, we provide a thorough analysis of state-of-the-art vision-language models and pre-trained image classifiers, highlighting their strengths and limitations within the context of dataset renovation by revealing 10 observations. Our method effectively reveals missing labels from public datasets and provides soft-labeled results with likelihoods. Through human verifications, REVEAL significantly improves the quality of 6 benchmark test sets, aligning closely with human judgments and enabling more accurate and meaningful comparisons in image classification.

Title: BadDepth: Backdoor Attacks Against Monocular Depth Estimation in the Physical World

Authors: Ji Guo, Long Zhou, Zhijin Wang, Jiaming He, Qiyang Song, Aiguo Chen, Wenbo Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16154
Pdf URL: https://arxiv.org/pdf/2505.16154
Copy Paste: [[2505.16154]] BadDepth: Backdoor Attacks Against Monocular Depth Estimation in the Physical World(https://arxiv.org/abs/2505.16154)
Keywords: attack, robust, segmentation
Abstract: In recent years, deep learning-based Monocular Depth Estimation (MDE) models have been widely applied in fields such as autonomous driving and robotics. However, their vulnerability to backdoor attacks remains unexplored. To fill the gap in this area, we conduct a comprehensive investigation of backdoor attacks against MDE models. Typically, existing backdoor attack methods cannot be applied to MDE models. This is because the label used in MDE is in the form of a depth map. To address this, we propose BadDepth, the first backdoor attack targeting MDE models. BadDepth overcomes this limitation by selectively manipulating the target object's depth using an image segmentation model and restoring the surrounding areas via depth completion, thereby generating poisoned datasets for object-level backdoor attacks. To improve robustness in physical world scenarios, we further introduce digital-to-physical augmentation to adapt to the domain gap between the physical world and the digital domain. Extensive experiments on multiple models validate the effectiveness of BadDepth in both the digital domain and the physical world, without being affected by environmental factors.

Title: Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention

Authors: Yuang Ai, Huaibo Huang, Tao Wu, Qihang Fan, Ran He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16157
Pdf URL: https://arxiv.org/pdf/2505.16157
Copy Paste: [[2505.16157]] Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention(https://arxiv.org/abs/2505.16157)
Keywords: transformer
Abstract: Transformer-based models have made remarkable progress in image restoration (IR) tasks. However, the quadratic complexity of self-attention in Transformer hinders its applicability to high-resolution images. Existing methods mitigate this issue with sparse or window-based attention, yet inherently limit global context modeling. Linear attention, a variant of softmax attention, demonstrates promise in global context modeling while maintaining linear complexity, offering a potential solution to the above challenge. Despite its efficiency benefits, vanilla linear attention suffers from a significant performance drop in IR, largely due to the low-rank nature of its attention map. To counter this, we propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer. LAformer achieves effective global perception by integrating linear attention and channel attention, while also enhancing local fitting capabilities through a convolutional gated feed-forward network. Notably, LAformer eliminates hardware-inefficient operations such as softmax and window shifting, enabling efficient processing of high-resolution images. Extensive experiments across 7 IR tasks and 21 benchmarks demonstrate that LAformer outperforms SOTA methods and offers significant computational advantages.

Title: Why Can Accurate Models Be Learned from Inaccurate Annotations?

Authors: Chongjie Si, Yidan Cui, Fuchao Yang, Xiaokang Yang, Wei Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16159
Pdf URL: https://arxiv.org/pdf/2505.16159
Copy Paste: [[2505.16159]] Why Can Accurate Models Be Learned from Inaccurate Annotations?(https://arxiv.org/abs/2505.16159)
Keywords: robust
Abstract: Learning from inaccurate annotations has gained significant attention due to the high cost of precise labeling. However, despite the presence of erroneous labels, models trained on noisy data often retain the ability to make accurate predictions. This intriguing phenomenon raises a fundamental yet largely unexplored question: why can models still extract correct label information from inaccurate annotations? In this paper, we conduct a comprehensive investigation into this issue. By analyzing weight matrices from both empirical and theoretical perspectives, we find that label inaccuracy primarily accumulates noise in lower singular components and subtly perturbs the principal subspace. Within a certain range, the principal subspaces of weights trained on inaccurate labels remain largely aligned with those learned from clean labels, preserving essential task-relevant information. We formally prove that the angles of principal subspaces exhibit minimal deviation under moderate label inaccuracy, explaining why models can still generalize effectively. Building on these insights, we propose LIP, a lightweight plug-in designed to help classifiers retain principal subspace information while mitigating noise induced by label inaccuracy. Extensive experiments on tasks with various inaccuracy conditions demonstrate that LIP consistently enhances the performance of existing algorithms. We hope our findings can offer valuable theoretical and practical insights into model robustness under inaccurate supervision.

Title: EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

Authors: Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16160
Pdf URL: https://arxiv.org/pdf/2505.16160
Copy Paste: [[2505.16160]] EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios(https://arxiv.org/abs/2505.16160)
Keywords: large language model
Abstract: As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at this https URL.

Title: KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

Authors: Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, Sujian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16162
Pdf URL: https://arxiv.org/pdf/2505.16162
Copy Paste: [[2505.16162]] KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization(https://arxiv.org/abs/2505.16162)
Keywords: large language model
Abstract: Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm on various models and multiple tasks, observing that its application leads to a 1.3x-1.6x speedup in LLM inference.
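The matching step described here, picking a set of skipped layers according to the domain of the input, can be sketched as a nearest-neighbor lookup. The snippet below is a minimal 1-nearest-neighbor illustration under assumed inputs: the 2-D embeddings, the layer sets, and the name `nearest_skip_set` are all hypothetical and do not come from the paper.

```python
import math


def nearest_skip_set(query, anchors):
    """Return the skipped-layer set of the anchor closest to the query embedding.

    `anchors` is a list of (embedding, skipped_layers) pairs, one per known domain.
    """
    best = min(anchors, key=lambda item: math.dist(query, item[0]))
    return best[1]


# Hypothetical 2-D domain embeddings and layer sets, for illustration only.
anchors = [
    ((0.9, 0.1), frozenset({3, 7, 11})),  # e.g. a summarization-like domain
    ((0.1, 0.9), frozenset({2, 5, 9})),   # e.g. a code-like domain
]
chosen = nearest_skip_set((0.2, 0.8), anchors)
```

The draft model would then run with the layers in `chosen` skipped, while verification still uses the full target model.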

Title: Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task

Authors: Mengyang Qiu, Zoe Brisebois, Siena Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16164
Pdf URL: https://arxiv.org/pdf/2505.16164
Copy Paste: [[2505.16164]] Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task(https://arxiv.org/abs/2505.16164)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.

Title: RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition

Authors: Yechan Park, Gyuhyeon Pak, Euntai Kim
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.16165
Pdf URL: https://arxiv.org/pdf/2505.16165
Copy Paste: [[2505.16165]] RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition(https://arxiv.org/abs/2505.16165)
Keywords: robust, extraction, segmentation
Abstract: While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR-based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE-TRIP in real-world applications, we further propose (1) a keypoint extraction method, (2) a key instance segmentation method, (3) a RE-TRIP matching method, and (4) a reflectivity-combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE-TRIP. On public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios such as long corridors, bridges, large-scale urban areas, and highly dynamic environments, our experimental results show that the proposed method outperforms existing state-of-the-art methods such as Scan Context, Intensity Scan Context, and STD.

Title: TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Authors: Yuhao Xue, Zhifei Zhang, Xinyang Jiang, Yifei Shen, Junyao Gao, Wentao Gu, Jiale Zhao, Miaojing Shi, Cairong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16166
Pdf URL: https://arxiv.org/pdf/2505.16166
Copy Paste: [[2505.16166]] TRAIL: Transferable Robust Adversarial Images via Latent diffusion(https://arxiv.org/abs/2505.16166)
Keywords: security, attack, robust, diffusion
Abstract: Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address this challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images from a distribution that carries adversarial features and closely resembles the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.

Title: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

Authors: Yuqing Yang, Robin Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16170
Pdf URL: https://arxiv.org/pdf/2505.16170
Copy Paste: [[2505.16170]] When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction(https://arxiv.org/abs/2505.16170)
Keywords: large language model
Abstract: Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on this https URL.

Title: Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss

Authors: Abhay Kumara Sri Krishna Nandiraju, Gondy Leroy, David Kauchak, Arif Ahmed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16172
Pdf URL: https://arxiv.org/pdf/2505.16172
Copy Paste: [[2505.16172]] Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss(https://arxiv.org/abs/2505.16172)
Keywords: protect, generative
Abstract: Understanding health information is essential in achieving and maintaining a healthy life. We focus on simplifying health information for better understanding. With the availability of generative AI, the simplification process has become efficient and of reasonable quality; however, the algorithms remove information that may be crucial for comprehension. In this study, we compare generative AI approaches to detect missing information in simplified text, evaluate its importance, and fix the text with the missing information. We collected 50 health information texts and simplified them using gpt-4-0613. We compare five approaches to identify missing elements and regenerate the text by inserting the missing elements. These five approaches involve adding missing entities and missing words in various ways: 1) adding all the missing entities, 2) adding all missing words, 3) adding the top-3 entities ranked by gpt-4-0613, and 4, 5) serving as controls for comparison, adding randomly chosen entities. We use cosine similarity and ROUGE scores to evaluate the semantic similarity and content overlap between the original, simplified, and reconstructed simplified text. We do this for both summaries and full text. Overall, we find that adding missing entities improves the text. Adding all the missing entities resulted in better text regeneration, which was better than adding the top-ranked entities or words, or random words. Current tools can identify these entities, but are not valuable in ranking them.
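As a rough illustration of the content-overlap side of this evaluation, the sketch below computes a unigram ROUGE-style recall between an original text and its simplified or reconstructed versions. It is a simplified stand-in with whitespace tokenization, not the scoring setup used in the study, and the three example sentences are invented for illustration.

```python
from collections import Counter


def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams (with multiplicity) that also occur in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


# Invented example texts, for illustration only.
original = "take one tablet daily with food to reduce nausea"
simplified = "take one tablet daily"
restored = "take one tablet daily with food"

recall_simplified = rouge1_recall(original, simplified)
recall_restored = rouge1_recall(original, restored)
```

Re-inserting the dropped words raises recall against the original, which is the kind of signal such an overlap metric is meant to capture.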

Title: Erased or Dormant? Rethinking Concept Erasure Through Reversibility

Authors: Ping Liu, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16174
Pdf URL: https://arxiv.org/pdf/2505.16174
Copy Paste: [[2505.16174]] Erased or Dormant? Rethinking Concept Erasure Through Reversibility(https://arxiv.org/abs/2505.16174)
Keywords: robust, diffusion, generative
Abstract: To what extent does concept erasure eliminate generative capacity in diffusion models? While prior evaluations have primarily focused on measuring concept suppression under specific textual prompts, we explore a complementary and fundamental question: do current concept erasure techniques genuinely remove the ability to generate targeted concepts, or do they merely achieve superficial, prompt-specific suppression? We systematically evaluate the robustness and reversibility of two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion, by probing their ability to eliminate targeted generative behaviors in text-to-image models. These methods attempt to suppress undesired semantic concepts by modifying internal model parameters, either through targeted attention edits or model-level fine-tuning strategies. To rigorously assess whether these techniques truly erase generative capacity, we propose an instance-level evaluation strategy that employs lightweight fine-tuning to explicitly test the reactivation potential of erased concepts. Through quantitative metrics and qualitative analyses, we show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation, indicating that current methods suppress latent generative representations without fully eliminating them. Our findings reveal critical limitations in existing concept erasure approaches and highlight the need for deeper, representation-level interventions and more rigorous evaluation standards to ensure genuine, irreversible removal of concepts from generative models.

Title: Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

Authors: Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Ryoma Ishigaki, Yuta Hitomi, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16178
Pdf URL: https://arxiv.org/pdf/2505.16178
Copy Paste: [[2505.16178]] Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge(https://arxiv.org/abs/2505.16178)
Keywords: robust
Abstract: Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior: two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tends to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.

Title: Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics

Authors: Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16180
Pdf URL: https://arxiv.org/pdf/2505.16180
Copy Paste: [[2505.16180]] Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics(https://arxiv.org/abs/2505.16180)
Keywords: robust, interpretability
Abstract: Evaluating image captions requires cohesive assessment of both visual semantics and language pragmatics, which is often not entirely captured by most metrics. We introduce Redemption Score, a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) BERTScore for contextual text similarity against human references. A calibrated fusion of these signals allows Redemption Score to offer a more holistic assessment. On the Flickr8k benchmark, Redemption Score achieves a Kendall-$\tau$ of 56.43, outperforming twelve prior methods and demonstrating superior correlation with human judgments without requiring task-specific training. Our framework provides a more robust and nuanced evaluation by effectively redeeming image semantics and linguistic interpretability indicated by strong transfer of knowledge in the Conceptual Captions and MS COCO datasets.
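
For readers unfamiliar with the correlation statistic reported above, here is a minimal pure-Python sketch of Kendall's tau. The abstract does not state which variant was used, so the tau-b form (which corrects for ties) is an assumption, and the sample scores are made up for illustration.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    # Count pair types over all unordered pairs of observations.
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from tau-b
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: metric scores vs. human ratings for four captions.
metric_scores = [1, 3, 2, 4]
human_ratings = [1, 2, 3, 4]
print(kendall_tau_b(metric_scores, human_ratings))
```

The function returns a value in [-1, 1]; a metric that orders captions exactly as humans do scores 1.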

Title: Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Authors: Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, Anh Totti Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16181
Pdf URL: https://arxiv.org/pdf/2505.16181
Copy Paste: [[2505.16181]] Understanding Generative AI Capabilities in Everyday Image Editing Tasks(https://arxiv.org/abs/2505.16181)
Keywords: generative
Abstract: Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, only about 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the other side of the table, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits. Code and qualitative examples are available at: this https URL

Title: SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Authors: Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16188
Pdf URL: https://arxiv.org/pdf/2505.16188
Copy Paste: [[2505.16188]] SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models(https://arxiv.org/abs/2505.16188)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and politics polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.

Title: Enhancing Federated Survival Analysis through Peer-Driven Client Reputation in Healthcare

Authors: Navid Seidi, Satyaki Roy, Sajal Das
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16190
Pdf URL: https://arxiv.org/pdf/2505.16190
Copy Paste: [[2505.16190]] Enhancing Federated Survival Analysis through Peer-Driven Client Reputation in Healthcare(https://arxiv.org/abs/2505.16190)
Keywords: privacy, protect, robust, federate
Abstract: Federated Learning (FL) holds great promise for digital health by enabling collaborative model training without compromising patient data privacy. However, heterogeneity across institutions, lack of sustained reputation, and unreliable contributions remain major challenges. In this paper, we propose a robust, peer-driven reputation mechanism for federated healthcare that employs a hybrid communication model to integrate decentralized peer feedback with clustering-based noise handling to enhance model aggregation. Crucially, our approach decouples the federated aggregation and reputation mechanisms by applying differential privacy to client-side model updates before sharing them for peer evaluation. This ensures sensitive information remains protected during reputation computation, while unaltered updates are sent to the server for global model training. Using the Cox Proportional Hazards model for survival analysis across multiple federated nodes, our framework addresses both data heterogeneity and reputation deficit by dynamically adjusting trust scores based on local performance improvements measured via the concordance index. Experimental evaluations on both synthetic datasets and the SEER dataset demonstrate that our method consistently achieves high and stable C-index values, effectively down-weighting noisy client updates and outperforming FL methods that lack a reputation system.
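
The concordance index mentioned above can be sketched in a few lines. This is a generic Harrell-style C-index under simple assumptions (right-censoring, ties in predicted risk counted as half), not the paper's evaluation code; the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    # A pair (i, j) is comparable when subject i had an observed event
    # strictly before subject j's recorded time.
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher predicted risk failed earlier
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tie in predicted risk
    return concordant / comparable if comparable else 0.0


# Toy data: survival times, event indicators (1 = event observed), predicted risks.
times = [2, 5, 7, 9]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.4, 0.1]
print(concordance_index(times, events, risks))
```

A value of 1.0 means the predicted risks order every comparable pair correctly, while 0.5 corresponds to random ordering.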

Title: VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

Authors: Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16192
Pdf URL: https://arxiv.org/pdf/2505.16192
Copy Paste: [[2505.16192]] VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought(https://arxiv.org/abs/2505.16192)
Keywords: extraction
Abstract: Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.

Title: An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

Authors: Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16193
Pdf URL: https://arxiv.org/pdf/2505.16193
Copy Paste: [[2505.16193]] An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability(https://arxiv.org/abs/2505.16193)
Keywords: large language model
Abstract: The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.

Title: VIVID: A Novel Approach to Remediation Prioritization in Static Application Security Testing (SAST)

Authors: Naeem Budhwani, Mohammad Faghani, Hayden Richard
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16205
Pdf URL: https://arxiv.org/pdf/2505.16205
Copy Paste: [[2505.16205]] VIVID: A Novel Approach to Remediation Prioritization in Static Application Security Testing (SAST)(https://arxiv.org/abs/2505.16205)
Keywords: secure, security
Abstract: Static Application Security Testing (SAST) enables organizations to detect vulnerabilities in code early; however, major SAST platforms do not include visual aids and present little insight on correlations between tainted data chains. We propose VIVID (Vulnerability Information Via Data flow), a novel method to extract and consume SAST insights, which is to graph the application's vulnerability data flows (VDFs) and carry out graph theory analysis on the resulting VDF directed graph. Nine metrics were assessed to evaluate their effectiveness in analyzing the VDF graphs of deliberately insecure web applications. These metrics include 3 centrality metrics, 2 structural metrics, PageRank, in-degree, out-degree, and cross-clique connectivity. We present simulations showing that out-degree, betweenness centrality, in-eigenvector centrality, and cross-clique connectivity are associated with files exhibiting high vulnerability traffic, making them refactoring candidates where input sanitization may have been missed. Meanwhile, out-eigenvector centrality, PageRank, and in-degree were found to be associated with nodes enabling vulnerability flow and sinks, but not necessarily where input validation should be placed. This is a novel method to automatically provide development teams an evidence-based prioritized list of files to embed security controls into, informed by vulnerability propagation patterns in the application architecture.
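
To make the graph analysis concrete, the sketch below computes three of the nine metrics named in the abstract (in-degree, out-degree, and PageRank) on a tiny directed graph. The file names and edges are hypothetical, and the PageRank here is a plain power iteration with a conventional damping factor of 0.85, not the authors' tooling.

```python
def degrees(edges):
    # Out-degree and in-degree per node of a directed graph given as (src, dst) pairs.
    out_deg, in_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
        out_deg.setdefault(dst, 0)
        in_deg.setdefault(src, 0)
    return out_deg, in_deg


def pagerank(edges, damping=0.85, iterations=100):
    # Power iteration; rank held by nodes without outgoing edges is spread uniformly.
    nodes = sorted({node for edge in edges for node in edge})
    successors = {node: [] for node in nodes}
    for src, dst in edges:
        successors[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not successors[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for dst in successors[node]:
                new_rank[dst] += damping * rank[node] / len(successors[node])
        rank = new_rank
    return rank


# Hypothetical vulnerability data flows: tainted input travels from source file to sink file.
vdf_edges = [("form.php", "db.php"), ("api.php", "db.php"), ("db.php", "view.php")]
out_deg, in_deg = degrees(vdf_edges)
ranks = pagerank(vdf_edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, out_deg[node], in_deg[node], round(ranks[node], 3))
```

Sorting files by such a metric gives the kind of prioritized list the abstract describes; which metric to sort by depends on whether the goal is finding missed sanitization points or flow-enabling nodes.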

Title: NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

Authors: Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16210
Pdf URL: https://arxiv.org/pdf/2505.16210
Copy Paste: [[2505.16210]] NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics(https://arxiv.org/abs/2505.16210)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.
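
The idea of per-block quantile quantization for normally distributed values can be sketched as follows. This is a generic illustration, not the NQKV algorithm itself: the codebook construction (midpoint quantiles of a standard normal), the per-block mean and standard deviation scaling, and all function names are assumptions made for the example.

```python
from statistics import NormalDist, mean, pstdev


def normal_codebook(bits):
    # 2**bits levels placed at evenly spaced quantiles of the standard normal.
    levels = 2 ** bits
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]


def quantize_block(block, bits=4):
    # Standardize the block, then map each value to the index of the nearest level.
    mu = mean(block)
    sigma = pstdev(block) or 1.0  # guard against constant blocks
    codebook = normal_codebook(bits)
    codes = [
        min(range(len(codebook)), key=lambda i: abs((value - mu) / sigma - codebook[i]))
        for value in block
    ]
    return codes, mu, sigma


def dequantize_block(codes, mu, sigma, bits=4):
    # Look up each code and undo the standardization.
    codebook = normal_codebook(bits)
    return [mu + sigma * codebook[code] for code in codes]


# Hypothetical block of cache values.
block = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, mu, sigma = quantize_block(block, bits=4)
restored = dequantize_block(codes, mu, sigma, bits=4)
print(codes)
print([round(v, 3) for v in restored])
```

Each value is stored as a 4-bit index plus two per-block floats, which is where the memory saving comes from; placing levels at quantiles puts more levels where normally distributed values are dense.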

Title: Large Language Models based ASR Error Correction for Child Conversations

Authors: Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.16212
Pdf URL: https://arxiv.org/pdf/2505.16212
Copy Paste: [[2505.16212]] Large Language Models based ASR Error Correction for Child Conversations(https://arxiv.org/abs/2505.16212)
Keywords: large language model
Abstract: Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.

Title: A Scalable Hierarchical Intrusion Detection System for Internet of Vehicles

Authors: Md Ashraf Uddin, Nam H. Chu, Reza Rafeh, Mutaz Barika
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16215
Pdf URL: https://arxiv.org/pdf/2505.16215
Copy Paste: [[2505.16215]] A Scalable Hierarchical Intrusion Detection System for Internet of Vehicles(https://arxiv.org/abs/2505.16215)
Keywords: security, attack
Abstract: Due to its dynamic nature, mobility, and wireless data transfer, the Internet of Vehicles (IoV) is prone to various cyber threats, ranging from spoofing and Distributed Denial of Service (DDoS) attacks to malware. To safeguard the IoV ecosystem from intrusions, malicious activities, and policy violations, intrusion detection systems (IDS) play a critical role by continuously monitoring and analyzing network traffic to identify and mitigate potential threats in real-time. However, most existing research has focused on developing centralized, machine learning-based IDS for IoV without accounting for its inherently distributed nature. Due to intensive computing requirements, these centralized systems often rely on the cloud to detect cyber threats, increasing the delay of system response. On the other hand, edge nodes typically lack the necessary resources to train and deploy complex machine learning algorithms. To address this issue, this paper proposes an effective hierarchical classification framework tailored for IoV networks. Hierarchical classification allows classifiers to be trained and tested at different levels, enabling edge nodes to detect specific types of attacks independently. With this approach, edge nodes can conduct targeted attack detection while leveraging cloud nodes for comprehensive threat analysis and support. Given the resource constraints of edge nodes, we have employed the Boruta feature selection method to reduce data dimensionality, optimizing processing efficiency. To evaluate our proposed framework, we utilize the latest IoV security dataset CIC-IoV2024, achieving promising results that demonstrate the feasibility and effectiveness of our models in securing IoV networks.

Title: Memorization or Reasoning? Exploring the Idiom Understanding of LLMs

Authors: Jisu Kim, Youngwoo Shin, Uiji Hwang, Jihun Choi, Richeng Xuan, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16216
Pdf URL: https://arxiv.org/pdf/2505.16216
Copy Paste: [[2505.16216]] Memorization or Reasoning? Exploring the Idiom Understanding of LLMs(https://arxiv.org/abs/2505.16216)
Keywords: large language model
Abstract: Idioms have long posed a challenge due to their unique linguistic properties, which set them apart from other common expressions. While recent studies have leveraged large language models (LLMs) to handle idioms across various tasks, e.g., idiom-containing sentence generation and idiomatic machine translation, little is known about the underlying mechanisms of idiom processing in LLMs, particularly in multilingual settings. To this end, we introduce MIDAS, a new large-scale dataset of idioms in six languages, each paired with its corresponding meaning. Leveraging this resource, we conduct a comprehensive evaluation of LLMs' idiom processing ability, identifying key factors that influence their performance. Our findings suggest that LLMs rely not only on memorization, but also adopt a hybrid approach that integrates contextual cues and reasoning, especially when processing compositional idioms. This implies that idiom understanding in LLMs emerges from an interplay between internal knowledge retrieval and reasoning-based inference.

Title: Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

Authors: Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.16222
Pdf URL: https://arxiv.org/pdf/2505.16222
Copy Paste: [[2505.16222]] Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation(https://arxiv.org/abs/2505.16222)
Keywords: robust, fair, large language model
Abstract: With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations (such as differences in variable names, comments, or formatting) that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.

Title: Realistic Evaluation of TabPFN v2 in Open Environments

Authors: Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16226
Pdf URL: https://arxiv.org/pdf/2505.16226
Copy Paste: [[2505.16226]] Realistic Evaluation of TabPFN v2 in Open Environments(https://arxiv.org/abs/2505.16226)
Keywords: robust
Abstract: Tabular data, owing to its ubiquitous presence in real-world domains, has garnered significant attention in machine learning research. While tree-based models have long dominated tabular machine learning tasks, the recently proposed deep learning model TabPFN v2 has emerged, demonstrating unparalleled performance and scalability potential. Although extensive research has been conducted on TabPFN v2 to further improve performance, the majority of this research remains confined to closed environments, neglecting the challenges that frequently arise in open environments. This raises the question: Can TabPFN v2 maintain good performance in open environments? To this end, we conduct the first comprehensive evaluation of TabPFN v2's adaptability in open environments. We construct a unified evaluation framework covering various real-world challenges and assess the robustness of TabPFN v2 under open-environment scenarios using this framework. Empirical results demonstrate that TabPFN v2 shows significant limitations in open environments but is suitable for small-scale, covariate-shifted, and class-balanced tasks. Tree-based models remain the optimal choice for general tabular tasks in open environments. To facilitate future research on open-environment challenges, we advocate for open-environment tabular benchmarks, multi-metric evaluation, and universal modules to strengthen model robustness. We publicly release our evaluation framework at this https URL.

Title: MuseRAG: Idea Originality Scoring At Scale

Authors: Ali Sarosh Bangash, Krish Veera, Ishfat Abrar Islam, Raiyan Abdul Baten
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16232
Pdf URL: https://arxiv.org/pdf/2505.16232
Copy Paste: [[2505.16232]] MuseRAG: Idea Originality Scoring At Scale(https://arxiv.org/abs/2505.16232)
Keywords: large language model
Abstract: An objective, face-valid way to assess the originality of creative ideas is to measure how rare each idea is within a population -- an approach long used in creativity research but difficult to automate at scale. Tabulating response frequencies via manual bucketing of idea rephrasings is labor-intensive, error-prone, and brittle under large corpora. We introduce a fully automated, psychometrically validated pipeline for frequency-based originality scoring. Our method, MuseRAG, combines large language models (LLMs) with an externally orchestrated retrieval-augmented generation (RAG) framework. Given a new idea, the system retrieves semantically similar prior idea buckets and zero-shot prompts the LLM to judge whether the new idea belongs to an existing bucket or forms a new one. The resulting buckets enable computation of frequency-based originality metrics. Across five datasets (N=1143, n_ideas=16294), MuseRAG matches human annotators in idea clustering structure and resolution (AMI = 0.59) and in participant-level scoring (r = 0.89) -- while exhibiting strong convergent and external validity. Our work enables intent-sensitive, human-aligned originality scoring at scale to aid creativity research.
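
Once ideas have been assigned to buckets, a frequency-based originality score is straightforward to compute. The sketch below uses one simple rarity formulation (one minus the bucket's share of all responses); the abstract does not specify the exact metric, so this formula, the function name, and the bucket labels are assumptions for illustration.

```python
from collections import Counter


def originality_scores(bucket_ids):
    # Score each idea by how rare its bucket is among all collected ideas.
    counts = Counter(bucket_ids)
    total = len(bucket_ids)
    return [1 - counts[bucket] / total for bucket in bucket_ids]


# Hypothetical bucket assignments for four responses to "uses for a brick".
buckets = ["doorstop", "doorstop", "paperweight", "sculpture"]
print(originality_scores(buckets))  # → [0.5, 0.5, 0.75, 0.75]
```

The common "doorstop" idea scores lower than the two ideas that appear only once, which matches the intuition that rarer ideas are more original.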

Title: LIFEBench: Evaluating Length Instruction Following in Large Language Models

Authors: Wei Zhang, Zhenhong Zhou, Junfeng Fang, Rongwu Xu, Kun Wang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xinfeng Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16234
Pdf URL: https://arxiv.org/pdf/2505.16234
Copy Paste: [[2505.16234]] LIFEBench: Evaluating Length Instruction Following in Large Language Models(https://arxiv.org/abs/2505.16234)
Keywords: large language model
Abstract: While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions (e.g., write a 10,000-word novel). Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length instruction following ability, offering critical insights for future progress.
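
A simple way to quantify how far an output misses a requested length is a signed relative deviation. The abstract does not define the benchmark's metrics, so the formula below is an assumption for illustration, as is the whitespace word count, which suits English but not Chinese text.

```python
def length_deviation(output_text, target_words):
    # Signed relative deviation: negative means too short, positive means too long.
    actual_words = len(output_text.split())
    return (actual_words - target_words) / target_words


# Hypothetical case: the model was asked for 8 words but produced 4.
print(length_deviation("one two three four", 8))  # → -0.5
```

A value of -0.5 means the output is half the requested length; averaging absolute deviations across many prompts would give one rough measure of length instruction following.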

Title: Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation

Authors: Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, Enhong Chen, Tong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16237
Pdf URL: https://arxiv.org/pdf/2505.16237
Copy Paste: [[2505.16237]] Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation(https://arxiv.org/abs/2505.16237)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework in the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with LLM-summarized reasoning. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with LLM to produce coherent and accurate answers. Experiments on GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code will be available upon acceptance.

Title: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Authors: Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16239
Pdf URL: https://arxiv.org/pdf/2505.16239
Copy Paste: [[2505.16239]] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution(https://arxiv.org/abs/2505.16239)
Keywords: diffusion
Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: this https URL.

Title: Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

Authors: Viet-Anh Nguyen, Shiqian Zhao, Gia Dao, Runyi Hu, Yi Xie, Luu Anh Tuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16241
Pdf URL: https://arxiv.org/pdf/2505.16241
Copy Paste: [[2505.16241]] Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers(https://arxiv.org/abs/2505.16241)
Keywords: security, attack, robust, large language model
Abstract: Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies - random and adaptive - that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.

Title: Verifying Differentially Private Median Estimation

Authors: Hyukjun Kwon, Chenglin Fan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16246
Pdf URL: https://arxiv.org/pdf/2505.16246
Copy Paste: [[2505.16246]] Verifying Differentially Private Median Estimation(https://arxiv.org/abs/2505.16246)
Keywords: privacy, robust
Abstract: Differential Privacy (DP) is a robust privacy guarantee that is widely employed in private data analysis today, finding broad application in domains such as statistical query release and machine learning. However, DP achieves privacy by introducing noise into data or query answers, which malicious actors could exploit during analysis. To address this concern, we propose the first verifiable differentially private median estimation scheme based on zk-SNARKs. Our scheme combines the exponential mechanism and a utility function for median estimation into an arithmetic circuit, leveraging a scaled version of the inverse cumulative distribution function (CDF) method for precise sampling from the distribution derived from the utility function. This approach not only ensures privacy but also provides a mechanism to verify that the algorithm achieves DP guarantees without revealing sensitive information in the process.
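
The sampling step described in this abstract can be illustrated outside any proof system. The following is a minimal sketch, not the authors' circuit: it assumes the common rank-based utility for the median (sensitivity 1), a finite candidate grid, and inverse-CDF sampling over unnormalised weights, where the uniform draw is scaled by the total weight instead of normalising the CDF. The function name `dp_median` and its parameters are hypothetical, and nothing here involves zk-SNARKs.

```python
import bisect
import math
import random


def dp_median(data, candidates, epsilon, rng=random):
    """Sample a private median from `candidates` with the exponential mechanism.

    Assumed utility: u(c) = -|rank(c) - n/2|, where rank(c) is the number of
    data points strictly below c. This utility has sensitivity 1.
    """
    n = len(data)
    sorted_data = sorted(data)
    utilities = [
        -abs(bisect.bisect_left(sorted_data, c) - n / 2) for c in candidates
    ]

    # Exponential mechanism weights: exp(epsilon * u / (2 * sensitivity)).
    # Subtracting the maximum utility avoids underflow and leaves the
    # sampling distribution unchanged.
    max_u = max(utilities)
    weights = [math.exp(epsilon * (u - max_u) / 2.0) for u in utilities]

    # Unnormalised CDF: running sums of the weights.
    cdf = []
    total = 0.0
    for w in weights:
        total += w
        cdf.append(total)

    # Inverse-CDF sampling: scale a uniform draw by the total weight and
    # pick the first bucket whose cumulative weight exceeds it.
    draw = rng.random() * total
    idx = bisect.bisect_right(cdf, draw)
    return candidates[min(idx, len(candidates) - 1)]
```

With a small `epsilon` the output spreads over many candidates; with a large `epsilon` it concentrates on the candidates whose rank is closest to n/2.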

Title: Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models

Authors: Hwiyeong Lee, Uiji Hwang, Hyelim Lim, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16252
Pdf URL: https://arxiv.org/pdf/2505.16252
Copy Paste: [[2505.16252]] Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models(https://arxiv.org/abs/2505.16252)
Keywords: robust, large language model
Abstract: Large language models often retain unintended content, prompting growing interest in knowledge unlearning. Recent approaches emphasize localized unlearning, which restricts parameter updates to specific regions in an effort to remove target knowledge while preserving unrelated general knowledge. However, their effectiveness remains uncertain due to the lack of robust and thorough evaluation of the trade-off between the competing goals of unlearning. In this paper, we begin by revisiting existing localized unlearning approaches. We then conduct controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Our findings reveal that the set of parameters that must be modified for effective unlearning is not strictly determined, challenging the core assumption of localized unlearning that parameter locality is inherently indicative of effective knowledge removal.

Title: Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces

Authors: Preeti Mehta, Aman Sagar, Suchi Kumari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16253
Pdf URL: https://arxiv.org/pdf/2505.16253
Copy Paste: [[2505.16253]] Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces(https://arxiv.org/abs/2505.16253)
Keywords: robust, transformer
Abstract: This study aims to address the growing challenge of distinguishing computer-generated imagery (CGI) from authentic digital images across three different color spaces: RGB, YCbCr, and HSV. Given the limitations of existing classification methods in handling the complexity and variability of CGI, this research proposes a Swin Transformer-based model for accurate differentiation between natural and synthetic images. The proposed model leverages the Swin Transformer's hierarchical architecture to capture local and global features for distinguishing CGI from natural images. Its performance was assessed through intra- and inter-dataset testing across three datasets: CiFAKE, JSSSTU, and Columbia. The model was evaluated individually on each dataset (D1, D2, D3) and on the combined datasets (D1+D2+D3) to test its robustness and domain generalization. To address dataset imbalance, data augmentation techniques were applied. Additionally, t-SNE visualization was used to demonstrate the feature separability achieved by the Swin Transformer across the selected color spaces. The model's performance was tested across all color schemes, with the RGB color scheme yielding the highest accuracy for each dataset. As a result, RGB was selected for domain generalization analysis and compared with other CNN-based models, VGG-19 and ResNet-50. The comparative results demonstrate the proposed model's effectiveness in detecting CGI, highlighting its robustness and reliability in both intra-dataset and inter-dataset evaluations. The findings of this study highlight the Swin Transformer model's potential as an advanced tool for digital image forensics, particularly in distinguishing CGI from natural images. The model's strong performance indicates its capability for domain generalization, making it a valuable asset in scenarios requiring precise and reliable image classification.

Title: DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor

Authors: Yan Zhao, Zhengxue Cheng, Junxuan Zhang, Qunshan Gu, Qi Wang, Li Song
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16256
Pdf URL: https://arxiv.org/pdf/2505.16256
Copy Paste: [[2505.16256]] DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor(https://arxiv.org/abs/2505.16256)
Keywords: large language model
Abstract: Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200KB/s) on desktop CPUs. With far fewer parameters, DualComp achieves compression performance on par with the SOTA LLM-based methods for both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.

Title: Interpretable Anomaly Detection in Encrypted Traffic Using SHAP with Machine Learning Models

Authors: Kalindi Singh, Aayush Kashyap, Aswani Kumar Cherukuri
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16261
Pdf URL: https://arxiv.org/pdf/2505.16261
Copy Paste: [[2505.16261]] Interpretable Anomaly Detection in Encrypted Traffic Using SHAP with Machine Learning Models(https://arxiv.org/abs/2505.16261)
Keywords: security, privacy, robust, interpretability, explainability
Abstract: The widespread adoption of encrypted communication protocols such as HTTPS and TLS has enhanced data privacy but also rendered traditional anomaly detection techniques less effective, as they often rely on inspecting unencrypted payloads. This study aims to develop an interpretable machine learning-based framework for anomaly detection in encrypted network traffic. This study proposes a model-agnostic framework that integrates multiple machine learning classifiers with SHapley Additive exPlanations (SHAP) to ensure post-hoc model interpretability. The models are trained and evaluated on three benchmark encrypted traffic datasets. Performance is assessed using standard classification metrics, and SHAP is used to explain model predictions by attributing importance to individual input features. SHAP visualizations successfully revealed the most influential traffic features contributing to anomaly predictions, enhancing the transparency and trustworthiness of the models. Unlike conventional approaches that treat machine learning as a black box, this work combines robust classification techniques with explainability through SHAP, offering a novel interpretable anomaly detection system tailored for encrypted traffic environments. While the framework is generalizable, real-time deployment and performance under adversarial conditions require further investigation. Future work may explore adaptive models and real-time interpretability in operational network environments. This interpretable anomaly detection framework can be integrated into modern security operations for encrypted environments, allowing analysts not only to detect anomalies with high precision but also to understand why a model made a particular decision, a crucial capability in compliance-driven and mission-critical settings.

Title: All You Need is "Leet": Evading Hate-speech Detection AI

Authors: Sampanna Yashwant Kahu, Naman Ahuja
Subjects: cs.CR, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16263
Pdf URL: https://arxiv.org/pdf/2505.16263
Copy Paste: [[2505.16263]] All You Need is "Leet": Evading Hate-speech Detection AI(https://arxiv.org/abs/2505.16263)
Keywords: protect, attack
Abstract: Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate-speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate speech. Our best perturbation attack successfully evades hate-speech detection for 86.8% of hateful text.

Title: LINEA: Fast and Accurate Line Detection Using Scalable Transformers

Authors: Sebastian Janampa, Marios Pattichis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16264
Pdf URL: https://arxiv.org/pdf/2505.16264
Copy Paste: [[2505.16264]] LINEA: Fast and Accurate Line Detection Using Scalable Transformers(https://arxiv.org/abs/2505.16264)
Keywords: transformer
Abstract: Line detection is a basic digital image processing operation used by higher-level processing methods. Recently, transformer-based methods for line detection have proven to be more accurate than methods based on CNNs, at the expense of significantly lower inference speeds. As a result, video analysis methods that require low latencies cannot benefit from current transformer-based methods for line detection. In addition, current transformer-based models require pretraining attention mechanisms on large datasets (e.g., COCO or Object360). This paper develops a new transformer-based method that is significantly faster without requiring pretraining the attention mechanism on large datasets. We eliminate the need to pre-train the attention mechanism using a new mechanism, Deformable Line Attention (DLA). We use the term LINEA to refer to our new transformer-based method based on DLA. Extensive experiments show that LINEA is significantly faster and outperforms previous models on sAP in out-of-distribution dataset testing.

Title: Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

Authors: Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16265
Pdf URL: https://arxiv.org/pdf/2505.16265
Copy Paste: [[2505.16265]] Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models(https://arxiv.org/abs/2505.16265)
Keywords: robust, generative, large language model
Abstract: Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model's long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches.
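
For readers unfamiliar with the Bradley-Terry baseline that this abstract compares against, the standard formulation models the probability that a chosen response beats a rejected one as the sigmoid of the difference of their scalar rewards, and trains with the negative log-likelihood of that probability. The sketch below shows only this textbook loss; it is not Think-RM, and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_preference_prob(r_chosen, r_rejected):
    """P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return math.exp(-softplus(-(r_chosen - r_rejected)))


def bt_loss(reward_pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) pairs.

    Uses -log(sigmoid(d)) = softplus(-d) for the margin d.
    """
    losses = [softplus(-(rc - rr)) for rc, rr in reward_pairs]
    return sum(losses) / len(losses)
```

The loss depends only on the reward margin, so equal rewards give a loss of log 2 and a larger margin in favour of the chosen response lowers it.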

Title: Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Authors: Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16270
Pdf URL: https://arxiv.org/pdf/2505.16270
Copy Paste: [[2505.16270]] Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning(https://arxiv.org/abs/2505.16270)
Keywords: transformer, large language model
Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.

Title: Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility

Authors: Sheng-Fu Wang, Laurent Prevot, Jou-an Chi, Ri-Sheng Huang, Shu-Kai Hsieh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16277
Pdf URL: https://arxiv.org/pdf/2505.16277
Copy Paste: [[2505.16277]] Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility(https://arxiv.org/abs/2505.16277)
Keywords: large language model
Abstract: The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract. We then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict these two variables. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.

Title: DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.16278
Pdf URL: https://arxiv.org/pdf/2505.16278
Copy Paste: [[2505.16278]] DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving(https://arxiv.org/abs/2505.16278)
Keywords: robust, large language model
Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.

Title: HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

Authors: Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16281
Pdf URL: https://arxiv.org/pdf/2505.16281
Copy Paste: [[2505.16281]] HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation(https://arxiv.org/abs/2505.16281)
Keywords: large language model
Abstract: The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at this https URL.

Title: ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay

Authors: Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16282
Pdf URL: https://arxiv.org/pdf/2505.16282
Copy Paste: [[2505.16282]] ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay(https://arxiv.org/abs/2505.16282)
Keywords: large language model
Abstract: Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge to optimize long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-using capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Code and models: this https URL.
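
The idea of pairing group-relative advantages with a replay buffer can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes binary task rewards, advantages normalised by the group mean and standard deviation as in GRPO, and a buffer that substitutes one stored success into a group whose rollouts all failed (such a group would otherwise yield zero advantages and no learning signal). `ReplayBuffer`, `build_group`, and `group_relative_advantages` are hypothetical names.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward by the mean and std of its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class ReplayBuffer:
    """Keeps the most recent successful (trajectory, reward) per task."""

    def __init__(self):
        self._success = {}

    def add(self, task_id, trajectory, reward):
        if reward > 0:
            self._success[task_id] = (trajectory, reward)

    def sample(self, task_id):
        return self._success.get(task_id)


def build_group(task_id, rollouts, buffer):
    """rollouts: list of (trajectory, reward) pairs for one task.

    Stores new successes, then, if every rollout failed, replaces the last
    rollout with a cached success so the group has a non-zero reward spread.
    """
    for trajectory, reward in rollouts:
        buffer.add(task_id, trajectory, reward)
    if all(reward == 0 for _, reward in rollouts):
        cached = buffer.sample(task_id)
        if cached is not None:
            rollouts = rollouts[:-1] + [cached]
    return rollouts
```

A group of identical rewards produces all-zero advantages, which is the sparse-reward failure case the buffer is meant to soften in this sketch.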

Title: Efficient Prototype Consistency Learning in Medical Image Segmentation via Joint Uncertainty and Data Augmentation

Authors: Lijian Li, Yuanpeng He, Chi-Man Pun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16283
Pdf URL: https://arxiv.org/pdf/2505.16283
Copy Paste: [[2505.16283]] Efficient Prototype Consistency Learning in Medical Image Segmentation via Joint Uncertainty and Data Augmentation(https://arxiv.org/abs/2505.16283)
Keywords: segmentation
Abstract: Recently, prototype learning has emerged in semi-supervised medical image segmentation and achieved remarkable performance. However, the scarcity of labeled data limits the expressiveness of prototypes in previous methods, potentially hindering the complete representation of prototypes for class embedding. To overcome this issue, we propose an efficient prototype consistency learning via joint uncertainty quantification and data augmentation (EPCL-JUDA) to enhance the semantic expression of prototypes based on the framework of Mean-Teacher. The concatenation of original and augmented labeled data is fed into the student network to generate expressive prototypes. Then, a joint uncertainty quantification method is devised to optimize pseudo-labels and generate reliable prototypes for original and augmented unlabeled data separately. High-quality global prototypes for each class are formed by fusing labeled and unlabeled prototypes, which are utilized to generate prototype-to-features to conduct consistency learning. Notably, a prototype network is proposed to reduce high memory requirements brought by the introduction of augmented data. Extensive experiments on the Left Atrium, Pancreas-NIH, and Type B Aortic Dissection datasets demonstrate EPCL-JUDA's superiority over previous state-of-the-art approaches, confirming the effectiveness of our framework. The code will be released soon.

Title: Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse

Authors: Josh Alman, Zhao Song
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16284
Pdf URL: https://arxiv.org/pdf/2505.16284
Copy Paste: [[2505.16284]] Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse(https://arxiv.org/abs/2505.16284)
Keywords: transformer, large language model
Abstract: Attention mechanisms lie at the heart of modern large language models (LLMs). Straightforward algorithms for forward and backward (gradient) computation take quadratic time, and a line of work initiated by [Alman and Song NeurIPS 2023] and [Alman and Song NeurIPS 2024] has shown that quadratic time is necessary unless the model weights are small, in which case almost linear time algorithms are possible. In this paper, we show that large weights are necessary to avoid a strong preclusion to representational strength we call layer collapse, which means that the entire network can be approximated well by a network with only a single layer. Thus, the quadratic running time of attention is unavoidable for expressive transformers. The notion of layer collapse that we introduce is a variant on the notion of rank collapse from the work of [Dong, Cordonnier, and Loukas ICML 2021]. They showed that in Self Attention Networks with small weights and with skip connections, rank collapse must occur. This is typically interpreted as justifying the necessity of skip connections in expressive networks. However, our result shows that even with skip connections, if the weights are small, then layer collapse still occurs. Thus, only large weights, and not skip connections, can prevent these representational weaknesses.

Title: Fairness under Competition

Authors: Ronen Gradwohl, Eilam Shapira, Moshe Tennenholtz
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2505.16291
Pdf URL: https://arxiv.org/pdf/2505.16291
Copy Paste: [[2505.16291]] Fairness under Competition(https://arxiv.org/abs/2505.16291)
Keywords: fair
Abstract: Algorithmic fairness has emerged as a central issue in ML, and it has become standard practice to adjust ML algorithms so that they will satisfy fairness requirements such as Equal Opportunity. In this paper we consider the effects of adopting such fair classifiers on the overall level of ecosystem fairness. Specifically, we introduce the study of fairness with competing firms, and demonstrate the failure of fair classifiers in yielding fair ecosystems. Our results quantify the loss of fairness in systems, under a variety of conditions, based on classifiers' correlation and the level of their data overlap. We show that even if competing classifiers are individually fair, the ecosystem's outcome may be unfair; and that adjusting biased algorithms to improve their individual fairness may lead to an overall decline in ecosystem fairness. In addition to these theoretical results, we also provide supporting experimental evidence. Together, our model and results provide a novel and essential call for action.

Title: Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA

Authors: Rishabh Maheshwary, Masoud Hashemi, Khyati Mahajan, Shiva Krishna Reddy Malay, Sai Rajeswar, Sathwik Tejaswi Madhusudhan, Spandana Gella, Vikas Yadav
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16293
Pdf URL: https://arxiv.org/pdf/2505.16293
Copy Paste: [[2505.16293]] Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA(https://arxiv.org/abs/2505.16293)
Keywords: large language model
Abstract: Iterative RAG for multi-hop question answering faces challenges with lengthy contexts and the buildup of irrelevant information. This hinders a model's capacity to process and reason over retrieved content and limits performance. While recent methods focus on compressing retrieved information, they are either restricted to single-round RAG, require finetuning, or lack scalability in iterative RAG. To address these challenges, we propose Notes Writing, a method that generates concise and relevant notes from retrieved documents at each step, thereby reducing noise and retaining only essential information. This indirectly increases the effective context length of Large Language Models (LLMs), enabling them to reason and plan more effectively while processing larger volumes of input text. Notes Writing is framework agnostic and can be integrated with different iterative RAG methods. We demonstrate its effectiveness with three iterative RAG methods, across two models and four evaluation datasets. Notes Writing yields an average improvement of 15.6 percentage points overall, with minimal increase in output tokens.

Title: ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Authors: Seongryong Jung, Suwan Yoon, DongGeon Kim, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16297
Pdf URL: https://arxiv.org/pdf/2505.16297
Copy Paste: [[2505.16297]] ToDi: Token-wise Distillation via Fine-Grained Divergence Control(https://arxiv.org/abs/2505.16297)
Keywords: large language model
Abstract: Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD approaches, notably Forward KL (FKL) and Reverse KL (RKL), apply a uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi's effectiveness and practicality.
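
The per-token combination described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract only says the weight is a sigmoid of the teacher-student log-ratio, so the exact weight form (no temperature or scaling), the function names `todi_weight` and `todi_token_loss`, and the clamping constant `eps` are assumptions made here for illustration.

```python
import math

def todi_weight(p_teacher, q_student, eps=1e-12):
    """Assumed weight on the forward-KL term: sigmoid(log(p/q)).

    It exceeds 0.5 when the student underestimates the token (p > q),
    which is the case where the abstract says FKL is the useful signal.
    """
    log_ratio = math.log(max(p_teacher, eps)) - math.log(max(q_student, eps))
    return 1.0 / (1.0 + math.exp(-log_ratio))

def todi_token_loss(teacher_probs, student_probs, eps=1e-12):
    """Mix forward-KL and reverse-KL contributions token by token."""
    total = 0.0
    for p, q in zip(teacher_probs, student_probs):
        p, q = max(p, eps), max(q, eps)
        log_ratio = math.log(p) - math.log(q)
        alpha = todi_weight(p, q, eps)
        fkl_term = p * log_ratio      # forward KL summand: p * log(p/q)
        rkl_term = q * -log_ratio     # reverse KL summand: q * log(q/p)
        total += alpha * fkl_term + (1.0 - alpha) * rkl_term
    return total
```

With this unscaled weight each summand reduces to (p - q) * log(p / q), so the loss is zero when the two distributions match and positive otherwise. Whether gradients flow through the weight during training is not stated in the abstract and is left out of this sketch.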

Title: Poster: Towards an Automated Security Testing Framework for Industrial UEs

Authors: Sotiris Michaelides, Daniel Eguiguren Chavez, Martin Henze
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16300
Pdf URL: https://arxiv.org/pdf/2505.16300
Copy Paste: [[2505.16300]] Poster: Towards an Automated Security Testing Framework for Industrial UEs(https://arxiv.org/abs/2505.16300)
Keywords: secure, security
Abstract: With the ongoing adoption of 5G for communication in industrial systems and critical infrastructure, the security of industrial UEs such as 5G-enabled industrial robots becomes an increasingly important topic. Most notably, to meet the stringent security requirements of industrial deployments, industrial UEs not only have to fully comply with the 5G specifications but also have to correctly implement and use secure communication protocols such as TLS. To ensure the security of industrial UEs, operators of industrial 5G networks rely on security testing before deploying new devices to their production networks. However, currently only isolated tests for individual security aspects of industrial UEs exist, severely hindering comprehensive testing. In this paper, we report on our ongoing efforts to alleviate this situation by creating an automated security testing framework for industrial UEs to comprehensively evaluate their security posture before deployment. With this framework, we aim to provide stakeholders with a fully automated method to verify that higher-layer security protocols are correctly implemented, while simultaneously ensuring that the UE's protocol stack adheres to 3GPP specifications.

Title: INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling

Authors: Haochen Shi, Tianshi Zheng, Weiqi Wang, Baixuan Xu, Chunyang Li, Chunkit Chan, Tao Fan, Yangqiu Song, Qiang Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16303
Pdf URL: https://arxiv.org/pdf/2505.16303
Copy Paste: [[2505.16303]] INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling(https://arxiv.org/abs/2505.16303)
Keywords: large language model
Abstract: Large Language Model (LLM) routing is a pivotal technique for navigating a diverse landscape of LLMs, aiming to select the best-performing LLMs tailored to the domains of user queries, while managing computational resources. However, current routing approaches often face limitations in scalability when dealing with a large pool of specialized LLMs, or in their adaptability to extending model scope and evolving capability domains. To overcome those challenges, we propose InferenceDynamics, a flexible and scalable multi-dimensional routing framework that models the capability and knowledge of models. We operate it on our comprehensive dataset RouteMix, and demonstrate its effectiveness and generalizability in group-level routing using modern benchmarks including MMLU-Pro, GPQA, BigGenBench, and LiveBench, showcasing its ability to identify and leverage top-performing models for given tasks, leading to superior outcomes with efficient resource utilization. The broader adoption of InferenceDynamics can empower users to harness the full specialized potential of the LLM ecosystem, and our code will be made publicly available to encourage further research.

Title: SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation

Authors: Guohao Huo, Ruiting Dai, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16304
Pdf URL: https://arxiv.org/pdf/2505.16304
Copy Paste: [[2505.16304]] SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation(https://arxiv.org/abs/2505.16304)
Keywords: extraction, segmentation
Abstract: To address the challenge of complex pathological feature extraction in automated cardiac MRI segmentation, this study proposes an innovative dual-encoder architecture named SAMba-UNet. The framework achieves cross-modal feature collaborative learning by integrating the vision foundation model SAM2, the state-space model Mamba, and the classical UNet. To mitigate domain discrepancies between medical and natural images, a Dynamic Feature Fusion Refiner is designed, which enhances small lesion feature extraction through multi-scale pooling and a dual-path calibration mechanism across channel and spatial dimensions. Furthermore, a Heterogeneous Omni-Attention Convergence Module (HOACM) is introduced, combining global contextual attention with branch-selective emphasis mechanisms to effectively fuse SAM2's local positional semantics and Mamba's long-range dependency modeling capabilities. Experiments on the ACDC cardiac MRI dataset demonstrate that the proposed model achieves a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm, significantly outperforming existing methods, particularly in boundary localization for complex pathological structures such as right ventricular anomalies. This work provides an efficient and reliable solution for automated cardiac disease diagnosis, and the code will be open-sourced.
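
For readers unfamiliar with the Dice coefficient quoted in this abstract, a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), follows. The function name `dice_coefficient`, the flat-list mask representation, and the convention for two empty masks are assumptions for illustration; the paper's own evaluation code is not shown in the abstract, and HD95 is not covered here.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the foreground sizes of the two masks.
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * intersection / size_sum
```

A score of 1.0 means the predicted and reference masks coincide, and 0.0 means they share no foreground pixels.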

Title: PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models

Authors: Chenzhuo Zhao, Ziqian Liu, Xingda Wang, Junting Lu, Chaoyi Ruan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16307
Pdf URL: https://arxiv.org/pdf/2505.16307
Copy Paste: [[2505.16307]] PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models(https://arxiv.org/abs/2505.16307)
Keywords: large language model
Abstract: Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO's effectiveness, efficiency, and broad applicability.

Title: CAIFormer: A Causal Informed Transformer for Multivariate Time Series Forecasting

Authors: Xingyu Zhang, Wenwen Qiang, Siyu Zhao, Huijie Guo, Jiangmeng Li, Chuxiong Sun, Changwen Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16308
Pdf URL: https://arxiv.org/pdf/2505.16308
Copy Paste: [[2505.16308]] CAIFormer: A Causal Informed Transformer for Multivariate Time Series Forecasting(https://arxiv.org/abs/2505.16308)
Keywords: transformer
Abstract: Most existing multivariate time series forecasting methods adopt an all-to-all paradigm that feeds all variable histories into a unified model to predict their future values without distinguishing their individual roles. However, this undifferentiated paradigm makes it difficult to identify variable-specific causal influences and often entangles causally relevant information with spurious correlations. To address this limitation, we propose an all-to-one forecasting paradigm that predicts each target variable separately. Specifically, we first construct a Structural Causal Model from observational data and then, for each target variable, we partition the historical sequence into four sub-segments according to the inferred causal structure: endogenous, direct causal, collider causal, and spurious correlation. The prediction relies solely on the first three causally relevant sub-segments, while the spurious correlation sub-segment is excluded. Furthermore, we propose Causal Informed Transformer (CAIFormer), a novel forecasting model comprising three components: Endogenous Sub-segment Prediction Block, Direct Causal Sub-segment Prediction Block, and Collider Causal Sub-segment Prediction Block, which process the endogenous, direct causal, and collider causal sub-segments, respectively. Their outputs are then combined to produce the final prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of CAIFormer.

Title: Paired and Unpaired Image to Image Translation using Generative Adversarial Networks

Authors: Gaurav Kumar, Soham Satyadharma, Harpreet Singh
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16310
Pdf URL: https://arxiv.org/pdf/2505.16310
Copy Paste: [[2505.16310]] Paired and Unpaired Image to Image Translation using Generative Adversarial Networks(https://arxiv.org/abs/2505.16310)
Keywords: generative
Abstract: Image to image translation is an active area of research in the field of computer vision, enabling the generation of new images with different styles, textures, or resolutions while preserving their characteristic properties. Recent architectures leverage Generative Adversarial Networks (GANs) to transform input images from one domain to another. In this work, we focus on the study of both paired and unpaired image translation across multiple image domains. For the paired task, we used a conditional GAN model, and for the unpaired task, we trained it using cycle consistency loss. We experimented with different types of loss functions, multiple Patch-GAN sizes, and model architectures. New quantitative metrics - precision, recall, and FID score - were used for analysis. In addition, a qualitative study of the results of different experiments was conducted.
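
The cycle consistency loss mentioned for the unpaired task can be sketched in its usual form: translating an image to the other domain and back should reproduce the original. The snippet is a toy illustration on plain lists rather than the authors' training code; the names `l1_distance` and `cycle_consistency_loss` and the stand-in generators `g` (X to Y) and `f` (Y to X) are assumptions.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, g, f):
    """L1 reconstruction error after a round trip through both generators.

    g maps domain X to domain Y, f maps domain Y back to domain X.
    """
    forward_cycle = l1_distance(f(g(x)), x)   # x -> g(x) -> f(g(x)) should match x
    backward_cycle = l1_distance(g(f(y)), y)  # y -> f(y) -> g(f(y)) should match y
    return forward_cycle + backward_cycle
```

When `f` exactly inverts `g` the loss is zero; in practice this term is added to the adversarial losses with a weighting factor, which is omitted here.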

Title: Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings

Authors: Arjhun Swaminathan, Mete Akgün
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16313
Pdf URL: https://arxiv.org/pdf/2505.16313
Copy Paste: [[2505.16313]] Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings(https://arxiv.org/abs/2505.16313)
Keywords: attack
Abstract: Deep neural networks for image classification remain vulnerable to adversarial examples -- small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.

Title: NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

Authors: Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Jianhui Sun, Xinli Yue, Tianyi Wang, Huan Hou, Junda Lu, Xinyang Huang, Zitang Zhou, Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao, Trong-Hieu Nguyen-Mau, Minh-Hoang Le, Minh-Khoa Le-Phan, Duy-Nam Ly, Hai-Dang Nguyen, Minh-Triet Tran, Yukang Lin, Yan Hong, Chuanbiao Song, Siyuan Li, Jun Lan, Zhichao Zhang, Xinyue Li, Wei Sun, Zicheng Zhang, Yunhao Li, Xiaohong Liu, Guangtao Zhai, Zitong Xu, Huiyu Duan, Jiarui Wang, Guangji Ma, Liu Yang, Lu Liu, Qiang Hu, Xiongkuo Min, Zichuan Wang, Zhenchen Tang, Bo Peng, Jing Dong, Fengbin Guan, Zihao Yu, Yiting Lu, Wei Luo, Xin Li, Minhao Lin, Haofeng Chen, Xuanxuan He, Kele Xu, Qisheng Xu, Zijian Gao, Tianjiao Wan, Bo-Cheng Qiu, Chih-Chung Hsu, Chia-ming Lee, Yu-Fan Lin, Bo Yu, Zehao Wang, Da Mu, Mingxiu Chen, Junkang Fang, Huamei Sun, Wending Zhao, Zhiyu Wang, Wang Liu, Weikang Yu, Puhong Duan, Bin Sun, Xudong Kang, Shutao Li, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Jiarong He, Zhishan Qiao, Yongqing Huang, Zewen Chen, Zhe Pang, Juan Wang, Jian Guo, Zhizhuo Shao, Ziyu Feng, Bing Li, Weiming Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16314
Pdf URL: https://arxiv.org/pdf/2505.16314
Copy Paste: [[2505.16314]] NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment(https://arxiv.org/abs/2505.16314)
Keywords: generative
Abstract: This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track had a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants registered in the structure track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on T2I model quality assessment.

Title: SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models

Authors: Hossein Khalili, Seongbin Park, Venkat Bollapragada, Nader Sehatbakhsh
Subjects: cs.CV, cs.CR, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16318
Pdf URL: https://arxiv.org/pdf/2505.16318
Copy Paste: [[2505.16318]] SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models(https://arxiv.org/abs/2505.16318)
Keywords: defense, attack, robust
Abstract: As vision-based machine learning models are increasingly integrated into autonomous and cyber-physical systems, concerns about (physical) adversarial patch attacks are growing. While state-of-the-art defenses can achieve certified robustness with minimal impact on utility against highly-concentrated localized patch attacks, they fall short in two important areas: (i) state-of-the-art methods are vulnerable to low-noise distributed patches where perturbations are subtly dispersed to evade detection or masking, as shown recently by the DorPatch attack; (ii) achieving high robustness with state-of-the-art methods is extremely time- and resource-consuming, rendering them impractical for latency-sensitive applications in many cyber-physical systems. To address both robustness and latency issues, this paper proposes a new defense strategy for adversarial patch attacks called SuperPure. The key novelty is developing a pixel-wise masking scheme that is robust against both distributed and localized patches. The masking involves leveraging a GAN-based super-resolution scheme to gradually purify the image from adversarial patches. Our extensive evaluations using ImageNet and two standard classifiers, ResNet and EfficientNet, show that SuperPure advances the state-of-the-art in three major directions: (i) it improves the robustness against conventional localized patches by more than 20%, on average, while also improving top-1 clean accuracy by almost 10%; (ii) it achieves 58% robustness against distributed patch attacks (as opposed to 0% for the state-of-the-art method, PatchCleanser); (iii) it decreases the defense end-to-end latency by over 98% compared to PatchCleanser. Our further analysis shows that SuperPure is robust against white-box attacks and different patch sizes. Our code is open-source.

Title: FreshRetailNet-50K: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail

Authors: Yangyang Wang, Jiawei Gu, Li Long, Xin Li, Li Shen, Zhouyu Fu, Xiangjun Zhou, Xu Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16319
Pdf URL: https://arxiv.org/pdf/2505.16319
Copy Paste: [[2505.16319]] FreshRetailNet-50K: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail(https://arxiv.org/abs/2505.16319)
Keywords: robust
Abstract: Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (this https URL) and code (this https URL) are openly released.

Title: Efficient Motion Prompt Learning for Robust Visual Tracking

Authors: Jie Zhao, Xin Chen, Yongsheng Yuan, Michael Felsberg, Dong Wang, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16321
Pdf URL: https://arxiv.org/pdf/2505.16321
Copy Paste: [[2505.16321]] Efficient Motion Prompt Learning for Robust Visual Tracking(https://arxiv.org/abs/2505.16321)
Keywords: robust
Abstract: Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at this https URL.

Title: TensorAR: Refinement is All You Need in Autoregressive Image Generation

Authors: Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16324
Pdf URL: https://arxiv.org/pdf/2505.16324
Copy Paste: [[2505.16324]] TensorAR: Refinement is All You Need in Autoregressive Image Generation(https://arxiv.org/abs/2505.16324)
Keywords: diffusion
Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.

Title: CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation

Authors: Yuyang Jiang, Chacha Chen, Shengyuan Wang, Feng Li, Zecong Tang, Benjamin M. Mervak, Lydia Chelala, Christopher M Straus, Reve Chahine, Samuel G. Armato III, Chenhao Tan
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.16325
Pdf URL: https://arxiv.org/pdf/2505.16325
Copy Paste: [[2505.16325]] CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation(https://arxiv.org/abs/2505.16325)
Keywords: interpretability
Abstract: Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR not only examines whether a report can accurately identify the presence or absence of medical conditions, but also assesses whether it can precisely describe each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared to prior works, CLEAR's multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments show that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.

Title: ChemMLLM: Chemical Multimodal Large Language Model

Authors: Qian Tan, Dongzhan Zhou, Peng Xia, Wanhao Liu, Wanli Ouyang, Lei Bai, Yuqiang Li, Tianfan Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16326
Pdf URL: https://arxiv.org/pdf/2505.16326
Copy Paste: [[2505.16326]] ChemMLLM: Chemical Multimodal Large Language Model(https://arxiv.org/abs/2505.16326)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, in this paper, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks across text, molecular SMILES strings, and images, and curate the datasets. We benchmark ChemMLLM against a range of leading general MLLMs and chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in the molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 118.9% (4.27 vs 1.95 property improvement). The code is publicly available at this https URL.

Title: SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers

Authors: Wenqing Wu, Chengzhi Zhang, Tong Bao, Yi Zhao
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2505.16330
Pdf URL: https://arxiv.org/pdf/2505.16330
Copy Paste: [[2505.16330]] SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers(https://arxiv.org/abs/2505.16330)
Keywords: large language model
Abstract: Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at this https URL.

Title: Understanding Differential Transformer Unchains Pretrained Self-Attentions

Authors: Chaerin Kong, Jiho Jang, Nojun Kwak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16333
Pdf URL: https://arxiv.org/pdf/2505.16333
Copy Paste: [[2505.16333]] Understanding Differential Transformer Unchains Pretrained Self-Attentions(https://arxiv.org/abs/2505.16333)
Keywords: transformer
Abstract: Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise-canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering utilization of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX effectively incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves the pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%).

Title: Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

Authors: Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16334
Pdf URL: https://arxiv.org/pdf/2505.16334
Copy Paste: [[2505.16334]] Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text(https://arxiv.org/abs/2505.16334)
Keywords: large language model
Abstract: This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: this https URL

Title: FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16335
Pdf URL: https://arxiv.org/pdf/2505.16335
Copy Paste: [[2505.16335]] FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design(https://arxiv.org/abs/2505.16335)
Keywords: diffusion
Abstract: Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fréchet Inception Distance (FID) from 10.83 to 3.58, Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 image/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and GPU baseline, respectively.

Title: Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification

Authors: Amirreza Mahbod, Rupert Ecker, Ramona Woitek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16338
Pdf URL: https://arxiv.org/pdf/2505.16338
Copy Paste: [[2505.16338]] Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification(https://arxiv.org/abs/2505.16338)
Keywords: transformer
Abstract: Accurate classification of skin lesions from dermatoscopic images is essential for diagnosis and treatment of skin cancer. In this study, we investigate the utility of a dermatology-specific foundation model, PanDerm, in comparison with two Vision Transformer (ViT) architectures (ViT base and Swin Transformer V2 base) for the task of skin lesion classification. Using frozen features extracted from PanDerm, we apply non-linear probing with three different classifiers, namely, multi-layer perceptron (MLP), XGBoost, and TabNet. For the ViT-based models, we perform full fine-tuning to optimize classification performance. Our experiments on the HAM10000 and MSKCC datasets demonstrate that the PanDerm-based MLP model performs comparably to the fine-tuned Swin transformer model, while fusion of PanDerm and Swin Transformer predictions leads to further performance improvements. Future work will explore additional foundation models, fine-tuning strategies, and advanced fusion techniques.

Title: Improving Chemical Understanding of LLMs via SMILES Parsing

Authors: Yunhui Jang, Jaehyung Kim, Sungsoo Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16340
Pdf URL: https://arxiv.org/pdf/2505.16340
Copy Paste: [[2505.16340]] Improving Chemical Understanding of LLMs via SMILES Parsing(https://arxiv.org/abs/2505.16340)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, even failing to carry out basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks span from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best or competes with the baseline on the Mol-Instructions benchmark.

Title: Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Authors: Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16348
Pdf URL: https://arxiv.org/pdf/2505.16348
Copy Paste: [[2505.16348]] Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance(https://arxiv.org/abs/2505.16348)
Keywords: large language model
Abstract: Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: this https URL

Title: Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

Authors: Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16360
Pdf URL: https://arxiv.org/pdf/2505.16360
Copy Paste: [[2505.16360]] Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation(https://arxiv.org/abs/2505.16360)
Keywords: robust, diffusion, segmentation
Abstract: Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.

Title: AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

Authors: Huishuai Zhang, Bohan Wang, Luoxin Chen
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16363
Pdf URL: https://arxiv.org/pdf/2505.16363
Copy Paste: [[2505.16363]] AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training(https://arxiv.org/abs/2505.16363)
Keywords: transformer, large language model
Abstract: We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
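
Illustrative sketch (not part of the abstract): the denominator described above, "the root of weighted sum of squares of the momentum and the current gradient", can be written as a single update step that stores only the momentum buffer. The function name `adams_like_step`, the specific weights `beta2` and `1 - beta2`, the use of the previous momentum in the denominator, and the default hyperparameters are all assumptions made here for illustration; the paper should be consulted for the exact rule.

```python
import math
from typing import List, Tuple


def adams_like_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.95,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """One elementwise update that keeps no second-moment buffer.

    Assumed form: denom = sqrt(beta2 * m^2 + (1 - beta2) * g^2) + eps,
    where m is the momentum from the previous step and g the current gradient.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        m_new = beta1 * m + (1.0 - beta1) * g  # standard momentum update
        denom = math.sqrt(beta2 * m * m + (1.0 - beta2) * g * g) + eps
        new_params.append(p - lr * m_new / denom)
        new_momentum.append(m_new)
    return new_params, new_momentum


# The only optimizer state carried between steps is `momentum`.
params, momentum = adams_like_step([1.0, -2.0], [0.5, -0.5], [0.0, 0.0])
```

Compared with Adam, the per-parameter state shrinks from two buffers to one, which is the memory saving the abstract attributes to dropping second-moment estimates.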

Title: A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules

Authors: Manuel Ruiz-Botella, Marta Sales-Pardo, Roger Guimerà
Subjects: cs.LG, cs.AI, physics.comp-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.16365
Pdf URL: https://arxiv.org/pdf/2505.16365
Copy Paste: [[2505.16365]] A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules(https://arxiv.org/abs/2505.16365)
Keywords: diffusion
Abstract: Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, and potential biases and limitations of CoCoGraph.

Title: ReCopilot: Reverse Engineering Copilot in Binary Analysis

Authors: Guoqiang Chen, Huiqi Sun, Daguang Liu, Zhiqi Wang, Qiang Wang, Bin Yin, Lu Liu, Lingyun Ying
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16366
Pdf URL: https://arxiv.org/pdf/2505.16366
Copy Paste: [[2505.16366]] ReCopilot: Reverse Engineering Copilot in Binary Analysis(https://arxiv.org/abs/2505.16366)
Keywords: security, large language model
Abstract: Binary analysis plays a pivotal role in security domains such as malware detection and vulnerability discovery, yet it remains labor-intensive and heavily reliant on expert knowledge. General-purpose large language models (LLMs) perform well in programming analysis on source code, while binary-specific LLMs are underexplored. In this work, we present ReCopilot, an expert LLM designed for binary analysis tasks. ReCopilot integrates binary code knowledge through a meticulously constructed dataset, encompassing continued pretraining (CPT), supervised fine-tuning (SFT), and direct preference optimization (DPO) stages. It leverages variable data flow and call graphs to enhance context awareness and employs test-time scaling to improve reasoning capabilities. Evaluations on a comprehensive binary analysis benchmark demonstrate that ReCopilot achieves state-of-the-art performance in tasks such as function name recovery and variable type inference on the decompiled pseudo code, outperforming both existing tools and LLMs by 13%. Our findings highlight the effectiveness of domain-specific training and context enhancement, while also revealing challenges in building super long chain-of-thought. ReCopilot represents a significant step toward automating binary analysis with interpretable and scalable AI assistance in this domain.

Title: SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning

Authors: Huanyu Liu, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong, Ge Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16368
Pdf URL: https://arxiv.org/pdf/2505.16368
Copy Paste: [[2505.16368]] SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning(https://arxiv.org/abs/2505.16368)
Keywords: large language model
Abstract: How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
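
Illustrative sketch (not part of the abstract): the "rule-based verification" that makes SAT attractive as an RL task amounts to checking a proposed assignment against every clause. The clause encoding (DIMACS-style signed integers), the function name `satisfies`, and the toy formula are assumptions for illustration, not the Saturn implementation.

```python
from typing import Dict, List

Clause = List[int]  # signed literals: 3 means x3, -3 means NOT x3


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Return True only if every clause contains at least one true literal.

    A variable missing from `assignment` never satisfies a literal, so an
    incomplete answer is rejected unless the assigned part already suffices.
    """
    for clause in clauses:
        clause_ok = False
        for lit in clause:
            var = abs(lit)
            if var in assignment and assignment[var] == (lit > 0):
                clause_ok = True
                break
        if not clause_ok:
            return False
    return True


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
accepted = satisfies(cnf, {1: True, 2: True, 3: False})
rejected = satisfies(cnf, {1: True, 2: False, 3: True})
```

The check is linear in the formula size and needs no human label or model judge, which is the property the abstract contrasts with hard-to-verify tasks.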

Title: Privacy-Aware Cyberterrorism Network Analysis using Graph Neural Networks and Federated Learning

Authors: Anas Ali, Mubashar Husain, Peter Hans
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16371
Pdf URL: https://arxiv.org/pdf/2505.16371
Copy Paste: [[2505.16371]] Privacy-Aware Cyberterrorism Network Analysis using Graph Neural Networks and Federated Learning(https://arxiv.org/abs/2505.16371)
Keywords: secure, privacy, defense, robust, federate
Abstract: Cyberterrorism poses a formidable threat to digital infrastructures, with increasing reliance on encrypted, decentralized platforms that obscure threat actor activity. To address the challenge of analyzing such adversarial networks while preserving the privacy of distributed intelligence data, we propose a Privacy-Aware Federated Graph Neural Network (PA-FGNN) framework. PA-FGNN integrates graph attention networks, differential privacy, and homomorphic encryption into a robust federated learning pipeline tailored for cyberterrorism network analysis. Each client trains locally on sensitive graph data and exchanges encrypted, noise-perturbed model updates with a central aggregator, which performs secure aggregation and broadcasts global updates. We implement anomaly detection for flagging high-risk nodes and incorporate defenses against gradient poisoning. Experimental evaluations on simulated dark web and cyber-intelligence graphs demonstrate that PA-FGNN achieves over 91% classification accuracy, maintains resilience under 20% adversarial client behavior, and incurs less than 18% communication overhead. Our results highlight that privacy-preserving GNNs can support large-scale cyber threat detection without compromising on utility, privacy, or robustness.
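
Illustrative sketch (not part of the abstract): the client-side "noise-perturbed model updates" and server-side averaging can be outlined with the common clip-then-add-Gaussian-noise pattern from differentially private federated learning. The function names, the noise scale `sigma * max_norm`, and the plain mean are assumptions; the homomorphic encryption and secure aggregation steps named in the abstract are deliberately omitted here.

```python
import math
import random
from typing import List


def clip(update: List[float], max_norm: float) -> List[float]:
    """Scale the update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update: List[float], max_norm: float, sigma: float,
              rng: random.Random) -> List[float]:
    """Clip, then add independent Gaussian noise to each coordinate."""
    return [x + rng.gauss(0.0, sigma * max_norm) for x in clip(update, max_norm)]


def aggregate(updates: List[List[float]]) -> List[float]:
    """Coordinate-wise mean of client updates (unencrypted in this sketch)."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


rng = random.Random(0)
client_updates = [[3.0, 4.0], [0.3, -0.4], [-1.0, 2.0]]
noisy = [privatize(u, max_norm=1.0, sigma=0.5, rng=rng) for u in client_updates]
global_update = aggregate(noisy)
```

Clipping bounds each client's influence before noise is added, which is what lets the noise scale be set independently of how large any single update is.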

Title: Temporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition

Authors: Feng Liu, Bingyu Nan, Xuezhong Qian, Xiaolan Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16372
Pdf URL: https://arxiv.org/pdf/2505.16372
Copy Paste: [[2505.16372]] Temporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition(https://arxiv.org/abs/2505.16372)
Keywords: transformer
Abstract: When emotions are repressed, an individual's true feelings may be revealed through micro-expressions. Consequently, micro-expressions are regarded as a genuine source of insight into an individual's authentic emotions. However, the transient and highly localised nature of micro-expressions poses a significant challenge to their accurate recognition, with the accuracy rate of micro-expression recognition being as low as 50%, even for professionals. In order to address these challenges, it is necessary to explore the field of dynamic micro expression recognition (DMER) using multimodal fusion techniques, with special attention to the diverse fusion of temporal and spatial modal features. In this paper, we propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). This framework integrates a Retention Network (RetNet) and a transformer-based DMER network, with the objective of efficient micro-expression recognition through the capture and fusion of temporal and spatial relations. Meanwhile, we propose a novel parallel time-space fusion method from the perspective of modal fusion, which fuses spatio-temporal information in high-dimensional feature space, resulting in complementary "where-how" relationships at the semantic level and providing richer semantic information for the model. The experimental results demonstrate the superior performance of the TSFmicro method in comparison to other contemporary state-of-the-art methods. This is evidenced by its effectiveness on three well-recognised micro-expression datasets.

Title: DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Authors: Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16376
Pdf URL: https://arxiv.org/pdf/2505.16376
Copy Paste: [[2505.16376]] DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos(https://arxiv.org/abs/2505.16376)
Keywords: extraction
Abstract: Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to prohibitive computational costs of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at this https URL.

Title: PaTH Attention: Position Encoding via Accumulating Householder Transformations

Authors: Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16381
Pdf URL: https://arxiv.org/pdf/2505.16381
Copy Paste: [[2505.16381]] PaTH Attention: Position Encoding via Accumulating Householder Transformations(https://arxiv.org/abs/2505.16381)
Keywords: transformer, large language model
Abstract: The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
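
Illustrative sketch (not part of the abstract): a classical Householder reflection is H = I - 2 v v^T / (v^T v), and an accumulated product of such matrices stays orthogonal. The code below builds one reflection per input vector and multiplies them in sequence. Treating each `v` as a stand-in for a per-token, data-dependent vector is an assumption, as are all function names; PaTH's "Householder(like)" transformations and its efficient blockwise algorithm may differ from this naive dense version.

```python
from typing import List

Matrix = List[List[float]]


def householder(v: List[float]) -> Matrix:
    """Classical reflection H = I - 2 v v^T / (v^T v) for a nonzero vector v."""
    norm_sq = sum(x * x for x in v)
    n = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product of two square matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def accumulate(vectors: List[List[float]]) -> Matrix:
    """Left-multiply the reflections in order: H_t ... H_2 H_1."""
    n = len(vectors[0])
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v in vectors:
        result = matmul(householder(v), result)
    return result


# Hypothetical per-position vectors; in a data-dependent scheme these would
# be computed from the token representations.
product = accumulate([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
```

Because the transformation between two positions is the product of the matrices in between, it depends on the intervening inputs rather than only on their distance, which is the contrast with RoPE drawn in the abstract.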

Title: Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models

Authors: Kaiyu He, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16385
Pdf URL: https://arxiv.org/pdf/2505.16385
Copy Paste: [[2505.16385]] Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models(https://arxiv.org/abs/2505.16385)
Keywords: interpretability, large language model
Abstract: Large language models (LLMs) demonstrate remarkable ability in cross-lingual tasks. Understanding how LLMs acquire this ability is crucial for their interpretability. To quantify the cross-lingual ability of LLMs accurately, we propose a Word-Level Cross-Lingual Translation Task. To find how LLMs learn cross-lingual ability, we trace the outputs of LLMs' intermediate layers in the word translation task. We identify and distinguish two distinct behaviors in the forward pass of LLMs: co-occurrence behavior and semantic pivot behavior. We attribute LLMs' two distinct behaviors to the co-occurrence frequency of words and find the semantic pivot from the pre-training dataset. Finally, to apply our findings to improve the cross-lingual ability of LLMs, we reconstruct a semantic pivot-aware pre-training dataset using documents with a high proportion of semantic pivots. Our experiments validate the effectiveness of our approach in enhancing cross-lingual ability. Our research contributes insights into the interpretability of LLMs and offers a method for improving LLMs' cross-lingual ability.

Title: Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space

Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16386
Pdf URL: https://arxiv.org/pdf/2505.16386
Copy Paste: [[2505.16386]] Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space(https://arxiv.org/abs/2505.16386)
Keywords: interpretability
Abstract: The increasing complexity of large-scale language models has amplified concerns regarding their interpretability and reusability. While traditional embedding models like Word2Vec and GloVe offer scalability, they lack transparency and often behave as black boxes. Conversely, interpretable models such as the Tsetlin Machine (TM) have shown promise in constructing explainable learning systems, though they previously faced limitations in scalability and reusability. In this paper, we introduce Omni Tsetlin Machine AutoEncoder (Omni TM-AE), a novel embedding model that fully exploits the information contained in the TM's state matrix, including literals previously excluded from clause formation. This method enables the construction of reusable, interpretable embeddings through a single training phase. Extensive experiments across semantic similarity, sentiment classification, and document clustering tasks show that Omni TM-AE performs competitively with and often surpasses mainstream embedding models. These results demonstrate that it is possible to balance performance, scalability, and interpretability in modern Natural Language Processing (NLP) systems without resorting to opaque architectures.

Title: Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection

Authors: Benjamin Vendeville, Liana Ermakova, Pierre De Loor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16392
Pdf URL: https://arxiv.org/pdf/2505.16392
Copy Paste: [[2505.16392]] Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection(https://arxiv.org/abs/2505.16392)
Keywords: large language model
Abstract: The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.

Title: Consistent and Compatible Modelling of Cyber Intrusions and Incident Response Demonstrated in the Context of Malware Attacks on Critical Infrastructure

Authors: Peter Maynard, Yulia Cherdantseva, Avi Shaked, Pete Burnap, Arif Mehmood
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16398
Pdf URL: https://arxiv.org/pdf/2505.16398
Copy Paste: [[2505.16398]] Consistent and Compatible Modelling of Cyber Intrusions and Incident Response Demonstrated in the Context of Malware Attacks on Critical Infrastructure(https://arxiv.org/abs/2505.16398)
Keywords: security, attack
Abstract: Cyber Security Incident Response (IR) Playbooks are used to capture the steps required to recover from a cyber intrusion. Individual IR playbooks should focus on a specific type of incident and be aligned with the architecture of a system under attack. Intrusion modelling focuses on a specific potential cyber intrusion and is used to identify where and what countermeasures are needed, and the resulting intrusion models are expected to be used in effective IR, ideally by feeding IR playbook designs. IR playbooks and intrusion models, however, are created in isolation and at varying stages of the system's lifecycle. We take nine critical national infrastructure intrusion models - expressed using Sequential AND Attack Trees - and transform them into models of the same format as IR playbooks. We use Security Modelling Framework for modelling attacks and playbooks, and for demonstrating the feasibility of better integration between risk assessment and IR at the modelling level. This results in improved intrusion models and tighter coupling between IR playbooks and threat modelling which - as we demonstrate - yields novel insights into the analysis of attacks and response actions. The main contributions of this paper are (a) a novel way of representing attack trees using the Security Modelling Framework, (b) a new tool for converting Sequential AND attack trees into models compatible with playbooks, and (c) the examples of nine intrusion models represented using the Security Modelling Framework.

Title: Sketchy Bounding-box Supervision for 3D Instance Segmentation

Authors: Qian Deng, Le Hui, Jin Xie, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16399
Pdf URL: https://arxiv.org/pdf/2505.16399
Copy Paste: [[2505.16399]] Sketchy Bounding-box Supervision for 3D Instance Segmentation(https://arxiv.org/abs/2505.16399)
Keywords: segmentation
Abstract: Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore inaccurate bounding boxes, named sketchy bounding boxes, which are imitated by perturbing ground-truth bounding boxes with scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework, which jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box supervision. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapped parts between two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the region of coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods using sketchy bounding boxes. Code is available at this https URL.
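Illustration (not from the paper): the abstract says sketchy boxes are imitated by perturbing ground-truth boxes with scaling, translation, and rotation. A minimal sketch of such a perturbation follows. The `Box3D` layout, the function name `sketchy_box`, the uniform noise, and the default noise limits are all assumptions made for illustration; the paper's actual perturbation settings are not given in the abstract.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Box3D:
    # Hypothetical box layout: center, size, and yaw about the vertical axis.
    cx: float
    cy: float
    cz: float
    dx: float
    dy: float
    dz: float
    yaw: float  # radians


def sketchy_box(box, rng, max_scale=0.1, max_shift=0.1, max_rot=math.radians(5)):
    """Return a perturbed copy of a ground-truth box.

    Each size is scaled by a factor in [1 - max_scale, 1 + max_scale],
    each center coordinate is shifted by at most max_shift times the box
    size along that axis, and the yaw is rotated by at most max_rot.
    All limits are assumed values, not the paper's.
    """
    def jitter(limit):
        return rng.uniform(-limit, limit)

    return Box3D(
        cx=box.cx + jitter(max_shift) * box.dx,
        cy=box.cy + jitter(max_shift) * box.dy,
        cz=box.cz + jitter(max_shift) * box.dz,
        dx=box.dx * (1.0 + jitter(max_scale)),
        dy=box.dy * (1.0 + jitter(max_scale)),
        dz=box.dz * (1.0 + jitter(max_scale)),
        yaw=box.yaw + jitter(max_rot),
    )


ground_truth = Box3D(0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0)
noisy = sketchy_box(ground_truth, random.Random(0))
```

Passing a seeded `random.Random` keeps the perturbation reproducible, which is convenient when comparing runs at the same noise level.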

Title: AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Authors: Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16400
Pdf URL: https://arxiv.org/pdf/2505.16400
Copy Paste: [[2505.16400]] AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning(https://arxiv.org/abs/2505.16400)
Keywords: robust
Abstract: Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

Title: Divide-Fuse-Conquer: Eliciting "Aha Moments" in Multi-Scenario Games

Authors: Xiaoqing Zhang, Huabin Zheng, Ang Lv, Yuhan Liu, Zirui Song, Flood Sung, Xiuying Chen, Rui Yan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16401
Pdf URL: https://arxiv.org/pdf/2505.16401
Copy Paste: [[2505.16401]] Divide-Fuse-Conquer: Eliciting "Aha Moments" in Multi-Scenario Games(https://arxiv.org/abs/2505.16401)
Keywords: large language model
Abstract: Large language models (LLMs) have been observed to suddenly exhibit advanced reasoning abilities during reinforcement learning (RL), resembling an "aha moment" triggered by simple outcome-based rewards. While RL has proven effective in eliciting such breakthroughs in tasks involving mathematics, coding, and vision, it faces significant challenges in multi-scenario games. The diversity of game rules, interaction modes, and environmental complexities often leads to policies that perform well in one scenario but fail to generalize to others. Simply combining multiple scenarios during training introduces additional challenges, such as training instability and poor performance. To overcome these challenges, we propose Divide-Fuse-Conquer, a framework designed to enhance generalization in multi-scenario RL. This approach starts by heuristically grouping games based on characteristics such as rules and difficulties. Specialized models are then trained for each group to excel at the games in that group; this is what we refer to as the divide step. Next, we fuse model parameters from different groups into a new model and continue training it on multiple groups until the scenarios in all groups are conquered. Experiments across 18 TextArena games show that Qwen2.5-32B-Align trained with the Divide-Fuse-Conquer strategy reaches a performance level comparable to Claude3.5, achieving 7 wins and 4 draws. We hope our approach can inspire future research on using reinforcement learning to improve the generalization of LLMs.

Title: AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems

Authors: Yuanhao Huang, Yilong Ren, Jinlei Wang, Lujia Huo, Xuesong Bai, Jinchuan Zhang, Haiyan Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16402
Pdf URL: https://arxiv.org/pdf/2505.16402
Copy Paste: [[2505.16402]] AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems(https://arxiv.org/abs/2505.16402)
Keywords: attack, robust, transformer
Abstract: Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in safety accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D samples to address the challenges of intra-class diversity and environmental variations in real-world scenarios. Building upon this framework, we introduce an adversarial sample reality enhancement approach that incorporates non-rigid surface modeling and a realistic 3D matching mechanism. We compare with 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experimental results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Moreover, the proposed method demonstrates excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at this https URL.

Title: Performance Guaranteed Poisoning Attacks in Federated Learning: A Sliding Mode Approach

Authors: Huazi Pan, Yanjun Zhang, Leo Yu Zhang, Scott Adams, Abbas Kouzani, Suiyang Khoo
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2505.16403
Pdf URL: https://arxiv.org/pdf/2505.16403
Copy Paste: [[2505.16403]] Performance Guaranteed Poisoning Attacks in Federated Learning: A Sliding Mode Approach(https://arxiv.org/abs/2505.16403)
Keywords: attack, robust, steal, federate
Abstract: Manipulation of local training data and local updates, i.e., the poisoning attack, is the main threat arising from the collaborative nature of the federated learning (FL) paradigm. Most existing poisoning attacks aim to manipulate local data/models in a way that causes denial-of-service (DoS) issues. In this paper, we introduce a novel attack method, named the Federated Learning Sliding Attack (FedSA) scheme, which aims to control the extent of poisoning precisely and subtly. It operates with a predefined objective, such as reducing the global model's prediction accuracy by 10%. FedSA integrates robust nonlinear control, specifically Sliding Mode Control (SMC) theory, with model poisoning attacks. It can manipulate the updates from malicious clients to drive the global model towards a compromised state, achieving this at a controlled and inconspicuous rate. Additionally, leveraging the robust control properties of FedSA allows precise control over the convergence bounds, enabling the attacker to set the global accuracy of the poisoned model to any desired level. Experimental results demonstrate that FedSA can accurately achieve a predefined global accuracy with fewer malicious clients while maintaining a high level of stealth and adjustable learning rates.

Title: From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs

Authors: Muhammad Farid Adilazuarda, Chen Cecilia Liu, Iryna Gurevych, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16408
Pdf URL: https://arxiv.org/pdf/2505.16408
Copy Paste: [[2505.16408]] From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs(https://arxiv.org/abs/2505.16408)
Keywords: large language model
Abstract: Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness more than survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior.

Title: Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

Authors: Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16410
Pdf URL: https://arxiv.org/pdf/2505.16410
Copy Paste: [[2505.16410]] Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning(https://arxiv.org/abs/2505.16410)
Keywords: large language model
Abstract: Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at this https URL.

Title: Pose-invariant face recognition via feature-space pose frontalization

Authors: Nikolay Stanishev, Yuhang Lu, Touradj Ebrahimi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16412
Pdf URL: https://arxiv.org/pdf/2505.16412
Copy Paste: [[2505.16412]] Pose-invariant face recognition via feature-space pose frontalization(https://arxiv.org/abs/2505.16412)
Keywords: robust, generative
Abstract: Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods perform face frontalization via either generative models or learning a pose-robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training stage and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that our method not only outperforms the state-of-the-art in the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.

Title: Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

Authors: Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16415
Pdf URL: https://arxiv.org/pdf/2505.16415
Copy Paste: [[2505.16415]] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation(https://arxiv.org/abs/2505.16415)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models.
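Illustration (not from the paper): the abstract names Jensen-Shannon divergence as the attribution signal but does not spell out the scoring procedure. The sketch below computes the divergence itself and, as an assumed usage, ranks context sentences by how much the response distribution shifts when each sentence is removed. The function `rank_sentences` and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def rank_sentences(full_context_dist, ablated_dists):
    """Rank sentence ids by the divergence their removal causes.

    full_context_dist: next-token distribution with the full context.
    ablated_dists: maps a sentence id to the distribution obtained with
    that sentence removed (an assumed ablation setup).
    """
    scores = {
        sentence_id: js_divergence(full_context_dist, dist)
        for sentence_id, dist in ablated_dists.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


full = [0.7, 0.2, 0.1]
ablated = {
    "s1": [0.69, 0.21, 0.10],  # removal barely changes the output
    "s2": [0.10, 0.30, 0.60],  # removal changes the output a lot
}
ranking = rank_sentences(full, ablated)
```

Under this setup a sentence whose removal leaves the distribution almost unchanged scores near zero, so it lands at the bottom of the ranking.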

Title: Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16416
Pdf URL: https://arxiv.org/pdf/2505.16416
Copy Paste: [[2505.16416]] Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models(https://arxiv.org/abs/2505.16416)
Keywords: robust, large language model
Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at this https URL.
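Illustration (not from the paper): the equal-distance property described in the abstract can be checked with plain geometry. The sketch places text positions on the z-axis and image positions on a circle in the plane z = 0 centered on that axis, so the circle is orthogonal to the text line. The coordinates, radius, and function names are assumptions; this is a geometric picture only, not the paper's actual index mapping or rotary embedding.

```python
import math


def image_points(count, radius=1.0):
    """Place `count` image-token positions evenly on a circle in the z = 0 plane."""
    return [
        (
            radius * math.cos(2.0 * math.pi * k / count),
            radius * math.sin(2.0 * math.pi * k / count),
            0.0,
        )
        for k in range(count)
    ]


def text_point(index):
    """Place a text-token position on the z-axis, orthogonal to the circle's plane."""
    return (0.0, 0.0, float(index))


def distances_to_image_tokens(text_index, count, radius=1.0):
    """Euclidean distance from one text position to every image position."""
    origin = text_point(text_index)
    return [math.dist(origin, point) for point in image_points(count, radius)]


# Every distance equals sqrt(radius**2 + text_index**2), whatever the image token.
dists = distances_to_image_tokens(text_index=3, count=8, radius=2.0)
```

Lines from a text position to the circle trace out a cone, which matches the "cone-like structure" wording in the abstract. Distances between image positions on the circle still differ, so spatial relations within the image remain distinguishable.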

Title: WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

Authors: Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16421
Pdf URL: https://arxiv.org/pdf/2505.16421
Copy Paste: [[2505.16421]] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning(https://arxiv.org/abs/2505.16421)
Keywords: large language model
Abstract: While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.

Title: $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion

Authors: Jing Bi, Pinxin Liu, Ali Vosoughi, Jiarui Wu, Jinxi He, Chenliang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16425
Pdf URL: https://arxiv.org/pdf/2505.16425
Copy Paste: [[2505.16425]] $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion(https://arxiv.org/abs/2505.16425)
Keywords: diffusion
Abstract: The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address this limitation by proposing a language-driven framework that translates procedural text into coherent visual instructions. Our approach models the linguistic structure of instructional content by decomposing it into goal statements and sequential steps, then conditioning visual generation on these linguistic elements. We introduce three key innovations: (1) a constituency parser-based text encoding mechanism that preserves semantic completeness even with lengthy instructions, (2) a pairwise discourse coherence model that maintains consistency across instruction sequences, and (3) a novel evaluation protocol specifically designed for procedural language-to-image alignment. Our experiments across three instructional datasets (HTStep, CaptainCook4D, and WikiAll) demonstrate that our method significantly outperforms existing baselines in generating visuals that accurately reflect the linguistic content and sequential nature of instructions. This work contributes to the growing body of research on grounding procedural language in visual content, with applications spanning education, task guidance, and multimodal language understanding.

Title: Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

Authors: Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16429
Pdf URL: https://arxiv.org/pdf/2505.16429
Copy Paste: [[2505.16429]] Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems(https://arxiv.org/abs/2505.16429)
Keywords: robust
Abstract: Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time, and the newly introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research.

Title: Password Strength Detection via Machine Learning: Analysis, Modeling, and Evaluation

Authors: Jiazhi Mo, Hailu Kuang, Xiaoqi Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16439
Pdf URL: https://arxiv.org/pdf/2505.16439
Copy Paste: [[2505.16439]] Password Strength Detection via Machine Learning: Analysis, Modeling, and Evaluation(https://arxiv.org/abs/2505.16439)
Keywords: security, defense
Abstract: As network security issues continue gaining prominence, password security has become crucial in safeguarding personal information and network systems. This study first introduces various methods for system password cracking, outlines password defense strategies, and discusses the application of machine learning in the realm of password security. Subsequently, we conduct a detailed public password database analysis, uncovering standard features and patterns among passwords. We extract multiple characteristics of passwords, including length, the number of digits, the number of uppercase and lowercase letters, and the number of special characters. We then experiment with six different machine learning algorithms: support vector machines, logistic regression, neural networks, decision trees, random forests, and stacked models, evaluating each model's performance based on various metrics, including accuracy, recall, and F1 score through model validation and hyperparameter tuning. The evaluation results on the test set indicate that decision trees and stacked models excel in accuracy, recall, and F1 score, making them a practical option for the strong and weak password classification task.
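Illustration (not from the paper): the abstract lists the extracted characteristics as length, digit count, uppercase and lowercase letter counts, and special-character count. A minimal feature extractor along those lines could look like the sketch below. The function name, the dictionary keys, and the choice of `string.punctuation` as the definition of "special character" are assumptions.

```python
import string


def password_features(password):
    """Count the character classes named in the abstract for one password."""
    return {
        "length": len(password),
        "digits": sum(ch.isdigit() for ch in password),
        "uppercase": sum(ch.isupper() for ch in password),
        "lowercase": sum(ch.islower() for ch in password),
        # Assumption: "special" means ASCII punctuation.
        "special": sum(ch in string.punctuation for ch in password),
    }


features = password_features("Ab1!xyz9")
```

The resulting dictionaries can be turned into fixed-order numeric rows and passed to any of the classifiers the abstract mentions.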

Title: Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

Authors: Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16446
Pdf URL: https://arxiv.org/pdf/2505.16446
Copy Paste: [[2505.16446]] Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models(https://arxiv.org/abs/2505.16446)
Keywords: attack, steal, large language model
Abstract: Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities. However, the expanded input space introduces new attack surfaces. Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. As MLLMs increasingly incorporate cross-modal consistency and alignment mechanisms, such explicit attacks become easier to detect and block. In this work, we propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via least significant bit steganography and couples them with seemingly benign, image-related textual prompts. To further enhance attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes generated by a surrogate model and introduce a template optimization module that iteratively refines both the prompt and embedding based on model feedback. On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.

Title: TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition

Authors: Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16447
Pdf URL: https://arxiv.org/pdf/2505.16447
Copy Paste: [[2505.16447]] TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition(https://arxiv.org/abs/2505.16447)
Keywords: transformer
Abstract: TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can control computation by up to 40% at run-time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting it run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.

Title: CMRINet: Joint Groupwise Registration and Segmentation for Cardiac Function Quantification from Cine-MRI

Authors: Mohamed S. Elmahdy, Marius Staring, Patrick J. H. de Koning, Samer Alabed, Mahan Salehi, Faisal Alandejani, Michael Sharkey, Ziad Aldabbagh, Andrew J. Swift, Rob J. van der Geest
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16452
Pdf URL: https://arxiv.org/pdf/2505.16452
Copy Paste: [[2505.16452]] CMRINet: Joint Groupwise Registration and Segmentation for Cardiac Function Quantification from Cine-MRI(https://arxiv.org/abs/2505.16452)
Keywords: segmentation
Abstract: Accurate and efficient quantification of cardiac function is essential for the estimation of prognosis of cardiovascular diseases (CVDs). One of the most commonly used metrics for evaluating cardiac pumping performance is left ventricular ejection fraction (LVEF). However, LVEF can be affected by factors such as inter-observer variability and varying pre-load and after-load conditions, which can reduce its reproducibility. Additionally, cardiac dysfunction may not always manifest as alterations in LVEF, such as in heart failure and cardiotoxicity diseases. An alternative measure that can provide a relatively load-independent quantitative assessment of myocardial contractility is myocardial strain and strain rate. By using LVEF in combination with myocardial strain, it is possible to obtain a thorough description of cardiac function. Automated estimation of LVEF and other volumetric measures from cine-MRI sequences can be achieved through segmentation models, while strain calculation requires the estimation of tissue displacement between sequential frames, which can be accomplished using registration models. These tasks are often performed separately, potentially limiting the assessment of cardiac function. To address this issue, in this study we propose an end-to-end deep learning (DL) model that jointly estimates groupwise (GW) registration and segmentation for cardiac cine-MRI images. The proposed anatomically-guided Deep GW network was trained and validated on a large dataset of 4-chamber view cine-MRI image series of 374 subjects. A quantitative comparison with conventional GW registration using elastix and two DL-based methods showed that the proposed model improved performance and substantially reduced computation time.

Title: MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM

Authors: Siwei Meng, Yawei Luo, Ping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16456
Pdf URL: https://arxiv.org/pdf/2505.16456
Copy Paste: [[2505.16456]] MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM(https://arxiv.org/abs/2505.16456)
Keywords: diffusion, generative
Abstract: Recent advances in static 3D generation have intensified the demand for physically consistent dynamic 3D content. However, existing video generation models, including diffusion-based methods, often prioritize visual realism while neglecting physical plausibility, resulting in implausible object dynamics. Prior approaches for physics-aware dynamic generation typically rely on large-scale annotated datasets or extensive model fine-tuning, which imposes significant computational and data collection burdens and limits scalability across scenarios. To address these challenges, we present MAGIC, a training-free framework for single-image physical property inference and dynamic generation, integrating pretrained image-to-video diffusion models with iterative LLM-based reasoning. Our framework generates motion-rich videos from a static image and closes the visual-to-physical gap through a confidence-driven LLM feedback loop that adaptively steers the diffusion model toward physics-relevant motion. To translate visual dynamics into controllable physical behavior, we further introduce a differentiable MPM simulator operating directly on 3D Gaussians reconstructed from the single image, enabling physically grounded, simulation-ready outputs without any supervision or model tuning. Experiments show that MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than state-of-the-art video diffusion models.

Title: University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

Authors: Ikhlasul Akmal Hanif, Eryawan Presma Yulianrifat, Jaycent Gunawan Ongris, Eduardus Tjitrahardja, Muhammad Falensi Azmi, Rahmat Bryan Naufal, Alfan Farizki Wicaksono
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16460
Pdf URL: https://arxiv.org/pdf/2505.16460
Copy Paste: [[2505.16460]] University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection(https://arxiv.org/abs/2505.16460)
Keywords: transformer
Abstract: This paper presents our approach for SemEval 2025 Task 11 Track A, focusing on multilabel emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble combining multiple BGE models, where CatBoost serves as the classifier, with different configurations. This ensemble achieves an average F1-macro score of 56.58 across all languages.
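For readers unfamiliar with the metric quoted above, the following is a minimal sketch of macro-averaged F1 for multi-label predictions: F1 is computed per emotion label and then averaged with equal weight. The function name `macro_f1` and the zero-score convention for labels with no positives are assumptions made for illustration; the abstract does not state how the task's scorer handles such cases, and the reported 56.58 is additionally averaged across languages.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over binary indicator rows (one row per sample)."""
    n_labels = len(y_true[0])
    f1_scores: List[float] = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no positives and no predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels


if __name__ == "__main__":
    # Three samples, three hypothetical emotion labels.
    gold = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
    pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
    print(f"macro-F1 = {macro_f1(gold, pred):.4f}")
```

Because every label contributes equally, a rare emotion that is never predicted pulls the score down as much as a frequent one would.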

Title: AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

Authors: Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16463
Pdf URL: https://arxiv.org/pdf/2505.16463
Copy Paste: [[2505.16463]] AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer(https://arxiv.org/abs/2505.16463)
Keywords: transformer, segmentation
Abstract: Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, self-attention has quadratic complexity, $\mathcal{O}(n^2)$, and the time cost is high when the input image is split at a fine granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, so some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity is reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where $m$ is the number of anchors and $m < n$. Notably, by representing the anchors with the neurons in a neural layer, we can differentiably learn these distributions and approximate global self-attention through the Markov process. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of our AnchorFormer, e.g., achieving up to a 9.0% higher accuracy or 46.7% FLOPs reduction on ImageNet classification, and 81.3% higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.
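The complexity claim above can be illustrated with a generic two-step anchor attention. This is a simplified sketch and not necessarily the paper's formulation: it assumes the token-to-anchor and anchor-to-token attention maps are row-wise softmaxes of dot products, omits the usual scaling and learned projections, and all function names are hypothetical. The point it shows is that the $n \times n$ attention matrix is never formed, so the cost scales with $mn$ rather than $n^2$.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax_rows(m: Matrix) -> Matrix:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def anchor_attention(tokens: Matrix, anchors: Matrix, values: Matrix) -> Matrix:
    """Approximate global attention with two bipartite steps through m anchors."""
    # Token -> anchor transition probabilities, shape (n, m).
    t2a = softmax_rows(matmul(tokens, transpose(anchors)))
    # Anchor -> token transition probabilities, shape (m, n).
    a2t = softmax_rows(matmul(anchors, transpose(tokens)))
    # Aggregate values at the anchors first (m, d), then route back to tokens (n, d).
    anchor_summary = matmul(a2t, values)
    return matmul(t2a, anchor_summary)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # n = 4, d = 2
    anchors = [[1.0, 0.0], [0.0, 1.0]]                          # m = 2
    for row in anchor_attention(tokens, anchors, tokens):
        print([round(v, 4) for v in row])
```

The product of the two row-stochastic maps is itself row-stochastic, which is why the two-step routing can be read as a two-step Markov transition over tokens.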

Title: Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization

Authors: Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16467
Pdf URL: https://arxiv.org/pdf/2505.16467
Copy Paste: [[2505.16467]] Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization(https://arxiv.org/abs/2505.16467)
Keywords: generative, large language model
Abstract: Generative Large Language Models (LLMs) infer users' demographic information from subtle cues in the conversation -- a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models' latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model's internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.

Title: Consistent World Models via Foresight Diffusion

Authors: Yu Zhang, Xingzhuo Guo, Haoran Xu, Mingsheng Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16474
Pdf URL: https://arxiv.org/pdf/2505.16474
Copy Paste: [[2505.16474]] Consistent World Models via Foresight Diffusion(https://arxiv.org/abs/2505.16474)
Keywords: diffusion
Abstract: Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.

Title: Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Authors: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16483
Pdf URL: https://arxiv.org/pdf/2505.16483
Copy Paste: [[2505.16483]] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning(https://arxiv.org/abs/2505.16483)
Keywords: large language model
Abstract: Teaching large language models (LLMs) to be faithful to the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.

Title: LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing

Authors: Dario Di Palma, Alessandro De Bellis, Giovanni Servedio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16491
Pdf URL: https://arxiv.org/pdf/2505.16491
Copy Paste: [[2505.16491]] LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing(https://arxiv.org/abs/2505.16491)
Keywords: large language model
Abstract: Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.

Title: Accuracy vs. Accuracy: Computational Tradeoffs Between Classification Rates and Utility

Authors: Noga Amit, Omer Reingold, Guy N. Rothblum
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16494
Pdf URL: https://arxiv.org/pdf/2505.16494
Copy Paste: [[2505.16494]] Accuracy vs. Accuracy: Computational Tradeoffs Between Classification Rates and Utility(https://arxiv.org/abs/2505.16494)
Keywords: fair
Abstract: We revisit the foundations of fairness and its interplay with utility and efficiency in settings where the training data contain richer labels, such as individual types, rankings, or risk estimates, rather than just binary outcomes. In this context, we propose algorithms that achieve stronger notions of evidence-based fairness than are possible in standard supervised learning. Our methods support classification and ranking techniques that preserve accurate subpopulation classification rates, as suggested by the underlying data distributions, across a broad class of classification rules and downstream applications. Furthermore, our predictors enable loss minimization, whether aimed at maximizing utility or in the service of fair treatment. Complementing our algorithmic contributions, we present impossibility results demonstrating that simultaneously achieving accurate classification rates and optimal loss minimization is, in some cases, computationally infeasible. Unlike prior impossibility results, our notions are not inherently in conflict and are simultaneously satisfied by the Bayes-optimal predictor. Furthermore, we show that each notion can be satisfied individually via efficient learning. Our separation thus stems from the computational hardness of learning a sufficiently good approximation of the Bayes-optimal predictor. These computational impossibilities present a choice between two natural and attainable notions of accuracy that could both be motivated by fairness.

Title: ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Authors: Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16495
Pdf URL: https://arxiv.org/pdf/2505.16495
Copy Paste: [[2505.16495]] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation(https://arxiv.org/abs/2505.16495)
Keywords: large language model, segmentation
Abstract: While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-off between mask quality and efficiency are implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at this https URL.

Title: Language-based Security and Time-inserting Supervisor

Authors: Damas P. Gruska
Subjects: cs.CR, cs.LO
Abstract URL: https://arxiv.org/abs/2505.16503
Pdf URL: https://arxiv.org/pdf/2505.16503
Copy Paste: [[2505.16503]] Language-based Security and Time-inserting Supervisor(https://arxiv.org/abs/2505.16503)
Keywords: secure, security, attack
Abstract: Algebraic methods are employed to define language-based security properties of processes. A supervisor is introduced that can disable unwanted behavior of an insecure process by controlling some of its actions or by inserting timed actions to make an insecure process secure. We assume a situation where neither the supervisor nor the attacker has complete information about the ongoing system's behavior. We study the conditions under which such a supervisor exists, as well as its properties and limitations.

Title: Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Authors: Jiaxin Liu, Jia Wang, Saihui Hou, Min Ren, Huijia Wu, Zhaofeng He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16512
Pdf URL: https://arxiv.org/pdf/2505.16512
Copy Paste: [[2505.16512]] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection(https://arxiv.org/abs/2505.16512)
Keywords: security, diffusion
Abstract: In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.

Title: Detailed Evaluation of Modern Machine Learning Approaches for Optic Plastics Sorting

Authors: Vaishali Maheshkar, Aadarsh Anantha Ramakrishnan, Charuvahan Adhivarahan, Karthik Dantu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16513
Pdf URL: https://arxiv.org/pdf/2505.16513
Copy Paste: [[2505.16513]] Detailed Evaluation of Modern Machine Learning Approaches for Optic Plastics Sorting(https://arxiv.org/abs/2505.16513)
Keywords: segmentation
Abstract: According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near-Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two-stage detectors like Mask R-CNN use region proposals and classification with deep backbones like ResNet. Single-stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optic detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad-CAM, saliency maps, and confusion matrices were employed to interpret model behavior. We perform this analysis on our custom-trained models from the compiled datasets. We find that optic recognition methods have limited success in accurate sorting of real-world plastics at MRFs, primarily because they rely on physical properties such as color and shape.

Title: AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios

Authors: Yuting Huang, Meitong Guo, Yiquan Wu, Ang Li, Xiaozhong Liu, Keting Yin, Changlong Sun, Fei Wu, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16514
Pdf URL: https://arxiv.org/pdf/2505.16514
Copy Paste: [[2505.16514]] AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios(https://arxiv.org/abs/2505.16514)
Keywords: fair
Abstract: Recent advances in LegalAI have primarily focused on individual case judgment analysis, often overlooking the critical appellate process within the judicial system. Appeals serve as a core mechanism for error correction and ensuring fair trials, making them highly significant both in practice and in research. To address this gap, we present the AppealCase dataset, consisting of 10,000 pairs of real-world, matched first-instance and second-instance documents across 91 categories of civil cases. The dataset also includes detailed annotations along five dimensions central to appellate review: judgment reversals, reversal reasons, cited legal provisions, claim-level decisions, and whether there is new information in the second instance. Based on these annotations, we propose five novel LegalAI tasks and conduct a comprehensive evaluation across 20 mainstream models. Experimental results reveal that all current models achieve less than 50% F1 scores on the judgment reversal prediction task, highlighting the complexity and challenge of the appeal scenario. We hope that the AppealCase dataset will spur further research in LegalAI for appellate case analysis and contribute to improving consistency in judicial decision-making.

Title: Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods

Authors: Majid Mohammadi, Siu Lun Chau, Krikamol Muandet
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16516
Pdf URL: https://arxiv.org/pdf/2505.16516
Copy Paste: [[2505.16516]] Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods(https://arxiv.org/abs/2505.16516)
Keywords: interpretability, explainability
Abstract: Kernel methods are widely used in machine learning due to their flexibility and expressive power. However, their black-box nature poses significant challenges to interpretability, limiting their adoption in high-stakes applications. Shapley value-based feature attribution techniques, such as SHAP and kernel-specific variants like RKHS-SHAP, offer a promising path toward explainability. Yet, computing exact Shapley values remains computationally intractable in general, motivating the development of various approximation schemes. In this work, we introduce PKeX-Shapley, a novel algorithm that utilizes the multiplicative structure of product kernels to enable the exact computation of Shapley values in polynomial time. We show that product-kernel models admit a functional decomposition that allows for a recursive formulation of Shapley values. This decomposition not only yields computational efficiency but also enhances interpretability in kernel-based learning. We also demonstrate how our framework can be generalized to explain kernel-based statistical discrepancies such as the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC), thus offering new tools for interpretable statistical inference.
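To make the intractability remark above concrete, here is a brute-force Shapley computation that enumerates every coalition. It is not the PKeX-Shapley algorithm, whose recursion the abstract does not spell out; it is the textbook definition, which costs on the order of $2^n$ value-function evaluations per feature. The function name `exact_shapley` and the toy product-form value function are assumptions chosen only to echo the multiplicative structure the abstract mentions.

```python
from itertools import combinations
from math import factorial, prod
from typing import Callable, FrozenSet, List


def exact_shapley(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Textbook Shapley values by enumerating all coalitions (exponential in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability weight of a coalition of this size preceding player i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


if __name__ == "__main__":
    factors = [2.0, 3.0, 0.5]  # hypothetical per-feature factors

    def product_game(coalition: FrozenSet[int]) -> float:
        # Toy value function: product of the factors in the coalition (1.0 if empty).
        return prod(factors[j] for j in coalition)

    phi = exact_shapley(len(factors), product_game)
    print("Shapley values:", [round(p, 4) for p in phi])
    print("Sum of values:", round(sum(phi), 4))
```

The attributions sum to the value of the full coalition minus the value of the empty one (the efficiency property), which is a quick sanity check for any faster method.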

Title: Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs

Authors: Giovanni Servedio, Alessandro De Bellis, Dario Di Palma, Vito Walter Anelli, Tommaso Di Noia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16520
Pdf URL: https://arxiv.org/pdf/2505.16520
Copy Paste: [[2505.16520]] Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs(https://arxiv.org/abs/2505.16520)
Keywords: large language model
Abstract: Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.

Title: Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing

Authors: Zhouhao Sun, Zhiyuan Kan, Xiao Ding, Li Du, Yang Zhao, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16522
Pdf URL: https://arxiv.org/pdf/2505.16522
Copy Paste: [[2505.16522]] Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing(https://arxiv.org/abs/2505.16522)
Keywords: large language model
Abstract: Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks have been proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfactory, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.

Title: CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving

Authors: Huitong Yang, Zhuoxiao Chen, Fengyi Zhang, Zi Huang, Yadan Luo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16524
Pdf URL: https://arxiv.org/pdf/2505.16524
Copy Paste: [[2505.16524]] CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving(https://arxiv.org/abs/2505.16524)
Keywords: robust
Abstract: Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model's penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection by 14.9% NDS on nuScenes-C and LiDAR-based detection by over 7.6% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction and planning even without training. Code and pretrained models are released in the supplementary material.

Title: EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance

Authors: Heejae Suh, Yejin Jeon, Deokhyung Kang, Taehee Park, Yejin Min, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16526
Pdf URL: https://arxiv.org/pdf/2505.16526
Copy Paste: [[2505.16526]] EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance(https://arxiv.org/abs/2505.16526)
Keywords: robust, large language model
Abstract: Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.

Title: Joint Relational Database Generation via Graph-Conditional Diffusion Models

Authors: Mohamed Amine Ketata, David Lüdke, Leo Schwinn, Stephan Günnemann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16527
Pdf URL: https://arxiv.org/pdf/2505.16527
Copy Paste: [[2505.16527]] Joint Relational Database Generation via Graph-Conditional Diffusion Models(https://arxiv.org/abs/2505.16527)
Keywords: privacy, diffusion, generative
Abstract: Building generative models for relational databases (RDBs) is important for applications like privacy-preserving data release and augmenting real datasets. However, most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. This approach limits parallelism, restricts flexibility in downstream applications like missing value imputation, and compounds errors due to commonly made conditional independence assumptions. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order. By using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM). GRDM leverages a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs demonstrate that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.

Title: DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection

Authors: Yuliang Yan, Haochun Tang, Shuo Yan, Enyan Dai
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16530
Pdf URL: https://arxiv.org/pdf/2505.16530
Copy Paste: [[2505.16530]] DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection(https://arxiv.org/abs/2505.16530)
Keywords: protect, steal, watermark, large language model
Abstract: Large language models (LLMs) are considered valuable Intellectual Properties (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or are limited to white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel $\textbf{Du}$al-Level $\textbf{Fin}$gerprinting $\textbf{F}$ramework for ownership verification in the black-box setting. DuFFin extracts the trigger pattern and the knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from the open-source website, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on its model variants, achieving an IP-ROC metric greater than 0.95. Our code is available at this https URL.

Title: SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion

Authors: Asrar Alruwayqi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16535
Pdf URL: https://arxiv.org/pdf/2505.16535
Copy Paste: [[2505.16535]] SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion(https://arxiv.org/abs/2505.16535)
Keywords: robust, interpretability, diffusion, transformer, generative
Abstract: We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.

Title: Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

Authors: Ercong Nie, Helmut Schmid, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16538
Pdf URL: https://arxiv.org/pdf/2505.16538
Copy Paste: [[2505.16538]] Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models(https://arxiv.org/abs/2505.16538)
Keywords: robust, interpretability, large language model
Abstract: Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.

Title: TextureSAM: Towards a Texture Aware Foundation Model for Segmentation

Authors: Inbal Cohen, Boaz Meivar, Peihan Tu, Shai Avidan, Gal Oren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16540
Pdf URL: https://arxiv.org/pdf/2505.16540
Copy Paste: [[2505.16540]] TextureSAM: Towards a Texture Aware Foundation Model for Segmentation(https://arxiv.org/abs/2505.16540)
Keywords: segmentation
Abstract: Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large-scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM's bias toward semantics over textures and introduce a new texture-aware foundation model, TextureSAM, which performs superior segmentation in texture-dominant scenarios. To achieve this, we employ a novel fine-tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture-alternation of the ADE20K dataset, we guide TextureSAM to prioritize texture-defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM-2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture-based segmentation datasets. The code and texture-augmented dataset will be publicly available.

Title: Incremental Sequence Classification with Temporal Consistency

Authors: Lucas Maystre, Gabriel Barello, Tudor Berariu, Aleix Cambray, Rares Dolga, Alvaro Ortega Gonzalez, Andrei Nica, David Barber
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16548
Pdf URL: https://arxiv.org/pdf/2505.16548
Copy Paste: [[2505.16548]] Incremental Sequence Classification with Temporal Consistency(https://arxiv.org/abs/2505.16548)
Keywords: large language model
Abstract: We address the problem of incremental sequence classification, where predictions are updated as new elements in the sequence are revealed. Drawing on temporal-difference learning from reinforcement learning, we identify a temporal-consistency condition that successive predictions should satisfy. We leverage this condition to develop a novel loss function for training incremental sequence classifiers. Through a concrete example, we demonstrate that optimizing this loss can offer substantial gains in data efficiency. We apply our method to text classification tasks and show that it improves predictive accuracy over competing approaches on several benchmark datasets. We further evaluate our approach on the task of verifying large language model generations for correctness in grade-school math problems. Our results show that models trained with our method are better able to distinguish promising generations from unpromising ones after observing only a few tokens.

Title: Towards Coordinate- and Dimension-Agnostic Machine Learning for Partial Differential Equations

Authors: Trung V. Phan, George A. Kevrekidis, Soledad Villar, Yannis G. Kevrekidis, Juan M. Bello-Rivas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16549
Pdf URL: https://arxiv.org/pdf/2505.16549
Copy Paste: [[2505.16549]] Towards Coordinate- and Dimension-Agnostic Machine Learning for Partial Differential Equations(https://arxiv.org/abs/2505.16549)
Keywords: diffusion
Abstract: The machine learning methods for data-driven identification of partial differential equations (PDEs) are typically defined for a given number of spatial dimensions and a choice of coordinates the data have been collected in. This dependence prevents the learned evolution equation from generalizing to other spaces. In this work, we reformulate the problem in terms of coordinate- and dimension-independent representations, paving the way toward what we call "spatially liberated" PDE learning. To this end, we employ a machine learning approach to predict the evolution of scalar field systems expressed in the formalism of exterior calculus, which is coordinate-free and immediately generalizes to arbitrary dimensions by construction. We demonstrate the performance of this approach in the FitzHugh-Nagumo and Barkley reaction-diffusion models, as well as the Patlak-Keller-Segel model informed by in-situ chemotactic bacteria observations. We provide extensive numerical experiments that demonstrate that our approach allows for seamless transitions across various spatial contexts. We show that the field dynamics learned in one space can be used to make accurate predictions in other spaces with different dimensions, coordinate systems, boundary conditions, and curvatures.

Title: Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Authors: Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Ruihua Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16552
Pdf URL: https://arxiv.org/pdf/2505.16552
Copy Paste: [[2505.16552]] Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains(https://arxiv.org/abs/2505.16552)
Keywords: large language model
Abstract: Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
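
The embedding-merging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the merge is an element-wise mean over each group of consecutive token embeddings; the abstract does not state the operator, and `merge_embeddings` and the sampling range are hypothetical names and values chosen only for illustration.

```python
import random


def merge_embeddings(embeddings, factor):
    """Merge each run of `factor` consecutive token embeddings into one vector.

    Assumption: the merge is an element-wise mean (the abstract does not
    specify the operator). A shorter final group is averaged over its own length.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


# Toy sequence of five 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]

# The abstract says the factor is sampled from a predefined range during
# training; the range 1..4 used here is illustrative, not taken from the paper.
factor = random.randint(1, 4)
compressed = merge_embeddings(tokens, factor)
```

A larger factor yields a shorter compressed sequence, which is the sense in which the desired compression factor trades reasoning-chain length against granularity.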

Title: CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

Authors: Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, Li Shen
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16559
Pdf URL: https://arxiv.org/pdf/2505.16559
Copy Paste: [[2505.16559]] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning(https://arxiv.org/abs/2505.16559)
Keywords: defense, attack, large language model
Abstract: Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse--effectively forcing the model to "unlearn everything"--specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that attackers exploit, tackling the core issue unaddressed by selective unlearning. We introduce the Collapse Trap (CTRAP) as a practical mechanism to implement this concept conditionally. Embedded during alignment, CTRAP pre-configures the model's reaction to subsequent fine-tuning dynamics. If updates during fine-tuning constitute a persistent attempt to reverse safety alignment, the pre-configured trap triggers a progressive degradation of the model's core language modeling abilities, ultimately rendering it inert and useless for the attacker. Crucially, this collapse mechanism remains dormant during benign fine-tuning, ensuring the model's utility and general capabilities are preserved for legitimate users. Extensive empirical results demonstrate that CTRAP effectively counters harmful fine-tuning risks across various LLMs and attack settings, while maintaining high performance in benign scenarios. Our code is available at this https URL.

Title: Auto-nnU-Net: Towards Automated Medical Image Segmentation

Authors: Jannis Becktepe, Leona Hennig, Steffen Oeltze-Jafra, Marius Lindauer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16561
Pdf URL: https://arxiv.org/pdf/2505.16561
Copy Paste: [[2505.16561]] Auto-nnU-Net: Towards Automated Medical Image Segmentation(https://arxiv.org/abs/2505.16561)
Keywords: segmentation
Abstract: Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-related MIS-framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy with the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and is on par on the other datasets while maintaining practical resource requirements. Our code is available at this https URL.

Title: A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices

Authors: Chen Gong, Rui Xing, Zhenzhe Zheng, Fan Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16563
Pdf URL: https://arxiv.org/pdf/2505.16563
Copy Paste: [[2505.16563]] A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices(https://arxiv.org/abs/2505.16563)
Keywords: privacy
Abstract: The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by the under-utilization of on-device data, due to low training throughput, limited storage and diverse data importance. To improve data resource utilization, we propose a two-stage data selection framework {\sf Titan} to select the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, {\sf Titan} filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy to identify the data batch with the highest model performance improvement to the current training round. To further enhance time-and-resource efficiency, {\sf Titan} leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate {\sf Titan} on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that {\sf Titan} achieves up to $43\%$ reduction in training time and $6.2\%$ increase in final accuracy with minor system overhead, such as data processing delay, memory footprint and energy consumption.

Title: M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Authors: Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16565
Pdf URL: https://arxiv.org/pdf/2505.16565
Copy Paste: [[2505.16565]] M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion(https://arxiv.org/abs/2505.16565)
Keywords: diffusion
Abstract: We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.

Title: ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Authors: Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16566
Pdf URL: https://arxiv.org/pdf/2505.16566
Copy Paste: [[2505.16566]] ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts(https://arxiv.org/abs/2505.16566)
Keywords: large language model
Abstract: Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce \texttt{ScholarBench}, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. \texttt{ScholarBench} targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, \texttt{ScholarBench} evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

Title: Finetuning-Activated Backdoors in LLMs

Authors: Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.16567
Pdf URL: https://arxiv.org/pdf/2505.16567
Copy Paste: [[2505.16567]] Finetuning-Activated Backdoors in LLMs(https://arxiv.org/abs/2505.16567)
Keywords: secure, security, attack, robust, large language model
Abstract: Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors. In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models. At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. Additionally, we show that FAB-backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler). Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.

Title: URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

Authors: Dongyang Fan, Vinko Sabolčec, Martin Jaggi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16570
Pdf URL: https://arxiv.org/pdf/2505.16570
Copy Paste: [[2505.16570]] URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training(https://arxiv.org/abs/2505.16570)
Keywords: large language model
Abstract: Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
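
The phrase "in a classifier-free guidance fashion" can be illustrated with the standard guidance rule applied to next-token logits. This is a minimal sketch of the generic rule, not the authors' implementation; the function names, the toy logits, and the idea that the conditional pass sees a metadata prefix while the unconditional pass does not are assumptions made for illustration.

```python
import math


def guided_logits(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance: move from the unconditional logits
    toward (scale <= 1) or past (scale > 1) the conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a 3-token vocabulary.
cond = [2.0, 0.0, 1.0]    # assumed: model run with a metadata prefix (e.g. a topic tag)
uncond = [1.0, 1.0, 1.0]  # assumed: same model run without the prefix

steered = guided_logits(cond, uncond, scale=2.0)
probs = softmax(steered)
```

With `scale=0` the rule reduces to context-free generation and with `scale=1` to ordinary conditional generation; larger values push the output distribution further toward tokens favored by the metadata context.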

Title: EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions

Authors: Spencer Hong, Meng Luo, Xinyi Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16576
Pdf URL: https://arxiv.org/pdf/2505.16576
Copy Paste: [[2505.16576]] EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions(https://arxiv.org/abs/2505.16576)
Keywords: large language model
Abstract: Determining the veracity of atomic claims is an imperative component of many recently proposed fact-checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi-agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi-agent framework.

Title: Large Language Model-Empowered Interactive Load Forecasting

Authors: Yu Zuo, Dalin Qin, Yi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16577
Pdf URL: https://arxiv.org/pdf/2505.16577
Copy Paste: [[2505.16577]] Large Language Model-Empowered Interactive Load Forecasting(https://arxiv.org/abs/2505.16577)
Keywords: large language model
Abstract: The growing complexity of power systems has made accurate load forecasting more important than ever. An increasing number of advanced load forecasting methods have been developed. However, the static design of current methods offers no mechanism for human-model interaction. As the primary users of forecasting models, system operators often find it difficult to understand and apply these advanced models, which typically requires expertise in artificial intelligence (AI). This also prevents them from incorporating their experience and real-world contextual understanding into the forecasting process. Recent breakthroughs in large language models (LLMs) offer a new opportunity to address this issue. By leveraging their natural language understanding and reasoning capabilities, we propose an LLM-based multi-agent collaboration framework to bridge the gap between human operators and forecasting models. A set of specialized agents is designed to perform different tasks in the forecasting workflow and collaborate via a dedicated communication mechanism. This framework embeds interactive mechanisms throughout the load forecasting pipeline, reducing the technical threshold for non-expert users and enabling the integration of human experience. Our experiments demonstrate that the interactive load forecasting accuracy can be significantly improved when users provide proper insight in key stages. Our cost analysis shows that the framework remains affordable, making it practical for real-world deployment.

Title: O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Authors: Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16582
Pdf URL: https://arxiv.org/pdf/2505.16582
Copy Paste: [[2505.16582]] O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering(https://arxiv.org/abs/2505.16582)
Keywords: large language model
Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or providing non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

Title: Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

Authors: Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16591
Pdf URL: https://arxiv.org/pdf/2505.16591
Copy Paste: [[2505.16591]] Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering(https://arxiv.org/abs/2505.16591)
Keywords: robust, large language model
Abstract: We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLMs and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at this https URL.

Title: Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

Authors: Vignesh Gopinathan, Urs Zimmermann, Michael Arnold, Matthias Rottmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16594
Pdf URL: https://arxiv.org/pdf/2505.16594
Copy Paste: [[2505.16594]] Temporal Object Captioning for Street Scene Videos from LiDAR Tracks(https://arxiv.org/abs/2505.16594)
Keywords: extraction
Abstract: Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by a template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.

Title: Decoupled Geometric Parameterization and its Application in Deep Homography Estimation

Authors: Yao Huang, Si-Yuan Cao, Yaqing Ding, Hao Yin, Shibin Xie, Shuting Wang, Zhijun Fang, Jiachun Wang, Shen Cai, Junchi Yan, Shuhan Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16599
Pdf URL: https://arxiv.org/pdf/2505.16599
Copy Paste: [[2505.16599]] Decoupled Geometric Parameterization and its Application in Deep Homography Estimation(https://arxiv.org/abs/2505.16599)
Keywords: interpretability
Abstract: Planar homography, with eight degrees of freedom (DOFs), is fundamental in numerous computer vision tasks. While the positional offsets of four corners are widely adopted (especially in neural network predictions), this parameterization lacks geometric interpretability and typically requires solving a linear system to compute the homography matrix. This paper presents a novel geometric parameterization of homographies, leveraging the similarity-kernel-similarity (SKS) decomposition for projective transformations. Two independent sets of four geometric parameters are decoupled: one for a similarity transformation and the other for the kernel transformation. Additionally, the geometric interpretation linearly relating the four kernel transformation parameters to angular offsets is derived. Our proposed parameterization allows for direct homography estimation through matrix multiplication, eliminating the need for solving a linear system, and achieves performance comparable to the four-corner positional offsets in deep homography estimation.
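
For context, the linear system that the four-corner parameterization typically requires can be sketched as below. This is a minimal illustration of the standard four-point solve, not the paper's SKS-based method; the function name `homography_from_corner_offsets` is hypothetical, and the sketch assumes the homography can be normalized so that its bottom-right entry equals 1.

```python
import numpy as np

def homography_from_corner_offsets(corners, offsets):
    """Recover a 3x3 homography from four corners and their positional offsets.

    Assumes h33 = 1 (illustrative normalization) and that no three corners
    are collinear, so the 8x8 system has a unique solution.
    """
    src = np.asarray(corners, dtype=float)
    dst = src + np.asarray(offsets, dtype=float)
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # From u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1)
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        # From v = (h21*x + h22*y + h23) / (h31*x + h32*y + 1)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows), np.array(rhs))
    # Append the fixed h33 = 1 and reshape into matrix form.
    return np.append(h, 1.0).reshape(3, 3)
```

Zero offsets on a unit square yield the identity matrix, and a constant offset on all four corners yields a pure translation.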

Title: MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Authors: Bohan Zhou, Yi Zhan, Zhongbin Zhang, Zongqing Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16602
Pdf URL: https://arxiv.org/pdf/2505.16602
Copy Paste: [[2505.16602]] MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation(https://arxiv.org/abs/2505.16602)
Keywords: robust
Abstract: Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.

Title: From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment

Authors: Jing Ye, Lu Xiang, Yaping Zhang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16610
Pdf URL: https://arxiv.org/pdf/2505.16610
Copy Paste: [[2505.16610]] From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment(https://arxiv.org/abs/2505.16610)
Keywords: large language model
Abstract: Effective emotional support hinges on understanding users' emotions and needs to provide meaningful comfort during multi-turn interactions. Large Language Models (LLMs) show great potential for expressing empathy; however, they often deliver generic and one-size-fits-all responses that fail to address users' specific needs. To tackle this issue, we propose a self-evolution framework designed to help LLMs improve their responses to better align with users' implicit preferences concerning user profiles (personalities), emotional states, and specific situations. Our framework consists of two distinct phases: (1) Emotional Support Experience Acquisition, where LLMs are fine-tuned on limited emotional support conversation data to provide basic support, and (2) Self-Improvement for Personalized Emotional Support, where LLMs leverage self-reflection and self-refinement to generate personalized responses. Through iterative direct preference optimization between the pre- and post-refined responses, our model generates responses that reflect a better understanding of the user's implicit preferences. Extensive experiments and evaluations demonstrate that our method significantly enhances the model's performance in emotional support, reducing unhelpful responses and minimizing discrepancies between user preferences and model outputs.

Title: Steering Large Language Models for Machine Translation Personalization

Authors: Daniel Scalena, Gabriele Sarti, Arianna Bisazza, Elisabetta Fersini, Malvina Nissim
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16612
Pdf URL: https://arxiv.org/pdf/2505.16612
Copy Paste: [[2505.16612]] Steering Large Language Models for Machine Translation Personalization(https://arxiv.org/abs/2505.16612)
Keywords: large language model
Abstract: High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework exploiting latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding that model layers with a relevant impact for personalization are affected similarly by multi-shot prompting and our steering method, suggesting a similar mechanism at play.

Title: Energy Consumption Framework and Analysis of Post-Quantum Key-Generation on Embedded Devices

Authors: J Cameron Patterson, William J Buchanan, Callum Turino
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16614
Pdf URL: https://arxiv.org/pdf/2505.16614
Copy Paste: [[2505.16614]] Energy Consumption Framework and Analysis of Post-Quantum Key-Generation on Embedded Devices(https://arxiv.org/abs/2505.16614)
Keywords: robust
Abstract: The emergence of quantum computing and Shor's algorithm necessitates an imminent shift from current public key cryptography techniques to post-quantum robust techniques. NIST has responded by standardising Post-Quantum Cryptography (PQC) algorithms, with ML-KEM (FIPS-203) slated to replace ECDH (Elliptic Curve Diffie-Hellman) for key exchange. A key practical concern for PQC adoption is energy consumption. This paper introduces a new framework for measuring the PQC energy consumption on a Raspberry Pi when performing key generation. The framework uses both available traditional methods and the newly standardised ML-KEM algorithm via the commonly utilised OpenSSL library.

Title: CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models

Authors: Benjamin Herdeanu, Juan Nathaniel, Carla Roesch, Jatan Buch, Gregor Ramien, Johannes Haux, Pierre Gentine
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16620
Pdf URL: https://arxiv.org/pdf/2505.16620
Copy Paste: [[2505.16620]] CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models(https://arxiv.org/abs/2505.16620)
Keywords: robust
Abstract: Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on this https URL.

Title: Background Matters: A Cross-view Bidirectional Modeling Framework for Semi-supervised Medical Image Segmentation

Authors: Luyang Cao, Jianwei Li, Yinghuan Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16625
Pdf URL: https://arxiv.org/pdf/2505.16625
Copy Paste: [[2505.16625]] Background Matters: A Cross-view Bidirectional Modeling Framework for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2505.16625)
Keywords: segmentation
Abstract: Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at this https URL.
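
The DSC figures quoted above refer to the Dice similarity coefficient. A minimal sketch of how it is commonly computed on binary masks follows; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    overlap = np.logical_and(pred, target).sum()
    return float(2.0 * overlap / total)
```

Under this definition, swapping foreground and background labels generally changes the score, which is one reason foreground and background modeling are not interchangeable views of the same mask.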

Title: Towards Texture- And Shape-Independent 3D Keypoint Estimation in Birds

Authors: Valentin Schmuker, Alex Hoi Hang Chan, Bastian Goldluecke, Urs Waldmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16633
Pdf URL: https://arxiv.org/pdf/2505.16633
Copy Paste: [[2505.16633]] Towards Texture- And Shape-Independent 3D Keypoint Estimation in Birds(https://arxiv.org/abs/2505.16633)
Keywords: robust, segmentation
Abstract: In this paper, we present a texture-independent approach to estimate and track 3D joint positions of multiple pigeons. For this purpose, we build upon the existing 3D-MuPPET framework, which estimates and tracks the 3D poses of up to 10 pigeons using a multi-view camera setup. We extend this framework by using a segmentation method that generates silhouettes of the individuals, which are then used to estimate 2D keypoints. Following 3D-MuPPET, these 2D keypoints are triangulated to infer 3D poses, and identities are matched in the first frame and tracked in 2D across subsequent frames. Our proposed texture-independent approach achieves comparable accuracy to the original texture-dependent 3D-MuPPET framework. Additionally, we explore our approach's applicability to other bird species. To do that, we infer the 2D joint positions of four bird species without additional fine-tuning of the model trained on pigeons and obtain promising preliminary results. Thus, we think that our approach serves as a solid foundation and inspires the development of more robust and accurate texture-independent pose estimation frameworks.

Title: SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Authors: Wenjie Yang, Mao Zheng, Mingyang Song, Zheng Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16637
Pdf URL: https://arxiv.org/pdf/2505.16637
Copy Paste: [[2505.16637]] SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation(https://arxiv.org/abs/2505.16637)
Keywords: large language model
Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

Title: Reconsidering Fairness Through Unawareness from the Perspective of Model Multiplicity

Authors: Benedikt Höltgen, Nuria Oliver
Subjects: cs.LG, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16638
Pdf URL: https://arxiv.org/pdf/2505.16638
Copy Paste: [[2505.16638]] Reconsidering Fairness Through Unawareness from the Perspective of Model Multiplicity(https://arxiv.org/abs/2505.16638)
Keywords: protect, fair
Abstract: Fairness through Unawareness (FtU) describes the idea that discrimination against demographic groups can be avoided by not considering group membership in the decisions or predictions. This idea has long been criticized in the machine learning literature as not being sufficient to ensure fairness. In addition, the use of additional features is typically thought to increase the accuracy of the predictions for all groups, so that FtU is sometimes thought to be detrimental to all groups. In this paper, we show both theoretically and empirically that FtU can reduce algorithmic discrimination without necessarily reducing accuracy. We connect this insight with the literature on Model Multiplicity, to which we contribute with novel theoretical and empirical results. Furthermore, we illustrate how, in a real-life application, FtU can contribute to the deployment of more equitable policies without losing efficacy. Our findings suggest that FtU is worth considering in practical applications, particularly in high-risk scenarios, and that the use of protected attributes such as gender in predictive models should be accompanied by a clear and well-founded justification.

Title: BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization

Authors: Xueyang Zhou, Guiyao Tie, Guowen Zhang, Hechang Wang, Pan Zhou, Lichao Sun
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16640
Pdf URL: https://arxiv.org/pdf/2505.16640
Copy Paste: [[2505.16640]] BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization(https://arxiv.org/abs/2505.16640)
Keywords: secure, security, attack, robust, steal
Abstract: Vision-Language-Action (VLA) models have advanced robotic control by enabling end-to-end decision-making directly from multimodal inputs. However, their tightly coupled architectures expose novel security vulnerabilities. Unlike traditional adversarial perturbations, backdoor attacks represent a stealthier, persistent, and practically significant threat, particularly under the emerging Training-as-a-Service paradigm, but remain largely unexplored in the context of VLA models. To address this gap, we propose BadVLA, a backdoor attack method based on Objective-Decoupled Optimization, which for the first time exposes the backdoor vulnerabilities of VLA models. Specifically, it consists of a two-stage process: (1) explicit feature-space separation to isolate trigger representations from benign inputs, and (2) conditional control deviations that activate only in the presence of the trigger, while preserving clean-task performance. Empirical results on multiple VLA benchmarks demonstrate that BadVLA consistently achieves near-100% attack success rates with minimal impact on clean task accuracy. Further analyses confirm its robustness against common input perturbations, task transfers, and model fine-tuning, underscoring critical security vulnerabilities in current VLA deployments. Our work offers the first systematic investigation of backdoor vulnerabilities in VLA models, highlighting an urgent need for secure and trustworthy embodied model design practices. We have released the project page at this https URL.

Title: From Evaluation to Defense: Advancing Safety in Video Large Language Models

Authors: Yiwei Sun, Peiqi Jiang, Chuanbin Liu, Luohao Lin, Zhiying Lu, Hongtao Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16643
Pdf URL: https://arxiv.org/pdf/2505.16643
Copy Paste: [[2505.16643]] From Evaluation to Defense: Advancing Safety in Video Large Language Models(https://arxiv.org/abs/2505.16643)
Keywords: defense, attack, large language model
Abstract: While the safety risks of image-based large language models have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for Video LLM safety, which comprises 77,646 video-query pairs and spans 19 principal risk categories across 10 language communities. We reveal that integrating video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations: (1) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (2) Then, Safety-Guided GRPO enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. These components synergize to shift safety alignment from passive harm recognition to active reasoning. The resulting framework achieves a 65.1% improvement on VSB-Eval-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our code is available in the supplementary materials. Warning: This paper contains examples of harmful language and videos, and reader discretion is recommended.

Title: Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models

Authors: Sushant Gautam, Michael A. Riegler, Pål Halvorsen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16647
Pdf URL: https://arxiv.org/pdf/2505.16647
Copy Paste: [[2505.16647]] Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models(https://arxiv.org/abs/2505.16647)
Keywords: robust
Abstract: We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings, demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at this https URL.

Title: Collaboration among Multiple Large Language Models for Medical Question Answering

Authors: Kexin Shang, Chia-Hsuan Chang, Christopher C. Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16648
Pdf URL: https://arxiv.org/pdf/2505.16648
Copy Paste: [[2505.16648]] Collaboration among Multiple Large Language Models for Medical Question Answering(https://arxiv.org/abs/2505.16648)
Keywords: large language model
Abstract: Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards eliciting a synergistic effect from the expertise and background of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to boost the reasoning ability of all LLMs as well as reduce their divergence across questions. We also measure an LLM's confidence when it is confronted with adversarial opinions from other LLMs and observe a concurrence between an LLM's confidence and its prediction accuracy.

Title: Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images

Authors: Michael Neri, Sara Baldoni
Subjects: cs.CV, cs.CR, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2505.16650
Pdf URL: https://arxiv.org/pdf/2505.16650
Copy Paste: [[2505.16650]] Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images(https://arxiv.org/abs/2505.16650)
Keywords: security
Abstract: Due to the recent increase in the number of connected devices, the need to promptly detect security issues is emerging. Moreover, the high number of communication flows creates the necessity of processing huge amounts of data. Furthermore, the connected devices are heterogeneous in nature, having different computational capacities. For this reason, in this work we propose an image-based representation of network traffic that provides a compact summary of current network conditions over 1-second time windows. The proposed representation highlights the presence of anomalies, thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach which effectively detects the presence of anomalies. The code and the dataset are available at this https URL.

Title: Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Authors: Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zelin Peng, Zhiwei Yang, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16652
Pdf URL: https://arxiv.org/pdf/2505.16652
Copy Paste: [[2505.16652]] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding(https://arxiv.org/abs/2505.16652)
Keywords: large language model
Abstract: Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
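
As background for the causal mask mentioned above, the following is a minimal sketch of a standard causal mask applied to attention scores. It does not implement FarSight's attention register or its positional encoding; it only shows the baseline in which the upper triangle of the mask is fully blocked, which is the region the abstract says the register structure occupies. The function names are hypothetical.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: token i may attend to tokens 0..i only."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores with masked positions set to zero weight."""
    scores = np.where(mask, np.asarray(scores, dtype=float), -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is always
    # unmasked, so every row has at least one finite entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With this baseline, every entry above the diagonal receives exactly zero attention weight, so any mechanism that uses the upper triangle must modify the mask itself.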

Title: SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images

Authors: Kaiyu Guo, Tan Pan, Chen Jiang, Zijian Wang, Brian C. Lovell, Limei Han, Yuan Cheng, Mahsa Baktashmotlagh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16659
Pdf URL: https://arxiv.org/pdf/2505.16659
Copy Paste: [[2505.16659]] SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images(https://arxiv.org/abs/2505.16659)
Keywords: privacy
Abstract: Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a large language model, under the assumption that different anomalies in medical images may share common radiological signs in each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) Radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) Aligned signs are selected further to mitigate the effect of the under-fitting and uncertain-sample issues caused by limited medical data, employing an automatic sign selection strategy at inference. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.

Title: A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

Authors: Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki, Shinnosuke Ono
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16661
Pdf URL: https://arxiv.org/pdf/2505.16661
Copy Paste: [[2505.16661]] A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP(https://arxiv.org/abs/2505.16661)
Keywords: secure
Abstract: We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at this https URL.

Title: End-to-End Framework for Predicting the Remaining Useful Life of Lithium-Ion Batteries

Authors: Khoa Tran, Tri Le, Bao Huynh, Hung-Cuong Trinh, Vy-Rin Nguyen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16664
Pdf URL: https://arxiv.org/pdf/2505.16664
Copy Paste: [[2505.16664]] End-to-End Framework for Predicting the Remaining Useful Life of Lithium-Ion Batteries(https://arxiv.org/abs/2505.16664)
Keywords: robust
Abstract: Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes a RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal processing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed based on current and capacity signals. Alongside original capacity, voltage and current, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Networks (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) modules. This architecture is designed to capture both local signal characteristics and long-range temporal dependencies while modeling the continuous-time dynamics of battery degradation. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available large-scale datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its strong potential for real-world RUL prediction applications.

Title: BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models

Authors: Xiaobei Yan, Yiming Li, Zhaoxin Fan, Han Qiu, Tianwei Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16670
Pdf URL: https://arxiv.org/pdf/2505.16670
Copy Paste: [[2505.16670]] BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models(https://arxiv.org/abs/2505.16670)
Keywords: attack, large language model
Abstract: Large language models (LLMs) have shown impressive capabilities across a wide range of applications, but their ever-increasing size and resource demands make them vulnerable to inference cost attacks, where attackers induce victim LLMs to generate the longest possible output content. In this paper, we revisit existing inference cost attacks and reveal that these methods can hardly produce large-scale malicious effects since they are self-targeting, where attackers are also the users and therefore have to execute attacks solely through the inputs, whose generated content will be charged by LLMs and can only directly influence themselves. Motivated by these findings, this paper introduces a new type of inference cost attack (dubbed 'bit-flip inference cost attack') that targets the victim model itself rather than its inputs. Specifically, we design a simple yet effective method (dubbed 'BitHydra') to effectively flip critical bits of model parameters. This process is guided by a loss function designed to suppress the end-of-sequence token's probability with an efficient critical bit search algorithm, thus explicitly defining the attack objective and enabling effective optimization. We evaluate our method on 11 LLMs ranging from 1.5B to 14B parameters under both int8 and float16 settings. Experimental results demonstrate that with just 4 search samples and as few as 3 bit flips, BitHydra can force 100% of test prompts to reach the maximum generation length (e.g., 2048 tokens) on representative LLMs such as LLaMA3, highlighting its efficiency, scalability, and strong transferability across unseen inputs.

Title: R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

Authors: Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16673
Pdf URL: https://arxiv.org/pdf/2505.16673
Copy Paste: [[2505.16673]] R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO(https://arxiv.org/abs/2505.16673)
Keywords: large language model
Abstract: In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at this https URL.

Title: Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge

Authors: Marcella Astrid, Abdelrahman Shabayek, Djamila Aouada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16674
Pdf URL: https://arxiv.org/pdf/2505.16674
Copy Paste: [[2505.16674]] Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge(https://arxiv.org/abs/2505.16674)
Keywords: robust
Abstract: Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of finetuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained with the battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.

Title: Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds

Authors: Jordan Dotzel, Tony Montes, Mohamed S. Abdelfattah, Zhiru Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16679
Pdf URL: https://arxiv.org/pdf/2505.16679
Copy Paste: [[2505.16679]] Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds(https://arxiv.org/abs/2505.16679)
Keywords: generative
Abstract: Traditional methods for 3D object compression operate only on structural information within the object vertices, polygons, and textures. These methods are effective at compression rates up to 10x for standard object sizes but quickly deteriorate at higher compression rates with texture artifacts, low-polygon counts, and mesh gaps. In contrast, semantic compression ignores structural information and operates directly on the core concepts to push to extreme levels of compression. In addition, it uses natural language as its storage format, which makes it natively human-readable and a natural fit for emerging applications built around large-scale, collaborative projects within augmented and virtual reality. It deprioritizes structural information like location, size, and orientation and predicts the missing information with state-of-the-art deep generative models. In this work, we construct a pipeline for 3D semantic compression from public generative models and explore the quality-compression frontier for 3D object compression. We apply this pipeline to achieve rates as high as 105x for 3D objects taken from the Objaverse dataset and show that semantic compression can outperform traditional methods in the important quality-preserving region around 100x compression.

Title: Learning Genomic Structure from $k$-mers

Authors: Filip Thor, Carl Nettelblad
Subjects: cs.LG, q-bio.GN, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.16680
Pdf URL: https://arxiv.org/pdf/2505.16680
Copy Paste: [[2505.16680]] Learning Genomic Structure from $k$-mers(https://arxiv.org/abs/2505.16680)
Keywords: robust
Abstract: Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of $k$-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the $E.\ coli$ genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we illustrate the potential of using this type of model for metagenomic species identification. We show how incorporating a domain-specific noise model can enhance embedding robustness, and how a supervised contrastive learning setting can be adopted when a linear reference genome is available, by introducing a distance thresholding parameter $\Gamma$. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly using specialized algorithms. Small prediction heads based on a pre-trained embedding are shown to perform on par with BWA-aln, the current gold standard approach for aDNA mapping, in terms of accuracy and runtime for short genomes. Given the method's favorable scaling properties with respect to total genome size, inference using our approach is highly promising for metagenomic applications and for mapping to genomes comparable in size to the human genome.

Title: One-Step Diffusion-Based Image Compression with Semantic Distillation

Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16687
Pdf URL: https://arxiv.org/pdf/2505.16687
Copy Paste: [[2505.16687]] One-Step Diffusion-Based Image Compression with Semantic Distillation(https://arxiv.org/abs/2505.16687)
Keywords: diffusion, generative
Abstract: While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.

Title: Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Authors: Beier Luo, Shuoyuan Wang, Yixuan Li, Hongxin Wei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16690
Pdf URL: https://arxiv.org/pdf/2505.16690
Copy Paste: [[2505.16690]] Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator(https://arxiv.org/abs/2505.16690)
Keywords: large language model
Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$\%$ on common benchmarks.
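The agreement-only calibration idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the squared-gap loss between the two models' top-class confidences, the grid search over candidate temperatures, and all names (`fit_tau_on_agreement`, `plm_logits`, `polm_logits`) are assumptions made for illustration. The abstract only states that confidence is aligned via temperature scaling using agreement examples.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax (numerically stabilised)."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def fit_tau_on_agreement(plm_logits, polm_logits, taus):
    """Pick the temperature for the post-trained model (PoLM) using only
    examples where it predicts the same class as the pre-trained model (PLM).

    Assumed loss: mean squared gap between the PoLM's scaled top-class
    confidence and the PLM's top-class confidence.
    """
    agree = [(p, q) for p, q in zip(plm_logits, polm_logits)
             if argmax(p) == argmax(q)]
    if not agree:
        raise ValueError("no agreement examples to calibrate on")

    def loss(tau):
        gaps = [(max(softmax(q, tau)) - max(softmax(p))) ** 2 for p, q in agree]
        return sum(gaps) / len(gaps)

    return min(taus, key=loss), len(agree)

# Toy data: the PoLM logits are the PLM logits doubled (over-confident),
# except for the last example, where the two models disagree.
plm = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [0.0, 1.0, 2.0]]
polm = [[4.0, 2.0, 0.0], [1.0, 3.0, 0.0], [6.0, 0.0, 0.0]]
tau, n_agree = fit_tau_on_agreement(plm, polm, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
```

In this toy setup the disagreement example is dropped before fitting, so it cannot push the selected temperature upward.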

Title: Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16694
Pdf URL: https://arxiv.org/pdf/2505.16694
Copy Paste: [[2505.16694]] Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence(https://arxiv.org/abs/2505.16694)
Keywords: transformer, large language model
Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer's ICL ability.

Title: Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs

Authors: Zeping Yu, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16703
Pdf URL: https://arxiv.org/pdf/2505.16703
Copy Paste: [[2505.16703]] Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs(https://arxiv.org/abs/2505.16703)
Keywords: large language model
Abstract: Although multimodal large language models (MLLMs) have achieved impressive performance, the multimodal instruction tuning stage often causes catastrophic forgetting of the base LLM's language ability, even in strong models like Llama3. To address this, we propose Locate-then-Merge, a training-free parameter fusion framework that first locates important parameters and then selectively merges them. We further introduce Neuron-Fusion, a neuron-level strategy that preserves the influence of neurons with large parameter shifts--neurons likely responsible for newly acquired visual capabilities--while attenuating the influence of neurons with smaller changes that likely encode general-purpose language skills. This design enables better retention of visual adaptation while mitigating language degradation. Experiments on 13 benchmarks across both language and visual tasks show that Neuron-Fusion consistently outperforms existing model merging methods. Further analysis reveals that our method effectively reduces context hallucination in generation.

Title: An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations

Authors: Seonghwan Park, Jueun Mun, Donghyun Oh, Namhoon Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16705
Pdf URL: https://arxiv.org/pdf/2505.16705
Copy Paste: [[2505.16705]] An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations(https://arxiv.org/abs/2505.16705)
Keywords: robust, interpretability
Abstract: Concept bottleneck models (CBMs) ensure interpretability by decomposing predictions into human-interpretable concepts. Yet the annotations used for training CBMs that enable this transparency are often noisy, and the impact of such corruption is not well understood. In this study, we present the first systematic study of noise in CBMs and show that even moderate corruption simultaneously impairs prediction performance, interpretability, and intervention effectiveness. Our analysis identifies a susceptible subset of concepts whose accuracy declines far more than the average gap between noisy and clean supervision and whose corruption accounts for most performance loss. To mitigate this vulnerability we propose a two-stage framework. During training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts. During inference, where clean labels are unavailable, we rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility. Theoretical analysis and extensive ablations elucidate why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts, providing a principled basis that preserves both interpretability and resilience in the presence of noise.
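The inference-time step in this abstract (rank concepts by predictive entropy, correct only the most uncertain) can be sketched as below. This is a minimal illustration rather than the authors' code: it assumes binary concepts with predicted probabilities, and the function names, the budget `k`, and the `corrections` mapping (standing in for whatever source supplies corrected concept values) are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) concept prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def most_uncertain(concept_probs, k):
    """Indices of the k concepts with the highest predictive entropy."""
    ranked = sorted(range(len(concept_probs)),
                    key=lambda i: binary_entropy(concept_probs[i]),
                    reverse=True)
    return ranked[:k]

def intervene(concept_probs, corrections, k):
    """Overwrite only the k most uncertain concept predictions.

    `corrections` is a hypothetical mapping from concept index to a
    corrected value (0.0 or 1.0); concepts outside the top k are untouched.
    """
    fixed = list(concept_probs)
    for i in most_uncertain(concept_probs, k):
        if i in corrections:
            fixed[i] = corrections[i]
    return fixed

probs = [0.97, 0.52, 0.10, 0.45, 0.88]
chosen = most_uncertain(probs, 2)
updated = intervene(probs, {1: 1.0, 3: 0.0}, 2)
```

Predictions near 0.5 carry the highest entropy, so the budget of two corrections goes to the concepts predicted at 0.52 and 0.45, while the confident predictions are left as they are.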

Title: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Authors: Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16707
Pdf URL: https://arxiv.org/pdf/2505.16707
Copy Paste: [[2505.16707]] KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models(https://arxiv.org/abs/2505.16707)
Keywords: generative
Abstract: Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

Title: Training Long-Context LLMs Efficiently via Chunk-wise Optimization

Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16710
Pdf URL: https://arxiv.org/pdf/2505.16710
Copy Paste: [[2505.16710]] Training Long-Context LLMs Efficiently via Chunk-wise Optimization(https://arxiv.org/abs/2505.16710)
Keywords: large language model
Abstract: While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands maximum sequence length from 1K to 16K tokens, while SpaCO accelerates training, running up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at this https URL.

Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Authors: Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16722
Pdf URL: https://arxiv.org/pdf/2505.16722
Copy Paste: [[2505.16722]] Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification(https://arxiv.org/abs/2505.16722)
Keywords: large language model
Abstract: As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at this https URL.

Title: Robust LLM Fingerprinting via Domain-Specific Watermarks

Authors: Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16723
Pdf URL: https://arxiv.org/pdf/2505.16723
Copy Paste: [[2505.16723]] Robust LLM Fingerprinting via Domain-Specific Watermarks(https://arxiv.org/abs/2505.16723)
Keywords: robust, steal, watermark
Abstract: As open-source language models (OSMs) grow more capable and are widely shared and finetuned, ensuring model provenance, i.e., identifying the origin of a given model instance, has become an increasingly important issue. At the same time, existing backdoor-based model fingerprinting techniques often fall short of achieving key requirements of real-world model ownership detection. In this work, we build on the observation that while current open-source model watermarks fail to achieve reliable content traceability, they can be effectively adapted to address the challenge of model provenance. To this end, we introduce the concept of domain-specific watermarking for model fingerprinting. Rather than watermarking all generated content, we train the model to embed watermarks only within specified subdomains (e.g., particular languages or topics). This targeted approach ensures detection reliability, while improving watermark durability and quality under a range of real-world deployment settings. Our evaluations show that domain-specific watermarking enables model fingerprinting with strong statistical guarantees, controllable false positive rates, high detection power, and preserved generation quality. Moreover, we find that our fingerprints are inherently stealthy and naturally robust to real-world variability across deployment scenarios.

Title: Advancing Brainwave Modeling with a Codebook-Based Foundation Model

Authors: Konstantinos Barmpas, Na Lee, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.16724
Pdf URL: https://arxiv.org/pdf/2505.16724
Copy Paste: [[2505.16724]] Advancing Brainwave Modeling with a Codebook-Based Foundation Model(https://arxiv.org/abs/2505.16724)
Keywords: robust
Abstract: Recent advances in large-scale pre-trained Electroencephalogram (EEG) models have shown great promise, driving progress in Brain-Computer Interfaces (BCIs) and healthcare applications. However, despite their success, many existing pre-trained models have struggled to fully capture the rich information content of neural oscillations, a limitation that fundamentally constrains their performance and generalizability across diverse BCI tasks. This limitation is frequently rooted in suboptimal architectural design choices which constrain their representational capacity. In this work, we introduce LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) that incorporates principled improvements grounded in robust signal processing foundations. LaBraM++ demonstrates substantial gains across a variety of tasks, consistently outperforming the architecture it builds on and achieving competitive results when compared to other open-source LBMs. Its superior performance and training efficiency highlight its potential as a strong foundation for future advancements in LBMs.

Title: Masked Conditioning for Deep Generative Models

Authors: Phillip Mueller, Jannik Wiese, Sebastian Mueller, Lars Mikelsons
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16725
Pdf URL: https://arxiv.org/pdf/2505.16725
Copy Paste: [[2505.16725]] Masked Conditioning for Deep Generative Models(https://arxiv.org/abs/2505.16725)
Keywords: diffusion, generative
Abstract: Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally, computational resources are typically limited in practical applications, which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.

Title: Forward-only Diffusion Probabilistic Models

Authors: Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16733
Pdf URL: https://arxiv.org/pdf/2505.16733
Copy Paste: [[2505.16733]] Forward-only Diffusion Probabilistic Models(https://arxiv.org/abs/2505.16733)
Keywords: diffusion, generative
Abstract: This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent linear stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves competitive performance on various image-conditioned (e.g., image restoration) and unconditional generation tasks, demonstrating its effectiveness in generative modelling. Our code is available at this https URL.

Title: Maximum Total Correlation Reinforcement Learning

Authors: Bang You, Puze Liu, Huaping Liu, Jan Peters, Oleg Arenz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16734
Pdf URL: https://arxiv.org/pdf/2505.16734
Copy Paste: [[2505.16734]] Maximum Total Correlation Reinforcement Learning(https://arxiv.org/abs/2505.16734)
Keywords: robust
Abstract: Simplicity is a powerful inductive bias. In reinforcement learning, regularization is used for simpler policies, data augmentation for simpler representations, and sparse reward functions for simpler objectives, all with the underlying motivation to increase generalizability and robustness by focusing on the essentials. Supplementary to these techniques, we investigate how to promote simple behavior throughout the episode. To that end, we introduce a modification of the reinforcement learning problem that additionally maximizes the total correlation within the induced trajectories. We propose a practical algorithm that optimizes all models, including policy and state representation, based on a lower-bound approximation. In simulated robot environments, our method naturally generates policies that induce periodic and compressible trajectories, and that exhibit superior robustness to noise and changes in dynamics compared to baseline methods, while also improving performance in the original tasks.

Title: Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

Authors: Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun
Subjects: cs.LG, cs.AI, cs.CL, cs.CR, math.OC
Abstract URL: https://arxiv.org/abs/2505.16737
Pdf URL: https://arxiv.org/pdf/2505.16737
Copy Paste: [[2505.16737]] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization(https://arxiv.org/abs/2505.16737)
Keywords: large language model
Abstract: The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at this https URL.

Title: Robust Vision-Based Runway Detection through Conformal Prediction and Conformal mAP

Authors: Alya Zouzou, Léo Andéol, Mélanie Ducoffe, Ryma Boumazouza
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16740
Pdf URL: https://arxiv.org/pdf/2505.16740
Copy Paste: [[2505.16740]] Robust Vision-Based Runway Detection through Conformal Prediction and Conformal mAP(https://arxiv.org/abs/2505.16740)
Keywords: robust
Abstract: We explore the use of conformal prediction to provide statistical uncertainty guarantees for runway detection in vision-based landing systems (VLS). Using fine-tuned YOLOv5 and YOLOv6 models on aerial imagery, we apply conformal prediction to quantify localization reliability under user-defined risk levels. We also introduce Conformal mean Average Precision (C-mAP), a novel metric aligning object detection performance with conformal guarantees. Our results show that conformal prediction can improve the reliability of runway detection by quantifying uncertainty in a statistically sound way, increasing safety on-board and paving the way for certification of ML systems in the aerospace domain.
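Illustrative sketch (not taken from the paper): the abstract does not say which conformal procedure or nonconformity score is used, so the following assumes plain split conformal prediction with a scalar per-box localization error as the score. The names conformal_margin and expand_box, the sample scores, and the symmetric box expansion are all hypothetical choices made for illustration.

```python
import math


def conformal_margin(calibration_scores, alpha):
    """Split-conformal threshold for a list of nonconformity scores.

    Returns the k-th smallest calibration score with
    k = ceil((n + 1) * (1 - alpha)), the usual finite-sample correction.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples to certify this risk level.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def expand_box(box, margin):
    """Grow an (x_min, y_min, x_max, y_max) box by `margin` on every side."""
    x_min, y_min, x_max, y_max = box
    return (x_min - margin, y_min - margin, x_max + margin, y_max + margin)


# Hypothetical calibration errors (e.g. worst coordinate error in pixels).
scores = [1.0, 2.5, 0.5, 3.0, 1.5, 2.0, 4.0, 0.8, 1.2]
margin = conformal_margin(scores, alpha=0.25)
conformal_box = expand_box((10.0, 20.0, 30.0, 40.0), margin)
```

Under the exchangeability assumption of split conformal prediction, a box expanded this way would be expected to cover the true box with probability at least 1 - alpha; the risk level alpha plays the role of the user-defined risk level mentioned in the abstract.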

Title: TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Authors: Florentin Beck, William Rudman, Carsten Eickhoff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16743
Pdf URL: https://arxiv.org/pdf/2505.16743
Copy Paste: [[2505.16743]] TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning(https://arxiv.org/abs/2505.16743)
Keywords: large language model
Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: this https URL
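Illustrative sketch (not the authors' implementation): the abstract describes assigning a different sparsity ratio to each output row of a layer. The snippet below shows only that row-wise step, and assumes weight magnitude as the importance score; the iterative, metric-driven adjustment of the ratios is not reproduced. The function names and sample values are hypothetical.

```python
def prune_row(row, sparsity):
    """Zero out the `sparsity` fraction of a row with the smallest magnitude."""
    n_prune = int(len(row) * sparsity)
    # Indices ordered by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(row)), key=lambda i: abs(row[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(row)]


def prune_rowwise(weights, row_sparsities):
    """Apply a separate sparsity ratio to each output row of a weight matrix."""
    return [prune_row(row, s) for row, s in zip(weights, row_sparsities)]


# Two output rows pruned at different ratios; the layer average is still 50%.
weights = [[0.9, -0.1, 0.4, 0.05], [0.2, -0.7, 0.3, 0.6]]
pruned = prune_rowwise(weights, [0.75, 0.25])
```

A uniform scheme would prune both rows at 50%; here the same overall budget is spent unevenly, which is the degree of freedom the abstract attributes to TRIM.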

Title: PyTupli: A Scalable Infrastructure for Collaborative Offline Reinforcement Learning Projects

Authors: Hannah Markgraf, Michael Eichelbeck, Daria Cappey, Selin Demirtürk, Yara Schattschneider, Matthias Althoff
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16754
Pdf URL: https://arxiv.org/pdf/2505.16754
Copy Paste: [[2505.16754]] PyTupli: A Scalable Infrastructure for Collaborative Offline Reinforcement Learning Projects(https://arxiv.org/abs/2505.16754)
Keywords: secure, robust
Abstract: Offline reinforcement learning (RL) has gained traction as a powerful paradigm for learning control policies from pre-collected data, eliminating the need for costly or risky online interactions. While many open-source libraries offer robust implementations of offline RL algorithms, they all rely on datasets composed of experience tuples consisting of state, action, next state, and reward. Managing, curating, and distributing such datasets requires suitable infrastructure. Although static datasets exist for established benchmark problems, no standardized or scalable solution supports developing and sharing datasets for novel or user-defined benchmarks. To address this gap, we introduce PyTupli, a Python-based tool to streamline the creation, storage, and dissemination of benchmark environments and their corresponding tuple datasets. PyTupli includes a lightweight client library with defined interfaces for uploading and retrieving benchmarks and data. It supports fine-grained filtering at both the episode and tuple level, allowing researchers to curate high-quality, task-specific datasets. A containerized server component enables production-ready deployment with authentication, access control, and automated certificate provisioning for secure use. By addressing key barriers in dataset infrastructure, PyTupli facilitates more collaborative, reproducible, and scalable offline RL research.

Title: Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval

Authors: Hailong Ning, Siying Wang, Tao Lei, Xiaopeng Cao, Huanmin Dou, Bin Zhao, Asoke K. Nandi, Petia Radeva
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16756
Pdf URL: https://arxiv.org/pdf/2505.16756
Copy Paste: [[2505.16756]] Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval(https://arxiv.org/abs/2505.16756)
Keywords: robust
Abstract: Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between image and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to enhancing model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features through a Differential Attention (DA) mechanism, while TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the full fine-tuned GeoRSCLIP model.

Title: When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques

Authors: Jianing Geng, Biao Yi, Zekun Fei, Tongxi Wu, Lihai Nie, Zheli Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16765
Pdf URL: https://arxiv.org/pdf/2505.16765
Copy Paste: [[2505.16765]] When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques(https://arxiv.org/abs/2505.16765)
Keywords: security, attack, steal, large language model
Abstract: Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the optimal comprehensive scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth capabilities. The code is available at this https URL

Title: Mitigating Overfitting in Medical Imaging: Self-Supervised Pretraining vs. ImageNet Transfer Learning for Dermatological Diagnosis

Authors: Iván Matas, Carmen Serrano, Miguel Nogales, David Moreno, Lara Ferrándiz, Teresa Ojeda, Begoña Acha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16773
Pdf URL: https://arxiv.org/pdf/2505.16773
Copy Paste: [[2505.16773]] Mitigating Overfitting in Medical Imaging: Self-Supervised Pretraining vs. ImageNet Transfer Learning for Dermatological Diagnosis(https://arxiv.org/abs/2505.16773)
Keywords: extraction
Abstract: Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.

Title: IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

Authors: Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, AiTi Aw
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16774
Pdf URL: https://arxiv.org/pdf/2505.16774
Copy Paste: [[2505.16774]] IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models(https://arxiv.org/abs/2505.16774)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.

Title: Single Domain Generalization for Few-Shot Counting via Universal Representation Matching

Authors: Xianing Chen, Si Huo, Borui Jiang, Hailin Hu, Xinghao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16778
Pdf URL: https://arxiv.org/pdf/2505.16778
Copy Paste: [[2505.16778]] Single Domain Generalization for Few-Shot Counting via Universal Representation Matching(https://arxiv.org/abs/2505.16778)
Keywords: robust
Abstract: Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts the object prototypes from exemplars and then matches them with image features to construct the correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance on both the in-domain and the newly introduced domain generalization settings.

Title: Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning

Authors: Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16782
Pdf URL: https://arxiv.org/pdf/2505.16782
Copy Paste: [[2505.16782]] Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning(https://arxiv.org/abs/2505.16782)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at this https URL.

Title: CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models

Authors: Zhenzhen Ren, GuoBiao Li, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16785
Pdf URL: https://arxiv.org/pdf/2505.16785
Copy Paste: [[2505.16785]] CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models(https://arxiv.org/abs/2505.16785)
Keywords: robust, steal, large language model
Abstract: Despite providing superior performance, open-source large language models (LLMs) are vulnerable to abusive usage. To address this issue, recent works propose LLM fingerprinting methods to identify the specific source LLMs behind suspect applications. However, these methods fail to provide stealthy and robust fingerprint verification. In this paper, we propose a novel LLM fingerprinting scheme, namely CoTSRF, which utilizes the Chain of Thought (CoT) as the fingerprint of an LLM. CoTSRF first collects the responses from the source LLM by querying it with crafted CoT queries. Then, it applies contrastive learning to train a CoT extractor that extracts the CoT feature (i.e., fingerprint) from the responses. Finally, CoTSRF conducts fingerprint verification by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold. Various experiments have been conducted to demonstrate the advantage of our proposed CoTSRF for fingerprinting LLMs, particularly in stealthy and robust fingerprint verification.
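Illustrative sketch (not the authors' code): the verification step described above compares a Kullback-Leibler divergence against an empirical threshold. The snippet assumes, purely for illustration, that each CoT feature can be treated as a discrete probability distribution; the function names, the eps smoothing term, and the threshold value are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def same_source(source_feature, suspect_feature, threshold):
    """Accept the suspect model when the divergence stays under the threshold."""
    return kl_divergence(source_feature, suspect_feature) <= threshold


# Hypothetical fingerprints: one near-identical suspect, one unrelated suspect.
source = [0.7, 0.2, 0.1]
derived = [0.68, 0.21, 0.11]
unrelated = [0.1, 0.2, 0.7]
is_derived = same_source(source, derived, threshold=0.05)
is_unrelated = same_source(source, unrelated, threshold=0.05)
```

KL divergence is asymmetric, so the order of the arguments matters; which direction the paper uses is not stated in the abstract.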

Title: FlowMixer: A Constrained Neural Architecture for Interpretable Spatiotemporal Forecasting

Authors: Fares B. Mehouachi, Saif Eddin Jabari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16786
Pdf URL: https://arxiv.org/pdf/2505.16786
Copy Paste: [[2505.16786]] FlowMixer: A Constrained Neural Architecture for Interpretable Spatiotemporal Forecasting(https://arxiv.org/abs/2505.16786)
Keywords: robust, interpretability
Abstract: We introduce FlowMixer, a neural architecture that leverages constrained matrix operations to model structured spatiotemporal patterns. At its core, FlowMixer incorporates non-negative matrix mixing layers within a reversible mapping framework-applying transforms before mixing and their inverses afterward. This shape-preserving design enables a Kronecker-Koopman eigenmode framework that bridges statistical learning with dynamical systems theory, providing interpretable spatiotemporal patterns and facilitating direct algebraic manipulation of prediction horizons without retraining. Extensive experiments across diverse domains demonstrate FlowMixer's robust long-horizon forecasting capabilities while effectively modeling physical phenomena such as chaotic attractors and turbulent flows. These results suggest that architectural constraints can simultaneously enhance predictive performance and mathematical interpretability in neural forecasting systems.

Title: Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

Authors: Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16789
Pdf URL: https://arxiv.org/pdf/2505.16789
Copy Paste: [[2505.16789]] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability(https://arxiv.org/abs/2505.16789)
Keywords: defense, attack, large language model
Abstract: As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment. Our code is available at this https URL.

Title: Learning Flexible Forward Trajectories for Masked Molecular Diffusion

Authors: Hyunjin Seo, Taewon Kim, Sihyun Yu, SungSoo Ahn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16790
Pdf URL: https://arxiv.org/pdf/2505.16790
Copy Paste: [[2505.16790]] Learning Flexible Forward Trajectories for Masked Molecular Diffusion(https://arxiv.org/abs/2505.16790)
Keywords: diffusion
Abstract: Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standard MDMs severely degrades performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapses into a common state, resulting in a mixture of reconstruction targets that cannot be learned using a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD), which orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.

Title: Cohort-Based Active Modality Acquisition

Authors: Tillmann Rheude, Roland Eils, Benjamin Wild
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16791
Pdf URL: https://arxiv.org/pdf/2505.16791
Copy Paste: [[2505.16791]] Cohort-Based Active Modality Acquisition(https://arxiv.org/abs/2505.16791)
Keywords: robust, generative
Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored despite their importance in many real-world settings. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on common multimodal datasets demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of new samples in comparison to those relying solely on unimodal information, entropy guidance, and random selections. Our work provides an effective solution for optimizing modality acquisition at the cohort level, enabling better utilization of resources in constrained settings.

Title: REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

Authors: Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16792
Pdf URL: https://arxiv.org/pdf/2505.16792
Copy Paste: [[2505.16792]] REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training(https://arxiv.org/abs/2505.16792)
Keywords: diffusion, transformer, generative
Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256X256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28X reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, proving to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL.
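Illustrative sketch (not the authors' code): the two-phase schedule described above can be read as a loss that includes an alignment term until a fixed iteration and drops it afterwards. The function name training_loss and the parameters stop_step and align_weight are hypothetical, and the scalar loss values stand in for whatever denoising and alignment losses a real training loop would compute.

```python
def training_loss(denoise_loss, align_loss, step, stop_step, align_weight=0.5):
    """Combine losses with one-shot termination of the alignment term.

    Phase I (step < stop_step): denoising loss plus weighted alignment loss.
    Phase II (step >= stop_step): denoising loss only.
    """
    if step < stop_step:
        return denoise_loss + align_weight * align_loss
    return denoise_loss


# Hypothetical values: the alignment term is active at step 10, gone at step 100.
early = training_loss(1.0, 2.0, step=10, stop_step=100)
late = training_loss(1.0, 2.0, step=100, stop_step=100)
```

The switch is a hard cutoff rather than a decay, matching the "one-shot termination" wording; the abstract names a fixed iteration only as one example of a trigger.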

Title: REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Authors: Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16793
Pdf URL: https://arxiv.org/pdf/2505.16793
Copy Paste: [[2505.16793]] REOBench: Benchmarking Robustness of Earth Observation Foundation Models(https://arxiv.org/abs/2505.16793)
Keywords: robust
Abstract: Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

Title: V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Authors: Hanyue Lou, Jinxiu Liang, Minggui Teng, Yi Wang, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16797
Pdf URL: https://arxiv.org/pdf/2505.16797
Copy Paste: [[2505.16797]] V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation(https://arxiv.org/abs/2505.16797)
Keywords: robust
Abstract: Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours--an order of magnitude larger than existing event datasets, yielding substantial improvements.

Title: Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Authors: Changbing Yang, Garrett Nicolai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16800
Pdf URL: https://arxiv.org/pdf/2505.16800
Copy Paste: [[2505.16800]] Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation(https://arxiv.org/abs/2505.16800)
Keywords: transformer, large language model, segmentation
Abstract: We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.

Title: SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Authors: Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16805
Pdf URL: https://arxiv.org/pdf/2505.16805
Copy Paste: [[2505.16805]] SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving(https://arxiv.org/abs/2505.16805)
Keywords: robust, interpretability
Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.

Title: Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement

Authors: Kexin Zhang, Junlan Chen, Daifeng Li, Yuxuan Zhang, Yangyang Feng, Bowen Deng, Weixu Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.16806
Pdf URL: https://arxiv.org/pdf/2505.16806
Copy Paste: [[2505.16806]] Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement(https://arxiv.org/abs/2505.16806)
Keywords: robust, large language model
Abstract: Large language models (LLMs) encounter difficulties in knowledge-intensive multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. The current methods often extract semantically relevant but logically irrelevant evidence, resulting in flawed reasoning and inaccurate responses. We propose a two-way evidence self-alignment (TW-ESA) module, which utilizes the mutual alignment between strict reasoning and LLM reasoning to enhance its understanding of the causal logic of evidence, thereby addressing the first challenge. Another challenge is how to utilize the rationale evidence and LLM's intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual-gated reasoning enhancement (DGR) module to gradually fuse useful knowledge of LLM within strict reasoning, which can enable the model to perform accurate reasoning by focusing on causal elements in the evidence and exhibit greater robustness. The two modules are collaboratively trained in a unified framework ESA-DGR. Extensive experiments on three diverse and challenging KIMSR datasets reveal that ESA-DGR significantly surpasses state-of-the-art LLM-based fine-tuning methods, with remarkable average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at this https URL.

Title: Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities

Authors: Junze Wang (1), Lei Fan (2,3), Weipeng Jing (1), Donglin Di (4), Yang Song (3), Sidong Liu (5), Cong Cong (5) ((1) College of Computer and Control Engineering, Northeast Forestry University, Harbin, China, (2) The Centre for Healthy Brain Ageing (CHeBA), University of New South Wales, Sydney, Australia, (3) School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, (4) Space AI, Li Auto, Beijing, China, (5) Centre for Health Informatics, Macquarie University, Sydney, Australia)
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16809
Pdf URL: https://arxiv.org/pdf/2505.16809
Copy Paste: [[2505.16809]] Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities(https://arxiv.org/abs/2505.16809)
Keywords: segmentation
Abstract: Existing methods for multimodal MRI segmentation with missing modalities typically assume that all MRI modalities are available during training. However, in clinical practice, some modalities may be missing due to the sequential nature of MRI acquisition, leading to performance degradation. Furthermore, retraining models to accommodate newly available modalities can be inefficient and may cause overfitting, potentially compromising previously learned knowledge. To address these challenges, we propose Replay-based Hypergraph Domain Incremental Learning (ReHyDIL) for brain tumor segmentation with missing modalities. ReHyDIL leverages Domain Incremental Learning (DIL) to enable the segmentation model to learn from newly acquired MRI modalities without forgetting previously learned information. To enhance segmentation performance across diverse patient scenarios, we introduce the Cross-Patient Hypergraph Segmentation Network (CHSNet), which utilizes hypergraphs to capture high-order associations between patients. Additionally, we incorporate Tversky-Aware Contrastive (TAC) loss to effectively mitigate information imbalance both across and within different modalities. Extensive experiments on the BraTS2019 dataset demonstrate that ReHyDIL outperforms state-of-the-art methods, achieving an improvement of over 2% in the Dice Similarity Coefficient across various tumor regions. Our code is available at ReHyDIL.
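
Note: the abstract does not define the TAC loss, so the sketch below only illustrates the standard Tversky index that such a loss would presumably build on, and how it reduces to the Dice Similarity Coefficient when both weights are 0.5. The function name, the flat 0/1 mask representation, and the toy masks are assumptions made for illustration, not taken from the paper.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5):
    """Standard Tversky index for two binary masks given as flat 0/1 sequences.

    alpha weights false positives, beta weights false negatives.
    alpha = beta = 0.5 gives the Dice coefficient; alpha = beta = 1 gives Jaccard.
    """
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = tp + alpha * fp + beta * fn
    # Two empty masks agree perfectly by convention.
    return 1.0 if denom == 0 else tp / denom


# Hypothetical toy masks: 2 true positives, 1 false positive, 2 false negatives.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 1, 1, 1, 0]

dice = tversky_index(pred, target)                 # 2 / 3.5 = 4/7
recall_weighted = tversky_index(pred, target, alpha=0.3, beta=0.7)
jaccard = tversky_index(pred, target, alpha=1.0, beta=1.0)  # 2 / 5
print(dice, recall_weighted, jaccard)
```

Raising beta above alpha penalizes missed foreground more than spurious foreground, which is the usual reason for preferring a Tversky-style term on imbalanced segmentation targets.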

Title: Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

Authors: Gaurav Kamath, Sowmya Vajjala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16814
Pdf URL: https://arxiv.org/pdf/2505.16814
Copy Paste: [[2505.16814]] Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?(https://arxiv.org/abs/2505.16814)
Keywords: robust
Abstract: Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

Title: Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts

Authors: Taewon Kang, Ming C. Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16819
Pdf URL: https://arxiv.org/pdf/2505.16819
Copy Paste: [[2505.16819]] Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts(https://arxiv.org/abs/2505.16819)
Keywords: large language model
Abstract: Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.

Title: Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16831
Pdf URL: https://arxiv.org/pdf/2505.16831
Copy Paste: [[2505.16831]] Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs(https://arxiv.org/abs/2505.16831)
Keywords: large language model
Abstract: Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: this https URL.

Title: SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Authors: Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.16834
Pdf URL: https://arxiv.org/pdf/2505.16834
Copy Paste: [[2505.16834]] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis(https://arxiv.org/abs/2505.16834)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they either lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes diversity and quality on both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at this https URL.

Title: R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search

Authors: Yibo Wang, Li Shen, Huanjin Yao, Tiansheng Huang, Rui Liu, Naiqiang Tan, Jiaxing Huang, Kai Zhang, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16838
Pdf URL: https://arxiv.org/pdf/2505.16838
Copy Paste: [[2505.16838]] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search(https://arxiv.org/abs/2505.16838)
Keywords: large language model
Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select the short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at this https URL

Title: LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Authors: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16839
Pdf URL: https://arxiv.org/pdf/2505.16839
Copy Paste: [[2505.16839]] LaViDa: A Large Diffusion Language Model for Multimodal Understanding(https://arxiv.org/abs/2505.16839)
Keywords: diffusion
Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

Title: ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning

Authors: Tajamul Ashraf, Mohammed Mohsen Peerzada, Moloud Abdar, Yutong Xie, Yuyin Zhou, Xiaofeng Liu, Iqra Altaf Gillani, Janibul Bashir
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16850
Pdf URL: https://arxiv.org/pdf/2505.16850
Copy Paste: [[2505.16850]] ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning(https://arxiv.org/abs/2505.16850)
Keywords: privacy, federate, fair
Abstract: Federated Learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data privacy across decentralized participants. As FL adoption grows, numerous techniques have been proposed to tackle its practical challenges. However, the lack of standardized evaluation across key dimensions hampers systematic progress and fair comparison of FL methods. In this work, we introduce ATR-Bench, a unified framework for analyzing federated learning through three foundational dimensions: Adaptation, Trust, and Reasoning. We provide an in-depth examination of the conceptual foundations, task formulations, and open research challenges associated with each theme. We have extensively benchmarked representative methods and datasets for adaptation to heterogeneous clients and trustworthiness in adversarial or unreliable environments. Due to the lack of reliable metrics and models for reasoning in FL, we only provide literature-driven insights for this dimension. ATR-Bench lays the groundwork for a systematic and holistic evaluation of federated learning with real-world relevance. We will make our complete codebase publicly accessible, along with a curated repository that continuously tracks new developments and research in the FL literature.

Title: Redefining Clustered Federated Learning for System Identification: The Path of ClusterCraft

Authors: Ertuğrul Keçeci, Müjde Güzelkaya, Tufan Kumbasar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16857
Pdf URL: https://arxiv.org/pdf/2505.16857
Copy Paste: [[2505.16857]] Redefining Clustered Federated Learning for System Identification: The Path of ClusterCraft(https://arxiv.org/abs/2505.16857)
Keywords: federate
Abstract: This paper addresses the System Identification (SYSID) problem within the framework of federated learning. We introduce a novel algorithm, Incremental Clustering-based federated learning method for SYSID (IC-SYSID), designed to tackle SYSID challenges across multiple data sources without prior knowledge. IC-SYSID utilizes an incremental clustering method, ClusterCraft (CC), to eliminate the dependency on the prior knowledge of the dataset. CC starts with a single cluster model and assigns similar local workers to the same clusters by dynamically increasing the number of clusters. To reduce the number of clusters generated by CC, we introduce ClusterMerge, where similar cluster models are merged. We also introduce enhanced ClusterCraft to reduce the generation of similar cluster models during training. Moreover, IC-SYSID addresses cluster model instability by integrating a regularization term into the loss function and initializing cluster models with scaled Glorot initialization. It also utilizes a mini-batch deep learning approach to manage large SYSID datasets during local training. Through experiments on a representative real-world SYSID problem, in which a fleet of vehicles collaboratively learns vehicle dynamics, we show that IC-SYSID achieves high SYSID performance while preventing the learning of unstable clusters.

Title: Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Authors: Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16862
Pdf URL: https://arxiv.org/pdf/2505.16862
Copy Paste: [[2505.16862]] Conditional Panoramic Image Generation via Masked Autoregressive Modeling(https://arxiv.org/abs/2505.16862)
Keywords: diffusion, generative
Abstract: Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
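
Note: as a rough illustration of the circular padding mentioned above, the sketch below wraps columns of a 2D grid horizontally, so the left and right edges of an equirectangular panorama see each other as neighbors. Where PAR applies the padding (pixels, latents, or feature maps) is not stated in the abstract; the function name and the list-of-rows representation are assumptions for illustration only.

```python
def circular_pad_width(image, pad):
    """Pad every row on both sides with columns wrapped from the opposite edge.

    `image` is a list of equal-length rows. In an ERP panorama the first and
    last columns are adjacent on the sphere, so wrapping (instead of zero
    padding) keeps the seam continuous for any local operator.
    """
    if pad < 0 or any(pad > len(row) for row in image):
        raise ValueError("pad must be between 0 and the row width")
    # Slice with explicit indices so that pad == 0 yields no extra columns.
    return [row[len(row) - pad:] + row + row[:pad] for row in image]


# Hypothetical 2 x 4 grid standing in for a panorama.
grid = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
padded = circular_pad_width(grid, 1)
print(padded)  # [[4, 1, 2, 3, 4, 1], [8, 5, 6, 7, 8, 5]]
```

Only the width axis is wrapped here, since the top and bottom of an equirectangular image correspond to the poles rather than to each other.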

Title: Training-Free Efficient Video Generation via Dynamic Token Carving

Authors: Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16864
Pdf URL: https://arxiv.org/pdf/2505.16864
Copy Paste: [[2505.16864]] Training-Free Efficient Video Generation via Dynamic Token Carving(https://arxiv.org/abs/2505.16864)
Keywords: diffusion, transformer
Abstract: Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL

Title: MPO: Multilingual Safety Alignment via Reward Gap Optimization

Authors: Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16869
Pdf URL: https://arxiv.org/pdf/2505.16869
Copy Paste: [[2505.16869]] MPO: Multilingual Safety Alignment via Reward Gap Optimization(https://arxiv.org/abs/2505.16869)
Keywords: secure, robust, large language model
Abstract: Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO's efficacy in multilingual safety alignment without degrading general multilingual utility.

Title: A Multi-Step Comparative Framework for Anomaly Detection in IoT Data Streams

Authors: Mohammed Al-Qudah, Fadi AlMahamid
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16872
Pdf URL: https://arxiv.org/pdf/2505.16872
Copy Paste: [[2505.16872]] A Multi-Step Comparative Framework for Anomaly Detection in IoT Data Streams(https://arxiv.org/abs/2505.16872)
Keywords: security
Abstract: The rapid expansion of Internet of Things (IoT) devices has introduced critical security challenges, underscoring the need for accurate anomaly detection. Although numerous studies have proposed machine learning (ML) methods for this purpose, limited research systematically examines how different preprocessing steps--normalization, transformation, and feature selection--interact with distinct model architectures. To address this gap, this paper presents a multi-step evaluation framework assessing the combined impact of preprocessing choices on three ML algorithms: RNN-LSTM, autoencoder neural networks (ANN), and Gradient Boosting (GBoosting). Experiments on the IoTID20 dataset show that GBoosting consistently delivers superior accuracy across preprocessing configurations, while RNN-LSTM shows notable gains with z-score normalization and autoencoders excel in recall, making them well-suited for unsupervised scenarios. By offering a structured analysis of preprocessing decisions and their interplay with various ML techniques, the proposed framework provides actionable guidance to enhance anomaly detection performance in IoT environments.
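
Note: for readers unfamiliar with the z-score normalization referred to above, a minimal sketch for a single feature column follows. It uses the population standard deviation, which is an assumption (the abstract does not say which variant the authors use), and the sample values are made up rather than drawn from IoTID20.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale a feature column to zero mean and unit (population) std-dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column: mean 5, population std-dev 2.
column = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = zscore(column)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

In a train/test setting the mean and standard deviation would normally be estimated on the training split only and then reused for the test split.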

Title: T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

Authors: Zhehao Huang, Yuhang Liu, Yixin Lou, Zhengbao He, Mingzhen He, Wenxing Zhou, Tao Li, Kehan Li, Zeyi Huang, Xiaolin Huang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16875
Pdf URL: https://arxiv.org/pdf/2505.16875
Copy Paste: [[2505.16875]] T2I-ConBench: Text-to-Image Benchmark for Continual Post-training(https://arxiv.org/abs/2505.16875)
Keywords: diffusion
Abstract: Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.

Title: CASTILLO: Characterizing Response Length Distributions of Large Language Models

Authors: Daniel F. Perez-Ramirez, Dejan Kostic, Magnus Boman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16881
Pdf URL: https://arxiv.org/pdf/2505.16881
Copy Paste: [[2505.16881]] CASTILLO: Characterizing Response Length Distributions of Large Language Models(https://arxiv.org/abs/2505.16881)
Keywords: generative, large language model
Abstract: Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.
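
Note: the per-pair summary described above (10 completions, token length of each, then mean, std-dev and percentiles) can be sketched as follows. The token counts are invented, the choice of quartiles and of the sample standard deviation is an assumption, and the field names are hypothetical rather than the dataset's actual schema.

```python
from statistics import mean, quantiles, stdev


def summarize_lengths(token_lengths):
    """Summary statistics for the response lengths of one <prompt, model> pair."""
    # quantiles(n=4) returns the three quartile cut points.
    p25, p50, p75 = quantiles(token_lengths, n=4, method="inclusive")
    return {
        "n": len(token_lengths),
        "mean": mean(token_lengths),
        "std": stdev(token_lengths),  # sample std-dev (assumed variant)
        "min": min(token_lengths),
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "max": max(token_lengths),
    }


# Hypothetical token counts for 10 independent completions of one prompt.
lengths = [120, 95, 210, 180, 130, 160, 90, 300, 140, 175]
stats = summarize_lengths(lengths)
print(stats)
```

A scheduler could use the upper percentile or the maximum as a conservative budget for a given prompt and model, rather than a single point estimate.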

Title: CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework

Authors: Viet Pham, Thai Le
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16888
Pdf URL: https://arxiv.org/pdf/2505.16888
Copy Paste: [[2505.16888]] CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework(https://arxiv.org/abs/2505.16888)
Keywords: security, attack, robust, large language model
Abstract: Large language models (LLMs) have advanced many applications, but are also known to be vulnerable to adversarial attacks. In this work, we introduce a novel security threat: hijacking AI-human conversations by manipulating LLMs' system prompts to produce malicious answers only to specific targeted questions (e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"), while behaving benignly on others. This attack is detrimental as it can enable malicious actors to exercise large-scale information manipulation by spreading harmful but benign-looking system prompts online. To demonstrate such an attack, we develop CAIN, an algorithm that can automatically curate such harmful system prompts for a specific target question in a black-box setting, that is, without the need to access the LLM's parameters. Evaluated on both open-source and commercial LLMs, CAIN demonstrates significant adversarial impact. In untargeted attacks, which force LLMs to output incorrect answers, CAIN achieves up to 40% F1 degradation on targeted questions while preserving high accuracy on benign inputs. In targeted attacks, which force LLMs to output specific harmful answers, CAIN achieves over 70% F1 scores on these targeted responses with minimal impact on benign questions. Our results highlight the critical need for enhanced robustness measures to safeguard the integrity and safety of LLMs in real-world applications. All source code will be publicly available.

Title: Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs

Authors: Zeyu Wei, Shuo Wang, Xiaohui Rong, Xuemin Liu, He Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16894
Pdf URL: https://arxiv.org/pdf/2505.16894
Copy Paste: [[2505.16894]] Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs(https://arxiv.org/abs/2505.16894)
Keywords: diffusion, large language model
Abstract: Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
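
As a side note on the reported JS-Drift value of about 0.69: the Jensen-Shannon divergence computed with natural logarithms is bounded above by ln 2 ≈ 0.693, reached when two distributions have disjoint support. The sketch below shows that bound. Whether the paper uses natural logarithms, or whether its plateau corresponds to this bound, is an assumption and is not stated in the abstract.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two distributions with disjoint support attain the maximum value ln 2.
p = [1.0, 0.0]
q = [0.0, 1.0]
js_max = js_divergence(p, q)
js_same = js_divergence(p, p)
```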

Title: Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality

Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16900
Pdf URL: https://arxiv.org/pdf/2505.16900
Copy Paste: [[2505.16900]] Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality(https://arxiv.org/abs/2505.16900)
Keywords: large language model
Abstract: During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
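
A minimal sketch of frequency-based re-weighting in the spirit of the description above. The exact weighting formula is not given in the abstract, so the form `(count + eps) ** (-alpha)`, the parameter names `alpha` and `eps`, and the normalization by the sum of weights are all assumptions made for illustration.

```python
import math
from collections import Counter

def power_law_weights(corpus_tokens, alpha=0.5, eps=1.0):
    """Assign each token a weight that decays as a power law of its corpus count.

    Hypothetical form: w(t) = (count(t) + eps) ** (-alpha).
    """
    counts = Counter(corpus_tokens)
    return {tok: (c + eps) ** (-alpha) for tok, c in counts.items()}

def weighted_nll(target_tokens, target_probs, weights, default_weight=1.0):
    """Weighted negative log-likelihood over one sequence.

    target_probs[i] is the model's probability for target_tokens[i].
    Tokens unseen in the corpus fall back to default_weight (an assumption).
    """
    total = 0.0
    norm = 0.0
    for tok, prob in zip(target_tokens, target_probs):
        w = weights.get(tok, default_weight)
        total += -w * math.log(prob)
        norm += w
    return total / norm

# Toy corpus: "the" is frequent, "tensor" is rare.
weights = power_law_weights(["the"] * 8 + ["tensor"])
loss = weighted_nll(["the", "tensor"], [0.9, 0.2], weights)
```

With these toy counts the rare token receives roughly twice the weight of the frequent one, so an error on "tensor" contributes more to the loss than an equally sized error on "the".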

Title: Unsupervised Prompting for Graph Neural Networks

Authors: Peyman Baghershahi, Sourav Medya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16903
Pdf URL: https://arxiv.org/pdf/2505.16903
Copy Paste: [[2505.16903]] Unsupervised Prompting for Graph Neural Networks(https://arxiv.org/abs/2505.16903)
Keywords: large language model
Abstract: Prompt tuning methods for Graph Neural Networks (GNNs) have become popular to address the semantic gap between pre-training and fine-tuning steps. However, existing GNN prompting methods rely on labeled data and involve lightweight fine-tuning for downstream tasks. Meanwhile, in-context learning methods for Large Language Models (LLMs) have shown promising performance with no parameter updating and no or minimal labeled data. Inspired by these approaches, in this work, we first introduce a challenging problem setup to evaluate GNN prompting methods. This setup encourages a prompting function to enhance a pre-trained GNN's generalization to a target dataset under covariate shift without updating the GNN's parameters and with no labeled data. Next, we propose a fully unsupervised prompting method based on consistency regularization through pseudo-labeling. We use two regularization techniques to align the prompted graphs' distribution with the original data and reduce biased predictions. Through extensive experiments under our problem setting, we demonstrate that our unsupervised approach outperforms the state-of-the-art prompting methods that have access to labels.

Title: Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Authors: Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16916
Pdf URL: https://arxiv.org/pdf/2505.16916
Copy Paste: [[2505.16916]] Backdoor Cleaning without External Guidance in MLLM Fine-tuning(https://arxiv.org/abs/2505.16916)
Keywords: security, defense, attack, robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
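
To illustrate why entropy can signal attention concentrated on a few regions, here is a small sketch that computes the Shannon entropy of an attention distribution. It is not BYE's scoring procedure, which the abstract does not specify; the function name and the example weights are made up.

```python
import math

def attention_entropy(attn_weights):
    """Shannon entropy (in nats) of non-negative attention weights.

    The weights are normalized to sum to 1 before the entropy is computed.
    """
    total = sum(attn_weights)
    probs = [w / total for w in attn_weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Attention spread evenly over four image regions versus attention
# concentrated almost entirely on one region.
uniform = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]
h_uniform = attention_entropy(uniform)
h_collapsed = attention_entropy(collapsed)
```

The concentrated distribution has much lower entropy than the uniform one, which is the kind of gap an unsupervised filter could exploit.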

Title: Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype

Authors: Nikola Tankovic, Robert Sajina
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16918
Pdf URL: https://arxiv.org/pdf/2505.16918
Copy Paste: [[2505.16918]] Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype(https://arxiv.org/abs/2505.16918)
Keywords: interpretability, large language model
Abstract: This paper presents a concise review of Contextual Multi-Armed Bandit (CMAB) methods and introduces an experimental framework for scalable, interpretable offer selection, addressing the challenge of fast-changing offers. The approach models context at the product category level, allowing offers to span multiple categories and enabling knowledge transfer across similar offers. This improves learning efficiency and generalization in dynamic environments. The framework extends standard CMAB methodology to support multi-category contexts, and achieves scalability through efficient feature engineering and modular design. Advanced features such as MPG (Member Purchase Gap) and MF (Matrix Factorization) capture nuanced user-offer interactions, with implementation in Python for practical deployment. A key contribution is interpretability at scale: logistic regression models yield transparent weight vectors, accessible via a large language model (LLM) interface for real-time, user-level tracking and explanation of evolving preferences. This enables the generation of detailed member profiles and identification of behavioral patterns, supporting personalized offer optimization and enhancing trust in automated decisions. By situating our prototype alongside established paradigms like Generalized Linear Models and Thompson Sampling, we demonstrate its value for both research and real-world CMAB applications.

Title: UNCLE: Uncertainty Expressions in Long-Form Generation

Authors: Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Dong Yu, Nigel Collier, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16922
Pdf URL: https://arxiv.org/pdf/2505.16922
Copy Paste: [[2505.16922]] UNCLE: Uncertainty Expressions in Long-Form Generation(https://arxiv.org/abs/2505.16922)
Keywords: fair, large language model
Abstract: Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.

Title: LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Authors: Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16933
Pdf URL: https://arxiv.org/pdf/2505.16933
Copy Paste: [[2505.16933]] LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning(https://arxiv.org/abs/2505.16933)
Keywords: diffusion, large language model
Abstract: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: this https URL.

Title: In-Context Watermarks for Large Language Models

Authors: Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, Yuheng Bu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16934
Pdf URL: https://arxiv.org/pdf/2505.16934
Copy Paste: [[2505.16934]] In-Context Watermarks for Large Language Models(https://arxiv.org/abs/2505.16934)
Keywords: watermark, large language model
Abstract: The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.

Title: SPAR: Self-supervised Placement-Aware Representation Learning for Multi-Node IoT Systems

Authors: Yizhuo Chen, Tianchen Wang, You Lyu, Yanlan Hu, Jinyang Li, Tomoyoshi Kimura, Hongjue Zhao, Yigong Hu, Denizhan Kara, Tarek Abdelzaher
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16936
Pdf URL: https://arxiv.org/pdf/2505.16936
Copy Paste: [[2505.16936]] SPAR: Self-supervised Placement-Aware Representation Learning for Multi-Node IoT Systems(https://arxiv.org/abs/2505.16936)
Keywords: robust
Abstract: This work develops the underpinnings of self-supervised placement-aware representation learning given spatially-distributed (multi-view and multimodal) sensor observations, motivated by the need to represent external environmental state in multi-sensor IoT systems in a manner that correctly distills spatial phenomena from the distributed multi-vantage observations. The objective of sensing in IoT systems is, in general, to collectively represent an externally observed environment given multiple vantage points from which sensory observations occur. Pretraining of models that help interpret sensor data must therefore encode the relation between signals observed by sensors and the observers' vantage points in order to attain a representation that encodes the observed spatial phenomena in a manner informed by the specific placement of the measuring instruments, while allowing arbitrary placement. The work significantly advances self-supervised model pretraining from IoT signals beyond current solutions that often overlook the distinctive spatial nature of IoT data. Our framework explicitly learns the dependencies between measurements and geometric observer layouts and structural characteristics, guided by a core design principle: the duality between signals and observer positions. We further provide theoretical analyses from the perspectives of information theory and occlusion-invariant representation learning to offer insight into the rationale behind our design. Experiments on three real-world datasets--covering vehicle monitoring, human activity recognition, and earthquake localization--demonstrate the superior generalizability and robustness of our method across diverse modalities, sensor placements, application-level inference tasks, and spatial scales.

Title: FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records

Authors: Chao Pang, Vincent Jeanselme, Young Sang Choi, Xinzhuo Jiang, Zilin Jing, Aparajita Kashyap, Yuta Kobayashi, Yanwei Li, Florent Pollet, Karthik Natarajan, Shalmali Joshi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16941
Pdf URL: https://arxiv.org/pdf/2505.16941
Copy Paste: [[2505.16941]] FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records(https://arxiv.org/abs/2505.16941)
Keywords: robust
Abstract: Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility due to the lack of desiderata of comprehensive and meaningful tasks and sufficiently diverse evaluations to characterize the benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes, early prediction of acute and chronic conditions, including desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data consisting of 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.

Title: MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Authors: Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16947
Pdf URL: https://arxiv.org/pdf/2505.16947
Copy Paste: [[2505.16947]] MixAT: Combining Continuous and Discrete Adversarial Training for LLMs(https://arxiv.org/abs/2505.16947)
Keywords: defense, attack, robust, large language model
Abstract: Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at this https URL.
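
One plausible reading of the At Least One Attack Success Rate (ALO-ASR) metric is the fraction of prompts broken by at least one attack in the evaluated set. The sketch below implements that reading. The definition is inferred from the metric's name rather than from the paper, and the data layout and attack names are hypothetical.

```python
def alo_asr(success_by_attack):
    """Fraction of prompts for which at least one attack succeeded.

    success_by_attack maps an attack name to a list of booleans, one per
    prompt, in the same prompt order for every attack (assumed layout).
    """
    per_attack = list(success_by_attack.values())
    n_prompts = len(per_attack[0])
    if any(len(results) != n_prompts for results in per_attack):
        raise ValueError("every attack must be evaluated on the same prompts")
    broken = sum(
        1 for i in range(n_prompts) if any(results[i] for results in per_attack)
    )
    return broken / n_prompts

# Made-up results: each attack alone succeeds on 1 of 4 prompts (25%),
# but they succeed on different prompts.
results = {
    "attack_a": [True, False, False, False],
    "attack_b": [False, True, False, False],
}
score = alo_asr(results)
```

Because the two attacks break different prompts, the combined rate (50%) is higher than either individual rate, which is why a worst-case metric of this kind can expose weaknesses that per-attack numbers hide.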

Title: Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning

Authors: Adnan Oomerjee, Zafeirios Fountas, Zhongwei Yu, Haitham Bou-Ammar, Jun Wang
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2505.16950
Pdf URL: https://arxiv.org/pdf/2505.16950
Copy Paste: [[2505.16950]] Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning(https://arxiv.org/abs/2505.16950)
Keywords: transformer, large language model
Abstract: Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.

Title: A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization

Authors: Shengyu Feng, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16952
Pdf URL: https://arxiv.org/pdf/2505.16952
Copy Paste: [[2505.16952]] A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization(https://arxiv.org/abs/2505.16952)
Keywords: robust, large language model
Abstract: Machine learning (ML) has demonstrated considerable potential in supporting model design and optimization for combinatorial optimization (CO) problems. However, much of the progress to date has been evaluated on small-scale, synthetic datasets, raising concerns about the practical effectiveness of ML-based solvers in real-world, large-scale CO scenarios. Additionally, many existing CO benchmarks lack sufficient training data, limiting their utility for evaluating data-driven approaches. To address these limitations, we introduce FrontierCO, a comprehensive benchmark that covers eight canonical CO problem types and evaluates 16 representative ML-based solvers--including graph neural networks and large language model (LLM) agents. FrontierCO features challenging instances drawn from industrial applications and frontier CO research, offering both realistic problem difficulty and abundant training data. Our empirical results provide critical insights into the strengths and limitations of current ML methods, helping to guide more robust and practically relevant advances at the intersection of machine learning and combinatorial optimization. Our data is available at this https URL.

Title: Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models

Authors: Junjie Xiong, Changjia Zhu, Shuhang Lin, Chong Zhang, Yongfeng Zhang, Yao Liu, Lingyao Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16957
Pdf URL: https://arxiv.org/pdf/2505.16957
Copy Paste: [[2505.16957]] Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models(https://arxiv.org/abs/2505.16957)
Keywords: security, attack, large language model
Abstract: Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like Model Context Protocol (MCP). This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages, where attackers manipulate code-to-glyph mapping to inject deceptive content that is invisible to users. We evaluate two critical attack scenarios: (1) "malicious content relay" and (2) "sensitive data leakage" through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious font can bypass LLM safety mechanisms through external resources, achieving varying success rates based on data sensitivity and prompt design. Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content.

Title: Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models

Authors: Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16959
Pdf URL: https://arxiv.org/pdf/2505.16959
Copy Paste: [[2505.16959]] Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models(https://arxiv.org/abs/2505.16959)
Keywords: privacy, diffusion, generative
Abstract: Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models were perfectly minimizing their training loss, they would just generate data belonging to their training set, i.e., memorize, as empirically found in the overparameterized regime. We revisit this view by showing that, in highly overparameterized diffusion models, generalization in natural data domains is progressively achieved during training before the onset of memorization. Our results, ranging from image to language diffusion models, systematically support the empirical law that memorization time is proportional to the dataset size. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules, where generalization corresponds to the hierarchical acquisition of deeper grammar rules as training time grows, and the generalization cost of early stopping can be characterized. We summarize these results in a phase diagram. Overall, our results support that a principled early-stopping criterion - scaling with dataset size - can effectively optimize generalization while avoiding memorization, with direct implications for hyperparameter transfer and privacy-sensitive applications.

Title: BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation

Authors: Fengyi Li, Kayhan Behdin, Natesh Pillai, Xiaofeng Wang, Zhipeng Wang, Ercan Yildiz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16965
Pdf URL: https://arxiv.org/pdf/2505.16965
Copy Paste: [[2505.16965]] BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation(https://arxiv.org/abs/2505.16965)
Keywords: segmentation
Abstract: Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications. In this paper, we propose a graphical model-based unsupervised learning approach, named BP-Seg, for efficient text segmentation. Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar. This is achieved through belief propagation on carefully constructed graphical models. Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared to competing approaches.

Title: UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation

Authors: Himangi Mittal, Peiye Zhuang, Hsin-Ying Lee, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16971
Pdf URL: https://arxiv.org/pdf/2505.16971
Copy Paste: [[2505.16971]] UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation(https://arxiv.org/abs/2505.16971)
Keywords: robust
Abstract: We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. At inference, UniPhy allows 'inverse simulation', i.e., inferring material properties by optimizing the scene-specific latent to match the available observations via differentiable simulation. In contrast to existing methods that treat such inference as system identification, UniPhy does not rely on user-specified material type information. Compared to prior neural constitutive modeling approaches, which learn instance-specific networks, the shared training across materials improves both the robustness and the accuracy of the estimates. We train UniPhy using simulated trajectories across diverse geometries and materials -- elastic, plasticine, sand, and fluids (Newtonian & non-Newtonian). At inference, given an object with unknown material properties, UniPhy can infer the material properties via latent optimization to match the motion observations, and can then allow re-simulating the object under diverse scenarios. We compare UniPhy against prior inverse simulation methods, and show that the inference from UniPhy enables more accurate replay and re-simulation under novel conditions.

Title: OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning

Authors: Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16974
Pdf URL: https://arxiv.org/pdf/2505.16974
Copy Paste: [[2505.16974]] OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning(https://arxiv.org/abs/2505.16974)
Keywords: interpretability, segmentation
Abstract: Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse-to-fine manner. Based on these reasoning steps, we can compose detailed description prompts, and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.

Title: Creatively Upscaling Images with Global-Regional Priors

Authors: Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16976
Pdf URL: https://arxiv.org/pdf/2505.16976
Copy Paste: [[2505.16976]] Creatively Upscaling Images with Global-Regional Priors(https://arxiv.org/abs/2505.16976)
Keywords: diffusion
Abstract: Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from a given global prompt and from regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is treated as a global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.

Title: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

Authors: Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16977
Pdf URL: https://arxiv.org/pdf/2505.16977
Copy Paste: [[2505.16977]] Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On(https://arxiv.org/abs/2505.16977)
Keywords: diffusion
Abstract: Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the standard recipe for the VTON task. Nevertheless, preserving the shape and every detail of the given garment remains challenging due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way of putting clothing on a human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: this https URL.

Title: Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

Authors: Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16980
Pdf URL: https://arxiv.org/pdf/2505.16980
Copy Paste: [[2505.16980]] Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction(https://arxiv.org/abs/2505.16980)
Keywords: diffusion
Abstract: Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With this motivation, we propose a new framework, Dynamic Pose Interaction Diffusion Models (DPIDM), which leverages diffusion models to capture dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on the VITON-HD, VVT and ViViD datasets demonstrate the superiority of DPIDM over the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.

Title: LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding

Authors: Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16983
Pdf URL: https://arxiv.org/pdf/2505.16983
Copy Paste: [[2505.16983]] LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding(https://arxiv.org/abs/2505.16983)
Keywords: large language model
Abstract: Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at this https URL.
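The abstract does not spell out how group position IDs are assigned. As a rough illustration only, the sketch below shows one possible reading of "preserving relative positions within source and target contexts": each group keeps its own position counter, so interleaving source and target tokens in a stream does not disturb the ordering inside either group. The function name `group_position_ids` and the group labels are hypothetical, and the paper's actual scheme may differ (for example, in how the two groups are offset from each other).

```python
def group_position_ids(groups):
    """Assign position IDs independently per group (hypothetical sketch).

    groups: sequence of group labels, one per token in arrival order,
            e.g. "src" for streamed input tokens and "tgt" for generated ones.
    Returns a list of position IDs where each group counts from 0 on its own.
    """
    counters = {}
    ids = []
    for g in groups:
        pos = counters.get(g, 0)   # next free position inside this group
        ids.append(pos)
        counters[g] = pos + 1
    return ids


# Interleaved stream: two source tokens, one target token, and so on.
stream = ["src", "src", "tgt", "src", "tgt"]
print(group_position_ids(stream))
```

Within each group the IDs are consecutive regardless of how the stream is interleaved, which is the property the abstract identifies as more important than absolute order.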

Title: UFT: Unifying Supervised and Reinforcement Fine-Tuning

Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16984
Pdf URL: https://arxiv.org/pdf/2505.16984
Copy Paste: [[2505.16984]] UFT: Unifying Supervised and Reinforcement Fine-Tuning(https://arxiv.org/abs/2505.16984)
Keywords: large language model
Abstract: Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

Title: Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

Authors: Moru Liu, Hao Dong, Jessica Kelly, Olga Fink, Mario Trapp
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.16985
Pdf URL: https://arxiv.org/pdf/2505.16985
Copy Paste: [[2505.16985]] Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation(https://arxiv.org/abs/2505.16985)
Keywords: segmentation
Abstract: Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at this https URL.

Title: T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Authors: Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16986
Pdf URL: https://arxiv.org/pdf/2505.16986
Copy Paste: [[2505.16986]] T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning(https://arxiv.org/abs/2505.16986)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single-domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

Title: MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Authors: Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.16988
Pdf URL: https://arxiv.org/pdf/2505.16988
Copy Paste: [[2505.16988]] MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems(https://arxiv.org/abs/2505.16988)
Keywords: fair
Abstract: LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and we invite contributions from the broader open-source community.

Title: Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16990
Pdf URL: https://arxiv.org/pdf/2505.16990
Copy Paste: [[2505.16990]] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding(https://arxiv.org/abs/2505.16990)
Keywords: diffusion, large language model
Abstract: In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a training pipeline similar to that of LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple is only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.
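To make the iteration-count claim concrete, here is a minimal toy sketch of threshold-based parallel decoding, assuming that "confident decoding" commits every masked position whose confidence reaches a threshold in a given step. The abstract does not give the exact rule, so the function `confident_decode`, the threshold value, and the toy confidence function are all hypothetical stand-ins rather than Dimple's implementation.

```python
def confident_decode(length, confidence_fn, threshold=0.9):
    """Toy parallel decoder: commit all sufficiently confident positions per step.

    length:        number of response positions, all masked at the start.
    confidence_fn: hypothetical callable (position, masked_set) -> confidence in [0, 1].
    Returns (number of iterations, list of positions committed at each step).
    """
    masked = set(range(length))
    schedule = []
    while masked:
        conf = {p: confidence_fn(p, masked) for p in masked}
        commit = [p for p, c in conf.items() if c >= threshold]
        if not commit:
            # Always make progress: fall back to the single most confident position.
            commit = [max(conf, key=conf.get)]
        masked -= set(commit)
        schedule.append(sorted(commit))
    return len(schedule), schedule


def toy_confidence(pos, masked):
    """Toy model: a block of 3 positions becomes confident once all earlier blocks are decoded."""
    block_start = (pos // 3) * 3
    earlier_done = all(q not in masked for q in range(block_start))
    return 0.95 if earlier_done else 0.5


steps, schedule = confident_decode(6, toy_confidence)
print(steps, schedule)
```

In this toy setting a 6-token response finishes in 2 iterations instead of 6, because three positions clear the threshold per step. With a threshold that nothing reaches, the fallback commits one position per step and the loop degenerates to one iteration per token, as in autoregressive decoding.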

Title: Native Segmentation Vision Transformers

Authors: Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16993
Pdf URL: https://arxiv.org/pdf/2505.16993
Copy Paste: [[2505.16993]] Native Segmentation Vision Transformers(https://arxiv.org/abs/2505.16993)
Keywords: extraction, transformer, segmentation
Abstract: Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process; we call the resulting architecture the Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.

Title: DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization

Authors: Chao Zhang, Xin Shi, Xueqiao Zhang, Yifan Zhu, Yi Yang, Yawei Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16995
Pdf URL: https://arxiv.org/pdf/2505.16995
Copy Paste: [[2505.16995]] DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization(https://arxiv.org/abs/2505.16995)
Keywords: large language model
Abstract: Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.

Title: Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Authors: Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16998
Pdf URL: https://arxiv.org/pdf/2505.16998
Copy Paste: [[2505.16998]] Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?(https://arxiv.org/abs/2505.16998)
Keywords: large language model
Abstract: Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our code and reports are available at this https URL.

Title: Guided Diffusion Sampling on Function Spaces with Applications to PDEs

Authors: Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar
Subjects: cs.LG, cs.AI, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17004
Pdf URL: https://arxiv.org/pdf/2505.17004
Copy Paste: [[2505.17004]] Guided Diffusion Sampling on Function Spaces with Applications to PDEs(https://arxiv.org/abs/2505.17004)
Keywords: diffusion
Abstract: We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL.

Title: R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

Authors: Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.17005
Pdf URL: https://arxiv.org/pdf/2505.17005
Copy Paste: [[2505.17005]] R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning(https://arxiv.org/abs/2505.17005)
Keywords: large language model
Abstract: Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.

Title: CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Authors: Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.17006
Pdf URL: https://arxiv.org/pdf/2505.17006
Copy Paste: [[2505.17006]] CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning(https://arxiv.org/abs/2505.17006)
Keywords: robust, diffusion
Abstract: Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. We also introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
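The two quantities named in the metrics above are standard; the sketch below shows only their generic definitions, namely mean squared error and cosine similarity, in plain Python. How CoMo fits the linear probe, which embeddings it compares, and how it interprets the resulting values are not specified in the abstract, so the vectors here are made-up placeholders and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_squared_error(predicted, target):
    """Average squared difference, e.g. between probed and ground-truth actions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Placeholder embeddings (not from the paper): two opposite motion directions.
past_to_current = [1.0, 0.0]
future_to_current = [-1.0, 0.0]
print(cosine_similarity(past_to_current, future_to_current))

# Placeholder action predictions from a hypothetical linear probe.
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```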

Title: Deep mineralogical segmentation of thin section images based on QEMSCAN maps

Authors: Jean Pablo Vieira de Mello, Matheus Augusto Alves Cuglieri, Leandro P. de Figueiredo, Fernando Bordignon, Marcelo Ramalho Albuquerque, Rodrigo Surmas, Bruno Cavalcanti de Paula
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.17008
Pdf URL: https://arxiv.org/pdf/2505.17008
Copy Paste: [[2505.17008]] Deep mineralogical segmentation of thin section images based on QEMSCAN maps(https://arxiv.org/abs/2505.17008)
Keywords: segmentation
Abstract: Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies like QEMSCAN(R) are designed to automate the mineralogical mapping process, but also suffer from limitations like high monetary costs and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low-cost, generalized and efficient manner. For this, the U-Net semantic segmentation architecture is trained on plane and cross polarized thin section images using the corresponding QEMSCAN maps as target, which is an approach not widely explored. The model was trained to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, Pores and the remaining mineral phases as a single class named "Others", and it was validated on rock facies both seen and unseen during training in order to assess its generalization capability. Since the images and maps are provided in different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation depends strongly on these resolution differences and on the variety of learnable rock textures. However, it shows promising results, especially with regard to the proper delineation of mineral boundaries on solid textures and precise estimation of the mineral distributions, describing a nearly linear relationship between expected and predicted distributions, with coefficient of determination (R^2) above 0.97 for seen facies and 0.88 for unseen.
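For readers unfamiliar with the reported metric, the coefficient of determination between expected and predicted values is R^2 = 1 - SS_res / SS_tot. The sketch below computes it in plain Python using this common definition; the study may instead compute it from a fitted regression line, and the mineral fractions shown are invented placeholders, not data from the paper.

```python
def r_squared(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_expected = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    ss_tot = sum((e - mean_expected) ** 2 for e in expected)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all expected values are equal")
    return 1.0 - ss_res / ss_tot


# Placeholder mineral fractions per class (hypothetical values for illustration).
expected_fractions = [0.50, 0.30, 0.15, 0.05]
predicted_fractions = [0.48, 0.33, 0.14, 0.05]
print(r_squared(expected_fractions, predicted_fractions))
```

A value of 1 means the predictions match the expected values exactly, while a value of 0 means they do no better than always predicting the mean of the expected values.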

Title: Understanding Prompt Tuning and In-Context Learning via Meta-Learning

Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17010
Pdf URL: https://arxiv.org/pdf/2505.17010
Copy Paste: [[2505.17010]] Understanding Prompt Tuning and In-Context Learning via Meta-Learning(https://arxiv.org/abs/2505.17010)
Keywords: transformer
Abstract: Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.

Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

Authors: Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.17011
Pdf URL: https://arxiv.org/pdf/2505.17011
Copy Paste: [[2505.17011]] Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space(https://arxiv.org/abs/2505.17011)
Keywords: generative
Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such a design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
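
Budgeted token allocation of this kind can be pictured as choosing one token count per block so that the summed predicted quality is maximal while the total stays within a budget. The sketch below is not the paper's integer linear program: it swaps in an exhaustive search, which is only practical for a handful of blocks, and the score table is invented for illustration.

```python
from itertools import product


def allocate_tokens(block_scores, budget):
    """Pick one token count per block to maximize total predicted score.

    block_scores: list of dicts mapping token count -> predicted quality.
    Returns a tuple of token counts, or None if no choice fits the budget.
    """
    best_total, best_choice = None, None
    for choice in product(*(sorted(scores) for scores in block_scores)):
        if sum(choice) > budget:
            continue  # violates the overall token budget
        total = sum(scores[k] for scores, k in zip(block_scores, choice))
        if best_total is None or total > best_total:
            best_total, best_choice = total, choice
    return best_choice


# Hypothetical scorer outputs for two blocks.
block_scores = [
    {32: 0.90, 64: 0.92, 128: 0.93},  # nearly static block: few tokens suffice
    {32: 0.60, 64: 0.80, 128: 0.91},  # high-motion block: benefits from more
]

allocation = allocate_tokens(block_scores, budget=160)
print(allocation)
```

With these numbers the search gives the static block the smallest token count and spends the rest of the budget on the high-motion block, which is the content-aware behaviour the abstract describes.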

Title: SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

Authors: Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17012
Pdf URL: https://arxiv.org/pdf/2505.17012
Copy Paste: [[2505.17012]] SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding(https://arxiv.org/abs/2505.17012)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.

Title: When Are Concepts Erased From Diffusion Models?

Authors: Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17013
Pdf URL: https://arxiv.org/pdf/2505.17013
Copy Paste: [[2505.17013]] When Are Concepts Erased From Diffusion Models?(https://arxiv.org/abs/2505.17013)
Keywords: attack, robust, diffusion
Abstract: Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.

Title: Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Authors: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.17015
Pdf URL: https://arxiv.org/pdf/2505.17015
Copy Paste: [[2505.17015]] Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models(https://arxiv.org/abs/2505.17015)
Keywords: robust, large language model
Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

Title: Interactive Post-Training for Vision-Language-Action Models

Authors: Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
Subjects: cs.LG, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.17016
Pdf URL: https://arxiv.org/pdf/2505.17016
Copy Paste: [[2505.17016]] Interactive Post-Training for Vision-Language-Action Models(https://arxiv.org/abs/2505.17016)
Keywords: robust
Abstract: We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
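
Leave-one-out advantage estimation with sparse binary rewards can be sketched in a few lines. The version below uses the common formulation in which each rollout's baseline is the mean reward of the other rollouts in its group; whether RIPT-VLA uses exactly this form is an assumption, and the reward values are invented.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each rollout against the mean reward of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two rollouts per group")
    total = sum(rewards)
    # Baseline for rollout i excludes its own reward.
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical binary success rewards for four rollouts of one task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(rewards)
print(advantages)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, with no learned value function. When every rollout in a group has the same reward, all advantages are zero and that group contributes no gradient signal.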

Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Authors: Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17017
Pdf URL: https://arxiv.org/pdf/2505.17017
Copy Paste: [[2505.17017]] Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO(https://arxiv.org/abs/2505.17017)
Keywords: robust, large language model
Abstract: Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
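
For readers unfamiliar with the two algorithms being compared, the sketch below shows the core quantity each one optimizes, in their commonly published forms. It is not code from the paper: the log-probabilities and rewards are placeholder numbers, and GRPO implementations differ on details such as the standard-deviation estimator.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: rewards standardized within one group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder values: the policy already prefers the chosen sample slightly.
pair_loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)

# Placeholder reward-model scores for four samples of the same prompt.
group_adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(pair_loss, group_adv)
```

DPO learns from fixed chosen/rejected pairs relative to a reference model, while GRPO scores a group of freshly sampled outputs with a reward model and standardizes those scores, which is why the choice of reward model matters for both.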

Title: SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Authors: Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.17018
Pdf URL: https://arxiv.org/pdf/2505.17018
Copy Paste: [[2505.17018]] SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward(https://arxiv.org/abs/2505.17018)
Keywords: large language model
Abstract: Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.

Title: Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework

Authors: Chenhao Zhang, Yazhe Niu
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.17019
Pdf URL: https://arxiv.org/pdf/2505.17019
Copy Paste: [[2505.17019]] Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework(https://arxiv.org/abs/2505.17019)
Keywords: large language model
Abstract: Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they struggle with a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a large improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.

Title: CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Authors: Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.17020
Pdf URL: https://arxiv.org/pdf/2505.17020
Copy Paste: [[2505.17020]] CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms(https://arxiv.org/abs/2505.17020)
Keywords: large language model
Abstract: The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
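
The visual-to-visual step, in which pooled tokens act as queries against the original token set, can be sketched with plain scaled dot-product attention. This is a simplified illustration under stated assumptions: mean pooling over fixed windows, no learned query/key/value projections, and toy two-dimensional tokens. None of these details come from the paper.

```python
import math


def mean_pool(tokens, window):
    """Average consecutive groups of `window` tokens into one token each."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def cross_attention(queries, keys, values):
    """Scaled dot-product attention of each query over all keys/values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)
        outputs.append([sum(w / norm * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


# Toy "visual tokens": four tokens of dimension two.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
pooled = mean_pool(visual_tokens, window=2)
refined = cross_attention(pooled, visual_tokens, visual_tokens)
print(len(visual_tokens), "->", len(refined), "tokens")
```

Here four original tokens are reduced to two refined tokens, and the attention score matrix is 2 x 4 instead of 4 x 4, while each refined token can still draw on every original token.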