2025-06-05

Title: Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL

Authors: Zhaoyang Chen, Cody Fleming
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03154
Pdf URL: https://arxiv.org/pdf/2506.03154
Copy Paste: [[2506.03154]] Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL(https://arxiv.org/abs/2506.03154)
Keywords: diffusion
Abstract: Classifier free guidance has shown strong potential in diffusion-based reinforcement learning. However, existing methods rely on joint training of the guidance module and the diffusion model, which can be suboptimal during the early stages when the guidance is inaccurate and provides noisy learning signals. In offline RL, guidance depends solely on offline data: observations, actions, and rewards, and is independent of the policy module's behavior, suggesting that joint training is not required. This paper proposes modular training methods that decouple the guidance module from the diffusion model, based on three key findings: Guidance Necessity: We explore how the effectiveness of guidance varies with the training stage and algorithm choice, uncovering the roles of guidance and diffusion. A lack of good guidance in the early stage presents an opportunity for optimization. Guidance-First Diffusion Training: We introduce a method where the guidance module is first trained independently as a value estimator, then frozen to guide the diffusion model using classifier-free reward guidance. This modularization reduces memory usage, improves computational efficiency, and enhances both sample efficiency and final performance. Cross-Module Transferability: Applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized score variance (e.g., reducing IQR by 86%). We show that guidance modules trained with one algorithm (e.g., IDQL) can be directly reused with another (e.g., DQL), with no additional training required, demonstrating baseline-level performance as well as strong modularity and transferability. We provide theoretical justification and empirical validation on bullet D4RL benchmarks. Our findings suggest a new paradigm for offline RL: modular, reusable, and composable training pipelines.

Title: Applying MambaAttention, TabPFN, and TabTransformers to Classify SAE Automation Levels in Crashes

Authors: Shriyank Somvanshi, Anannya Ghosh Tusti, Mahmuda Sultana Mimi, Md Monzurul Islam, Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03160
Pdf URL: https://arxiv.org/pdf/2506.03160
Copy Paste: [[2506.03160]] Applying MambaAttention, TabPFN, and TabTransformers to Classify SAE Automation Levels in Crashes(https://arxiv.org/abs/2506.03160)
Keywords: robust, transformer
Abstract: The increasing presence of automated vehicles (AVs) presents new challenges for crash classification and safety analysis. Accurately identifying the SAE automation level involved in each crash is essential to understanding crash dynamics and system accountability. However, existing approaches often overlook automation-specific factors and lack model sophistication to capture distinctions between different SAE levels. To address this gap, this study evaluates the performance of three advanced tabular deep learning models MambaAttention, TabPFN, and TabTransformer for classifying SAE automation levels using structured crash data from Texas (2024), covering 4,649 cases categorized as Assisted Driving (SAE Level 1), Partial Automation (SAE Level 2), and Advanced Automation (SAE Levels 3-5 combined). Following class balancing using SMOTEENN, the models were trained and evaluated on a unified dataset of 7,300 records. MambaAttention demonstrated the highest overall performance (F1-scores: 88% for SAE 1, 97% for SAE 2, and 99% for SAE 3-5), while TabPFN excelled in zero-shot inference with high robustness for rare crash categories. In contrast, TabTransformer underperformed, particularly in detecting Partial Automation crashes (F1-score: 55%), suggesting challenges in modeling shared human-system control dynamics. These results highlight the capability of deep learning models tailored for tabular data to enhance the accuracy and efficiency of automation-level classification. Integrating such models into crash analysis frameworks can support policy development, AV safety evaluation, and regulatory decisions, especially in distinguishing high-risk conditions for mid- and high-level automation technologies.

Title: Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

Authors: Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03162
Pdf URL: https://arxiv.org/pdf/2506.03162
Copy Paste: [[2506.03162]] Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection(https://arxiv.org/abs/2506.03162)
Keywords: transformer
Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics, with continuous fusion via a gating mechanism. We also present a new benchmark by merging RWF-2000, RLVS, and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.

Title: Causal Discovery in Dynamic Fading Wireless Networks

Authors: Oluwaseyi Giwa
Subjects: cs.LG, eess.SP, stat.ME
Abstract URL: https://arxiv.org/abs/2506.03163
Pdf URL: https://arxiv.org/pdf/2506.03163
Copy Paste: [[2506.03163]] Causal Discovery in Dynamic Fading Wireless Networks(https://arxiv.org/abs/2506.03163)
Keywords: robust
Abstract: Dynamic causal discovery in wireless networks is essential due to evolving interference, fading, and mobility, which complicate traditional static causal models. This paper addresses causal inference challenges in dynamic fading wireless environments by proposing a sequential regression-based algorithm with a novel application of the NOTEARS acyclicity constraint, enabling efficient online updates. We derive theoretical lower and upper bounds on the detection delay required to identify structural changes, explicitly quantifying their dependence on network size, noise variance, and fading severity. Monte Carlo simulations validate these theoretical results, demonstrating linear increases in detection delay with network size, quadratic growth with noise variance, and inverse-square dependence on the magnitude of structural changes. Our findings provide rigorous theoretical insights and practical guidelines for designing robust online causal inference mechanisms to maintain network reliability under nonstationary wireless conditions.

Title: Test-Time Scaling of Diffusion Models via Noise Trajectory Search

Authors: Vignav Ramesh, Morteza Mardani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03164
Pdf URL: https://arxiv.org/pdf/2506.03164
Copy Paste: [[2506.03164]] Test-Time Scaling of Diffusion Models via Noise Trajectory Search(https://arxiv.org/abs/2506.03164)
Keywords: diffusion
Abstract: The iterative and stochastic nature of diffusion models enables test-time scaling, whereby spending additional compute during denoising generates higher-fidelity samples. Increasing the number of denoising steps is the primary scaling axis, but this yields quickly diminishing returns. Instead optimizing the noise trajectory--the sequence of injected noise vectors--is promising, as the specific noise realizations critically affect sample quality; but this is challenging due to a high-dimensional search space, complex noise-outcome interactions, and costly trajectory evaluations. We address this by first casting diffusion as a Markov Decision Process (MDP) with a terminal reward, showing tree-search methods such as Monte Carlo tree search (MCTS) to be meaningful but impractical. To balance performance and efficiency, we then resort to a relaxation of MDP, where we view denoising as a sequence of independent contextual bandits. This allows us to introduce an $\epsilon$-greedy search algorithm that globally explores at extreme timesteps and locally exploits during the intermediate steps where de-mixing occurs. Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation, exceeding baselines by up to $164\%$ and matching/exceeding MCTS performance. To our knowledge, this is the first practical method for test-time noise trajectory optimization of arbitrary (non-differentiable) rewards.

Title: Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs

Authors: Dawen Jiang, Zhishu Shen, Qiushi Zheng, Tiehua Zhang, Wei Xiang, Jiong Jin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03168
Pdf URL: https://arxiv.org/pdf/2506.03168
Copy Paste: [[2506.03168]] Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs(https://arxiv.org/abs/2506.03168)
Keywords: large language model
Abstract: Amid the challenges posed by global population growth and climate change, traditional agricultural Internet of Things (IoT) systems is currently undergoing a significant digital transformation to facilitate efficient big data processing. While smart agriculture utilizes artificial intelligence (AI) technologies to enable precise control, it still encounters significant challenges, including excessive reliance on agricultural expert knowledge, difficulties in fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. Large language models (LLMs), with their exceptional capabilities in knowledge acquisition and semantic understanding, provide a promising solution to address these challenges. To this end, we propose Farm-LightSeek, an edge-centric multimodal agricultural IoT data analytics framework that integrates LLMs with edge computing. This framework collects real-time farmland multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, conducts low-latency management decisions, and enables cloud collaboration for model updates. The main innovations of Farm-LightSeek include: (1) an agricultural "perception-decision-action" closed-loop architecture; (2) cross-modal adaptive monitoring; and (3)a lightweight LLM deployment strategy balancing performance and efficiency. Experiments conducted on two real-world datasets demonstrate that Farm-LightSeek consistently achieves reliable performance in mission-critical tasks, even under the limitations of edge computing resources. This work advances intelligent real-time agricultural solutions and highlights the potential for deeper integration of agricultural IoT with LLMs.

Title: Improvement of human health lifespan with hybrid group pose estimation methods

Authors: Arindam Chaudhuri
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03169
Pdf URL: https://arxiv.org/pdf/2506.03169
Copy Paste: [[2506.03169]] Improvement of human health lifespan with hybrid group pose estimation methods(https://arxiv.org/abs/2506.03169)
Keywords: robust
Abstract: Human beings rely heavily on estimation of poses in order to access their body movements. Human pose estimation methods take advantage of computer vision advances in order to track human body movements in real life applications. This comes from videos which are recorded through available devices. These para-digms provide potential to make human movement measurement more accessible to users. The consumers of pose estimation movements believe that human poses content tend to supplement available videos. This has increased pose estimation software usage to estimate human poses. In order to address this problem, we develop hybrid-ensemble-based group pose estimation method to improve human health. This proposed hybrid-ensemble-based group pose estimation method aims to detect multi-person poses using modified group pose estimation and modified real time pose estimation. This ensemble allows fusion of performance of stated methods in real time. The input poses from images are fed into individual meth-ods. The pose transformation method helps to identify relevant features for en-semble to perform training effectively. After this, customized pre-trained hybrid ensemble is trained on public benchmarked datasets which is being evaluated through test datasets. The effectiveness and viability of proposed method is estab-lished based on comparative analysis of group pose estimation methods and ex-periments conducted on benchmarked datasets. It provides best optimized results in real-time pose estimation. It makes pose estimation method more robust to oc-clusion and improves dense regression accuracy. These results have affirmed po-tential application of this method in several real-time situations with improvement in human health life span

Title: PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models

Authors: Murthy L, Subarna Tripathi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03170
Pdf URL: https://arxiv.org/pdf/2506.03170
Copy Paste: [[2506.03170]] PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.03170)
Keywords: robust, diffusion, generative
Abstract: The risk of misusing text-to-image generative models for malicious uses, especially due to the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. There has been a plethora of recent work that aim for addressing neural fingerprinting. A trade-off between the attribution accuracy and generation quality of such models has been studied extensively. None of the existing methods yet achieved $100\%$ attribution accuracy. However, any model with less than \emph{perfect} accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models leveraging the concepts of cyclic error correcting codes from the literature of coding theory.

Title: EdgeVidSum: Real-Time Personalized Video Summarization at the Edge

Authors: Ghulam Mujtaba, Eun-Seok Ryu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03171
Pdf URL: https://arxiv.org/pdf/2506.03171
Copy Paste: [[2506.03171]] EdgeVidSum: Real-Time Personalized Video Summarization at the Edge(https://arxiv.org/abs/2506.03171)
Keywords: privacy
Abstract: EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The proposed approach enables real-time video summarization while safeguarding user privacy through local data processing using innovative thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system's ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation occurs seamlessly on resource-constrained devices like Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.

Title: FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution

Authors: Xiaoyi Liu, Hao Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03173
Pdf URL: https://arxiv.org/pdf/2506.03173
Copy Paste: [[2506.03173]] FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution(https://arxiv.org/abs/2506.03173)
Keywords: robust
Abstract: Physical intelligence -- anticipating and shaping the world from partial, multisensory observations -- is critical for next-generation world models. We propose FOLIAGE, a physics-informed multimodal world model for unbounded accretive surface growth. In its Action-Perception loop, a unified context encoder maps images, mesh connectivity, and point clouds to a shared latent state. A physics-aware predictor, conditioned on physical control actions, advances this latent state in time to align with the target latent of the surface, yielding a Modality-Agnostic Growth Embedding (MAGE) that interfaces with critic heads for downstream objectives. FOLIAGE's Accretive Graph Network (AGN) captures dynamic connectivity through Age Positional Encoding and Energy-Gated Message-Passing. Geometry-Correspondence Fusion and Cross-Patch Masking enhance MAGE's expressiveness, while Hierarchical Pooling balances global context with local dynamics. We create SURF-GARDEN, a world model learning platform comprising a Counterfactual Physics Simulator, a Multimodal Correspondence Extractor, and Evolution Tracing, which generates 7,200 diverse surface-growth sequences. SURF-BENCH, our physical-intelligence evaluation suite, evaluates six core tasks -- topology recognition, inverse material estimation, growth-stage classification, latent roll-out, cross-modal retrieval, and dense correspondence -- and four stress tests -- sensor dropout, zero-shot modality transfer, long-horizon prediction, and physics ablation -- to probe resilience. FOLIAGE outperforms specialized baselines while remaining robust across dynamic environments, establishing a new world-model based, multimodal pathway to physical intelligence.

Title: Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks

Authors: Koki Matsuishi, Kosuke Ukita, Tsuyoshi Okita
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03174
Pdf URL: https://arxiv.org/pdf/2506.03174
Copy Paste: [[2506.03174]] Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks(https://arxiv.org/abs/2506.03174)
Keywords: transformer
Abstract: In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using IMU. While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis, in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short in providing a detailed analysis of full-body human activity. To address this limitation, we propose Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in the zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.

Title: Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Authors: Qi Li, Runpeng Yu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03179
Pdf URL: https://arxiv.org/pdf/2506.03179
Copy Paste: [[2506.03179]] Vid-SME: Membership Inference Attacks against Large Video Understanding Models(https://arxiv.org/abs/2506.03179)
Keywords: privacy, attack, robust, membership infer, large language model
Abstract: Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining improperly used videos during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of model output and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.

Title: Continual Learning in Vision-Language Models via Aligned Model Merging

Authors: Ghada Sokar, Gintare Karolina Dziugaite, Anurag Arnab, Ahmet Iscen, Pablo Samuel Castro, Cordelia Schmid
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03189
Pdf URL: https://arxiv.org/pdf/2506.03189
Copy Paste: [[2506.03189]] Continual Learning in Vision-Language Models via Aligned Model Merging(https://arxiv.org/abs/2506.03189)
Keywords: robust
Abstract: Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.

Title: Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward

Authors: Muhammad Islam, Tao Huang, Euijoon Ahn, Usman Naseem
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03191
Pdf URL: https://arxiv.org/pdf/2506.03191
Copy Paste: [[2506.03191]] Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward(https://arxiv.org/abs/2506.03191)
Keywords: diffusion, transformer, generative, large language model
Abstract: This paper presents an in-depth survey on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation, offering insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. Focusing exclusively on text and motion modalities, this research investigates how textual descriptions can guide the generation of complex, human-like motion sequences. The paper explores various generative approaches, including autoregressive models, diffusion models, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and transformer-based models, by analyzing their strengths and limitations in terms of motion quality, computational efficiency, and adaptability. It highlights recent advances in text-conditioned motion generation, where textual inputs are used to control and refine motion outputs with greater precision. The integration of LLMs further enhances these models by enabling semantic alignment between instructions and motion, improving coherence and contextual relevance. This systematic survey underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies, while addressing ongoing challenges in generating efficient and realistic human motion.

Title: HueManity: Probing Fine-Grained Visual Perception in MLLMs

Authors: Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03194
Pdf URL: https://arxiv.org/pdf/2506.03194
Copy Paste: [[2506.03194]] HueManity: Probing Fine-Grained Visual Perception in MLLMs(https://arxiv.org/abs/2506.03194)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric `easy' task and a striking 3% on the alphanumeric `hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs.

Title: Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Authors: Yunqi Hong, Sohyun An, Andrew Bai, Neil Y.C. Lin, Cho-Jui Hsieh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03195
Pdf URL: https://arxiv.org/pdf/2506.03195
Copy Paste: [[2506.03195]] Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs(https://arxiv.org/abs/2506.03195)
Keywords: large language model
Abstract: Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP on average improves 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: this https URL

Title: Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Yanjie Liang, Zuming Huang, Haozhe Wang, Jun Huang, Ling Chen, Wei Chu, Yuan Qi
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03197
Pdf URL: https://arxiv.org/pdf/2506.03197
Copy Paste: [[2506.03197]] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing(https://arxiv.org/abs/2506.03197)
Keywords: robust, extraction
Abstract: Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.

Title: Fingerprinting Deep Learning Models via Network Traffic Patterns in Federated Learning

Authors: Md Nahid Hasan Shuvo, Moinul Hossain
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.03207
Pdf URL: https://arxiv.org/pdf/2506.03207
Copy Paste: [[2506.03207]] Fingerprinting Deep Learning Models via Network Traffic Patterns in Federated Learning(https://arxiv.org/abs/2506.03207)
Keywords: security, privacy, attack, federate
Abstract: Federated Learning (FL) is increasingly adopted as a decentralized machine learning paradigm due to its capability to preserve data privacy by training models without centralizing user data. However, FL is susceptible to indirect privacy breaches via network traffic analysis-an area not explored in existing research. The primary objective of this research is to study the feasibility of fingerprinting deep learning models deployed within FL environments by analyzing their network-layer traffic information. In this paper, we conduct an experimental evaluation using various deep learning architectures (i.e., CNN, RNN) within a federated learning testbed. We utilize machine learning algorithms, including Support Vector Machines (SVM), Random Forest, and Gradient-Boosting, to fingerprint unique patterns within the traffic data. Our experiments show high fingerprinting accuracy, achieving 100% accuracy using Random Forest and around 95.7% accuracy using SVM and Gradient Boosting classifiers. This analysis suggests that we can identify specific architectures running within the subsection of the network traffic. Hence, if an adversary knows about the underlying DL architecture, they can exploit that information and conduct targeted attacks. These findings suggest a notable security vulnerability in FL systems and the necessity of strengthening it at the network level.

Title: FuXi-Ocean: A Global Ocean Forecasting System with Sub-Daily Resolution

Authors: Qiusheng Huang, Yuan Niu, Xiaohui Zhong, Anboyu Guo, Lei Chen, Dianjun Zhang, Xuefeng Zhang, Hao Li
Subjects: cs.LG, cs.AI, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2506.03210
Pdf URL: https://arxiv.org/pdf/2506.03210
Copy Paste: [[2506.03210]] FuXi-Ocean: A Global Ocean Forecasting System with Sub-Daily Resolution(https://arxiv.org/abs/2506.03210)
Keywords: extraction
Abstract: Accurate, high-resolution ocean forecasting is crucial for maritime operations and environmental monitoring. While traditional numerical models are capable of producing sub-daily, eddy-resolving forecasts, they are computationally intensive and face challenges in maintaining accuracy at fine spatial and temporal scales. In contrast, recent data-driven approaches offer improved computational efficiency and emerging potential, yet typically operate at daily resolution and struggle with sub-daily predictions due to error accumulation over time. We introduce FuXi-Ocean, the first data-driven global ocean forecasting model achieving six-hourly predictions at eddy-resolving 1/12° spatial resolution, reaching depths of up to 1500 meters. The model architecture integrates a context-aware feature extraction module with a predictive network employing stacked attention blocks. The core innovation is the Mixture-of-Time (MoT) module, which adaptively integrates predictions from multiple temporal contexts by learning variable-specific reliability , mitigating cumulative errors in sequential forecasting. Through comprehensive experimental evaluation, FuXi-Ocean demonstrates superior skill in predicting key variables, including temperature, salinity, and currents, across multiple depths.

Title: Channel-adaptive Cross-modal Generative Semantic Communication for Point Cloud Transmission

Authors: Wanting Yang, Zehui Xiong, Qianqian Yang, Ping Zhang, Merouane Debbah, Rahim Tafazolli
Subjects: cs.CV, cs.NI
Abstract URL: https://arxiv.org/abs/2506.03211
Pdf URL: https://arxiv.org/pdf/2506.03211
Copy Paste: [[2506.03211]] Channel-adaptive Cross-modal Generative Semantic Communication for Point Cloud Transmission(https://arxiv.org/abs/2506.03211)
Keywords: robust, extraction, diffusion, generative
Abstract: With the rapid development of autonomous driving and extended reality, efficient transmission of point clouds (PCs) has become increasingly important. In this context, we propose a novel channel-adaptive cross-modal generative semantic communication (SemCom) for PC transmission, called GenSeC-PC. GenSeC-PC employs a semantic encoder that fuses images and point clouds, where images serve as non-transmitted side information. Meanwhile, the decoder is built upon the backbone of PointDif. Such a cross-modal design not only ensures high compression efficiency but also delivers superior reconstruction performance compared to PointDif. Moreover, to ensure robust transmission and reduce system complexity, we design a streamlined and asymmetric channel-adaptive joint semantic-channel coding architecture, where only the encoder needs the feedback of average signal-to-noise ratio (SNR) and available bandwidth. In addition, rectified denoising diffusion implicit models is employed to accelerate the decoding process to the millisecond level, enabling real-time PC communication. Unlike existing methods, GenSeC-PC leverages generative priors to ensure reliable reconstruction even from noisy or incomplete source PCs. More importantly, it supports fully analog transmission, improving compression efficiency by eliminating the need for error-free side information transmission common in prior SemCom approaches. Simulation results confirm the effectiveness of cross-modal semantic extraction and dual-metric guided fine-tuning, highlighting the framework's robustness across diverse conditions, including low SNR, bandwidth limitations, varying numbers of 2D images, and previously unseen objects.

Title: ConMamba: Contrastive Vision Mamba for Plant Disease Detection

Authors: Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03213
Pdf URL: https://arxiv.org/pdf/2506.03213
Copy Paste: [[2506.03213]] ConMamba: Contrastive Vision Mamba for Plant Disease Detection(https://arxiv.org/abs/2506.03213)
Keywords: robust, transformer
Abstract: Plant Disease Detection (PDD) is a key aspect of precision agriculture. However, existing deep learning methods often rely on extensively annotated datasets, which are time-consuming and costly to generate. Self-supervised Learning (SSL) offers a promising alternative by exploiting the abundance of unlabeled data. However, most existing SSL approaches suffer from high computational costs due to convolutional neural networks or transformer-based architectures. Additionally, they struggle to capture long-range dependencies in visual representation and rely on static loss functions that fail to align local and global features effectively. To address these challenges, we propose ConMamba, a novel SSL framework specially designed for PDD. ConMamba integrates the Vision Mamba Encoder (VME), which employs a bidirectional State Space Model (SSM) to capture long-range dependencies efficiently. Furthermore, we introduce a dual-level contrastive loss with dynamic weight adjustment to optimize local-global feature alignment. Experimental results on three benchmark datasets demonstrate that ConMamba significantly outperforms state-of-the-art methods across multiple evaluation metrics. This provides an efficient and robust solution for PDD.

Title: OpenCarbon: A Contrastive Learning-based Cross-Modality Neural Approach for High-Resolution Carbon Emission Prediction Using Open Data

Authors: Jinwei Zeng, Yu Liu, Guozhen Zhang, Jingtao Ding, Yuming Lin, Jian Yuan, Yong Li
Subjects: cs.CV, cs.AI, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2506.03224
Pdf URL: https://arxiv.org/pdf/2506.03224
Copy Paste: [[2506.03224]] OpenCarbon: A Contrastive Learning-based Cross-Modality Neural Approach for High-Resolution Carbon Emission Prediction Using Open Data(https://arxiv.org/abs/2506.03224)
Keywords: extraction
Abstract: Accurately estimating high-resolution carbon emissions is crucial for effective emission governance and mitigation planning. While conventional methods for precise carbon accounting are hindered by substantial data collection efforts, the rise of open data and advanced learning techniques offers a promising solution. Once an open data-based prediction model is developed and trained, it can easily infer emissions for new areas based on available open data. To address this, we incorporate two modalities of open data, satellite images and point-of-interest (POI) data, to predict high-resolution urban carbon emissions, with satellite images providing macroscopic and static and POI data offering fine-grained and relatively dynamic functionality information. However, estimating high-resolution carbon emissions presents two significant challenges: the intertwined and implicit effects of various functionalities on carbon emissions, and the complex spatial contiguity correlations that give rise to the agglomeration effect. Our model, OpenCarbon, features two major designs that target the challenges: a cross-modality information extraction and fusion module to extract complementary functionality information from two modules and model their interactions, and a neighborhood-informed aggregation module to capture the spatial contiguity correlations. Extensive experiments demonstrate our model's superiority, with a significant performance gain of 26.6\% on R2. Further generalizability tests and case studies also show OpenCarbon's capacity to capture the intrinsic relation between urban functionalities and carbon emissions, validating its potential to empower efficient carbon governance and targeted carbon mitigation planning. Codes and data are available: this https URL.

Title: DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

Authors: Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Subjects: cs.LG, cs.AI, cs.CL, math.OC
Abstract URL: https://arxiv.org/abs/2506.03230
Pdf URL: https://arxiv.org/pdf/2506.03230
Copy Paste: [[2506.03230]] DiaBlo: Diagonal Blocks Are Sufficient For Finetuning(https://arxiv.org/abs/2506.03230)
Keywords: robust, large language model
Abstract: Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Codes are available at this https URL.

Title: BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF

Authors: Kaiwen Duan, Hongwei Yao, Yufei Chen, Ziyun Li, Tong Qiao, Zhan Qin, Cong Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03234
Pdf URL: https://arxiv.org/pdf/2506.03234
Copy Paste: [[2506.03234]] BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF(https://arxiv.org/abs/2506.03234)
Keywords: defense, attack, robust, steal
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of preference data with natural-appearing examples. Specifically, we propose BadReward, a stealthy clean-label poisoning attack targeting the reward model in multi-modal RLHF. BadReward operates by inducing feature collisions between visually contradicted preference data instances, thereby corrupting the reward model and indirectly compromising the T2I model's integrity. Unlike existing alignment poisoning techniques focused on single (text) modality, BadReward is independent of the preference annotation process, enhancing its stealth and practical threat. Extensive experiments on popular T2I models show that BadReward can consistently guide the generation towards improper outputs, such as biased or violent imagery, for targeted concepts. Our findings underscore the amplified threat landscape for RLHF in multi-modal systems, highlighting the urgent need for robust defenses. Disclaimer. This paper contains uncensored toxic content that might be offensive or disturbing to the readers.

Title: Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

Authors: Michael E. Garcia-Alcoser, Mobina GhojoghNejad, Fakrul Islam Tushar, David Kim, Kyle J. Lafata, Geoffrey D. Rubin, Joseph Y. Lo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03259
Pdf URL: https://arxiv.org/pdf/2506.03259
Copy Paste: [[2506.03259]] Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems(https://arxiv.org/abs/2506.03259)
Keywords: large language model
Abstract: Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 CT reports from 29,540 patients, with 1,789 CAP reports manually annotated across three organ systems. External validation was conducted using the CT-RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores. Results: In 12,197 Duke CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement ($\kappa$ median: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT-RATE dataset (lungs/pleura only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for lung atelectasis. Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.

Title: Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

Authors: Austin Silveria, Soham V. Govande, Daniel Y. Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03275
Pdf URL: https://arxiv.org/pdf/2506.03275
Copy Paste: [[2506.03275]] Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas(https://arxiv.org/abs/2506.03275)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in high-quality image and video generation but incur substantial compute cost at inference. A common observation is that DiT latent noise vectors change slowly across inference steps, which suggests that the DiT compute may be redundant across steps. In this paper, we aim to speed up inference by reducing this redundancy, without additional training. We first study how activations change between steps in two state-of-the-art open-source DiTs. We find that just 5-25% of the values in attention and MLP explain 70-90% of the change in activations across steps. This finding motivates our approach, Chipmunk, which uses dynamic sparsity at inference time to recompute only the fastest-changing intermediate activations, while caching the rest. Dynamic sparsity introduces two systems challenges: (1) sparse attention and MLP operations tend to underutilize GPU tensor cores; and (2) computing dynamic sparsity patterns at runtime and caching activations both introduce overhead. To address these challenges, Chipmunk first uses a voxel-based reordering of input tokens to introduce column-wise sparsity. We implement column-sparse kernels utilizing efficient sparse gathers from global to shared GPU memory, achieving a 9.3x speedup at 93% sparsity compared to highly-optimized dense baselines. Second, Chipmunk overlaps the computation of sparsity patterns and cache updates with other parts of the computation (e.g., second layer of the MLP) to hide the extra latency. Chipmunk achieves up to 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev without compromising generation quality. Furthermore, we show that Chipmunk can be stacked on top of full step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.25x speedup on FLUX.1-dev with minimal quality impact.

Title: FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Authors: Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, Jayant Kalagnanam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03278
Pdf URL: https://arxiv.org/pdf/2506.03278
Copy Paste: [[2506.03278]] FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes(https://arxiv.org/abs/2506.03278)
Keywords: large language model
Abstract: We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the Industrial knowledge of over a dozen LLMs-including GPT-4, Llama, and Mistral-on FailureSensorIQ from different lens using Perturbation-Uncertainty-Complexity analysis, Expert Evaluation study, Asset-Specific Knowledge Gap analysis, ReAct agent using external knowledge-bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance that is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at this https URL.

Title: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Authors: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03295
Pdf URL: https://arxiv.org/pdf/2506.03295
Copy Paste: [[2506.03295]] Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem(https://arxiv.org/abs/2506.03295)
Keywords: robust
Abstract: We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT show an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass the results from RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.

Title: From Instructions to ODRL Usage Policies: An Ontology Guided Approach

Authors: Daham M. Mustafa, Abhishek Nadgeri, Diego Collarana, Benedikt T. Arnold, Christoph Quix, Christoph Lange, Stefan Decker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03301
Pdf URL: https://arxiv.org/pdf/2506.03301
Copy Paste: [[2506.03301]] From Instructions to ODRL Usage Policies: An Ontology Guided Approach(https://arxiv.org/abs/2506.03301)
Keywords: large language model
Abstract: This study presents an approach that uses large language models such as GPT-4 to generate usage policies in the W3C Open Digital Rights Language ODRL automatically from natural language instructions. Our approach uses the ODRL ontology and its documentation as a central part of the prompt. Our research hypothesis is that a curated version of existing ontology documentation will better guide policy generation. We present various heuristics for adapting the ODRL ontology and its documentation to guide an end-to-end KG construction process. We evaluate our approach in the context of dataspaces, i.e., distributed infrastructures for trustworthy data exchange between multiple participating organizations for the cultural domain. We created a benchmark consisting of 12 use cases of varying complexity. Our evaluation shows excellent results with up to 91.95% accuracy in the resulting knowledge graph.

Title: Multi-Exit Kolmogorov-Arnold Networks: enhancing accuracy and parsimony

Authors: James Bagrow, Josh Bongard
Subjects: cs.LG, cs.NE, physics.data-an, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03302
Pdf URL: https://arxiv.org/pdf/2506.03302
Copy Paste: [[2506.03302]] Multi-Exit Kolmogorov-Arnold Networks: enhancing accuracy and parsimony(https://arxiv.org/abs/2506.03302)
Keywords: interpretability
Abstract: Kolmogorov-Arnold Networks (KANs) uniquely combine high accuracy with interpretability, making them valuable for scientific modeling. However, it is unclear a priori how deep a network needs to be for any given task, and deeper KANs can be difficult to optimize. Here we introduce multi-exit KANs, where each layer includes its own prediction branch, enabling the network to make accurate predictions at multiple depths simultaneously. This architecture provides deep supervision that improves training while discovering the right level of model complexity for each task. Multi-exit KANs consistently outperform standard, single-exit versions on synthetic functions, dynamical systems, and real-world datasets. Remarkably, the best predictions often come from earlier, simpler exits, revealing that these networks naturally identify smaller, more parsimonious and interpretable models without sacrificing accuracy. To automate this discovery, we develop a differentiable "learning to exit" algorithm that balances contributions from exits during training. Our approach offers scientists a practical way to achieve both high performance and interpretability, addressing a fundamental challenge in machine learning for scientific discovery.

Title: Hermes: High-Performance Homomorphically Encrypted Vector Databases

Authors: Dongfang Zhao
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2506.03308
Pdf URL: https://arxiv.org/pdf/2506.03308
Copy Paste: [[2506.03308]] Hermes: High-Performance Homomorphically Encrypted Vector Databases(https://arxiv.org/abs/2506.03308)
Keywords: secure
Abstract: Fully Homomorphic Encryption (FHE) has long promised the ability to compute over encrypted data without revealing sensitive contents -- a foundational goal for secure cloud analytics. Yet despite decades of cryptographic advances, practical integration of FHE into real-world relational databases remains elusive. This paper presents \textbf{Hermes}, the first system to enable FHE-native vector query processing inside a standard SQL engine. By leveraging the multi-slot capabilities of modern schemes, Hermes introduces a novel data model that packs multiple records per ciphertext and embeds encrypted auxiliary statistics (e.g., local sums) to support in-place updates and aggregation. To reconcile ciphertext immutability with record-level mutability, we develop new homomorphic algorithms based on slot masking, shifting, and rewriting. Hermes is implemented as native C++ loadable functions in MySQL using OpenFHE v1.2.4, comprising over 3,500 lines of code. Experiments on real-world datasets show up to 1{,}600$\times$ throughput gain in encryption and over 30$\times$ speedup in insertion compared to per-tuple baselines. Hermes brings FHE from cryptographic promise to practical reality -- realizing a long-standing vision at the intersection of databases and secure computation.

Title: The Future of Continual Learning in the Era of Foundation Models: Three Key Directions

Authors: Jack Bell, Luigi Quarantiello, Eric Nuertey Coleman, Lanpei Li, Malio Li, Mauro Madeddu, Elia Piccoli, Vincenzo Lomonaco
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03320
Pdf URL: https://arxiv.org/pdf/2506.03320
Copy Paste: [[2506.03320]] The Future of Continual Learning in the Era of Foundation Models: Three Key Directions(https://arxiv.org/abs/2506.03320)
Keywords: robust, large language model
Abstract: Continual learning--the ability to acquire, retain, and refine knowledge over time--has always been fundamental to intelligence, both human and artificial. Historically, different AI paradigms have acknowledged this need, albeit with varying priorities: early expert and production systems focused on incremental knowledge consolidation, while reinforcement learning emphasised dynamic adaptation. With the rise of deep learning, deep continual learning has primarily focused on learning robust and reusable representations over time to solve sequences of increasingly complex tasks. However, the emergence of Large Language Models (LLMs) and foundation models has raised the question: Do we still need continual learning when centralised, monolithic models can tackle diverse tasks with access to internet-scale knowledge? We argue that continual learning remains essential for three key reasons: (i) continual pre-training is still necessary to ensure foundation models remain up to date, mitigating knowledge staleness and distribution shifts while integrating new information; (ii) continual fine-tuning enables models to specialise and personalise, adapting to domain-specific tasks, user preferences, and real-world constraints without full retraining, avoiding the need for computationally expensive long context-windows; (iii) continual compositionality offers a scalable and modular approach to intelligence, enabling the orchestration of foundation models and agents to be dynamically composed, recombined, and adapted. While continual pre-training and fine-tuning are explored as niche research directions, we argue it is continual compositionality that will mark the rebirth of continual learning. The future of AI will not be defined by a single static model but by an ecosystem of continually evolving and interacting models, making continual learning more relevant than ever.

Title: Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity

Authors: Yide Ran, Wentao Guo, Jingwei Sun, Yanzhou Pan, Xiaodong Yu, Hao Wang, Jianwen Xie, Yiran Chen, Denghui Zhang, Zhaozhuo Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03337
Pdf URL: https://arxiv.org/pdf/2506.03337
Copy Paste: [[2506.03337]] Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity(https://arxiv.org/abs/2506.03337)
Keywords: federate, large language model
Abstract: Federated Learning enables collaborative fine-tuning of Large Language Models (LLMs) across decentralized Non-Independent and Identically Distributed (Non-IID) clients, but such models' massive parameter sizes lead to significant memory and communication challenges. This work introduces Meerkat, a sparse zeroth-order optimization (ZO) method designed for federated LLM fine-tuning. By limiting fine-tuning to a transferable, static, extremely sparse subset of parameters, Meerkat achieves remarkable communication efficiency, enabling cost-effective high-frequency synchronization. With theoretical analysis and experiments, we show that this high-frequency communication effectively mitigates Non-IID data challenges and leads to superior performance compared to full-parameter ZO. Furthermore, experiment results show that Meerkat outperforms existing sparsity baselines with better performance at the same communication frequency. To further handle Non-IID drift, Meerkat leverages traceable local updates and forms a virtual path for each client. This virtual path mechanism reveals the GradIP phenomenon: the inner products between LLM pre-training gradients maintained by server and client gradients estimated via ZO converges for extreme Non-IID clients but oscillates for IID ones. This distinct behavior provides a signal for identifying clients with extreme data heterogeneity. Using this signal, Meerkat-vp is proposed to analyze GradIP trajectories to identify extreme Non-IID clients and applies early stopping to enhance aggregated model quality. Experiments confirm that Meerkat and Meerkat-vp significantly improve the efficiency and effectiveness of ZO federated LLM fine-tuning.

Title: Semiconductor SEM Image Defect Classification Using Supervised and Semi-Supervised Learning with Vision Transformers

Authors: Chien-Fu (Frank)Huang, Katherine Sieg, Leonid Karlinksy, Nash Flores, Rebekah Sheraw, Xin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03345
Pdf URL: https://arxiv.org/pdf/2506.03345
Copy Paste: [[2506.03345]] Semiconductor SEM Image Defect Classification Using Supervised and Semi-Supervised Learning with Vision Transformers(https://arxiv.org/abs/2506.03345)
Keywords: transformer
Abstract: Controlling defects in semiconductor processes is important for maintaining yield, improving production cost, and preventing time-dependent critical component failures. Electron beam-based imaging has been used as a tool to survey wafers in the line and inspect for defects. However, manual classification of images for these nano-scale defects is limited by time, labor constraints, and human biases. In recent years, deep learning computer vision algorithms have shown to be effective solutions for image-based inspection applications in industry. This work proposes application of vision transformer (ViT) neural networks for automatic defect classification (ADC) of scanning electron microscope (SEM) images of wafer defects. We evaluated our proposed methods on 300mm wafer semiconductor defect data from our fab in IBM Albany. We studied 11 defect types from over 7400 total images and investigated the potential of transfer learning of DinoV2 and semi-supervised learning for improved classification accuracy and efficient computation. We were able to achieve classification accuracies of over 90% with less than 15 images per defect class. Our work demonstrates the potential to apply the proposed framework for a platform agnostic in-house classification tool with faster turnaround time and flexibility.

Title: Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03355
Pdf URL: https://arxiv.org/pdf/2506.03355
Copy Paste: [[2506.03355]] Robustness in Both Domains: CLIP Needs a Robust Text Encoder(https://arxiv.org/abs/2506.03355)
Keywords: attack, robust, diffusion, generative
Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been done towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we cover this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.

Title: Ask a Local: Detecting Hallucinations With Specialized Model Divergence

Authors: Aldan Creo, Héctor Cerezo-Costas, Pedro Alonso-Doval, Maximiliano Hormazábal-Lagos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03357
Pdf URL: https://arxiv.org/pdf/2506.03357
Copy Paste: [[2506.03357]] Ask a Local: Detecting Hallucinations With Specialized Model Divergence(https://arxiv.org/abs/2506.03357)
Keywords: large language model
Abstract: Hallucinations in large language models (LLMs) - instances where models generate plausible but factually incorrect information - present a significant challenge for AI. We introduce "Ask a Local", a novel hallucination detection method exploiting the intuition that specialized models exhibit greater surprise when encountering domain-specific inaccuracies. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our method is particularly well-suited for a multilingual context, as it naturally scales to multiple languages without the need for adaptation, relying on external data sources, or performing training. Moreover, we select computationally efficient models, providing a scalable solution that can be applied to a wide range of languages and domains. Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages, with Intersection-over-Union (IoU) scores around 0.3 and comparable Spearman correlation values. Our model shows particularly strong performance on Italian and Catalan, with IoU scores of 0.42 and 0.38, respectively, while maintaining cross-lingual effectiveness without language-specific adaptations. We release our code and architecture to facilitate further research in multilingual hallucination detection.

Title: A Multimodal, Multilingual, and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation

Authors: Zihui Ma, Lingyao Li, Juan Li, Wenyue Hua, Jingxiao Liu, Qingyuan Feng, Yuki Miura
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.03360
Pdf URL: https://arxiv.org/pdf/2506.03360
Copy Paste: [[2506.03360]] A Multimodal, Multilingual, and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation(https://arxiv.org/abs/2506.03360)
Keywords: large language model
Abstract: Rapid, fine-grained disaster damage assessment is essential for effective emergency response, yet remains challenging due to limited ground sensors and delays in official reporting. Social media provides a rich, real-time source of human-centric observations, but its multimodal and unstructured nature presents challenges for traditional analytical methods. In this study, we propose a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that leverages multimodal large language models (MLLMs) to assess disaster impacts. We evaluate three foundation models across two major earthquake events using both macro- and micro-level analyses. Results show that MLLMs effectively integrate image-text signals and demonstrate a strong correlation with ground-truth seismic data. However, performance varies with language, epicentral distance, and input modality. This work highlights the potential of MLLMs for disaster assessment and provides a foundation for future research in applying MLLMs to real-time crisis contexts. The code and data are released at: this https URL

Title: Comparison of different Unique hard attention transformer models by the formal languages they can recognize

Authors: Leonid Ryvkin
Subjects: cs.LG, cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2506.03370
Pdf URL: https://arxiv.org/pdf/2506.03370
Copy Paste: [[2506.03370]] Comparison of different Unique hard attention transformer models by the formal languages they can recognize(https://arxiv.org/abs/2506.03370)
Keywords: transformer
Abstract: This note is a survey of various results on the capabilities of unique hard attention transformers encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.

Title: Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views

Authors: Xiaonan Wang, Bo Shao, Hansaem Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03371
Pdf URL: https://arxiv.org/pdf/2506.03371
Copy Paste: [[2506.03371]] Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views(https://arxiv.org/abs/2506.03371)
Keywords: privacy
Abstract: Recent advances in vision-language models (VLMs) have enabled accurate image-based geolocation, raising serious concerns about location privacy risks in everyday social media posts. However, current benchmarks remain coarse-grained, linguistically biased, and lack multimodal and privacy-aware evaluations. To address these gaps, we present KoreaGEO Bench, the first fine-grained, multimodal geolocation benchmark for Korean street views. Our dataset comprises 1,080 high-resolution images sampled across four urban clusters and nine place types, enriched with multi-contextual annotations and two styles of Korean captions simulating real-world privacy exposure. We introduce a three-path evaluation protocol to assess ten mainstream VLMs under varying input modalities and analyze their accuracy, spatial bias, and reasoning behavior. Results reveal modality-driven shifts in localization precision and highlight structural prediction biases toward core cities.

Title: A Foundation Model for Spatial Proteomics

Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, Cristina Almagro-Perez, Sophia J. Wagner, Sharifa Sahai, Ming Y. Lu, Richard J. Chen, Andrew Zhang, Mark Edward M. Gonzales, Ahmad Makky, Jia-Ying Joey Lee, Hao Cheng, Nourhan El Ahmar, Sayed Matar, Maximilian Haist, Darci Phillips, Yuqi Tan, Garry P. Nolan, W. Richard Burack, Jacob D. Estes, Jonathan T.C. Liu, Toni K Choueiri, Neeraj Agarwal, Marc Barry, Scott J. Rodig, Long Phi Le, Georg Gerber, Christian M. Schürch, Fabian J. Theis, Youn H Kim, Joe Yeong, Sabina Signoretti, Brooke E. Howitt, Lit-Hsin Loo, Qin Ma, Sizun Jiang, Faisal Mahmood
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03373
Pdf URL: https://arxiv.org/pdf/2506.03373
Copy Paste: [[2506.03373]] A Foundation Model for Spatial Proteomics(https://arxiv.org/abs/2506.03373)
Keywords: segmentation
Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at this https URL.

Title: Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery

Authors: Pengyu Chen, Xiao Huang, Teng Fei, Sicheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03388
Pdf URL: https://arxiv.org/pdf/2506.03388
Copy Paste: [[2506.03388]] Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery(https://arxiv.org/abs/2506.03388)
Keywords: segmentation
Abstract: Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony--Geophony--Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.

Title: Trajectory Prediction Meets Large Language Models: A Survey

Authors: Yi Xu, Ruining Yang, Yitian Zhang, Yizhou Wang, Jianglin Lu, Mingyuan Zhang, Lili Su, Yun Fu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03408
Pdf URL: https://arxiv.org/pdf/2506.03408
Copy Paste: [[2506.03408]] Trajectory Prediction Meets Large Language Models: A Survey(https://arxiv.org/abs/2506.03408)
Keywords: interpretability, large language model
Abstract: Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.

Title: Technical Options for Flexible Hardware-Enabled Guarantees

Authors: James Petrie, Onni Aarne
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.03409
Pdf URL: https://arxiv.org/pdf/2506.03409
Copy Paste: [[2506.03409]] Technical Options for Flexible Hardware-Enabled Guarantees(https://arxiv.org/abs/2506.03409)
Keywords: secure, security, protect
Abstract: Frontier AI models pose increasing risks to public safety and international security, creating a pressing need for AI developers to provide credible guarantees about their development activities without compromising proprietary information. We propose Flexible Hardware-Enabled Guarantees (flexHEG), a system integrated with AI accelerator hardware to enable verifiable claims about compute usage in AI development. The flexHEG system consists of two primary components: an auditable Guarantee Processor that monitors accelerator usage and verifies compliance with specified rules, and a Secure Enclosure that provides physical tamper protection. In this report, we analyze technical implementation options ranging from firmware modifications to custom hardware approaches, with focus on an "Interlock" design that provides the Guarantee Processor direct access to accelerator data paths. Our proposed architecture could support various guarantee types, from basic usage auditing to sophisticated automated verification. This work establishes technical foundations for hardware-based AI governance mechanisms that could be deployed by 2027 to address emerging regulatory and international security needs in frontier AI development.

Title: DistRAG: Towards Distance-Based Spatial Reasoning in LLMs

Authors: Nicole R Schneider, Nandini Ramachandran, Kent O'Sullivan, Hanan Samet
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.03424
Pdf URL: https://arxiv.org/pdf/2506.03424
Copy Paste: [[2506.03424]] DistRAG: Towards Distance-Based Spatial Reasoning in LLMs(https://arxiv.org/abs/2506.03424)
Keywords: large language model
Abstract: Many real world tasks where Large Language Models (LLMs) can be used require spatial reasoning, like Point of Interest (POI) recommendation and itinerary planning. However, on their own LLMs lack reliable spatial reasoning capabilities, especially about distances. To address this problem, we develop a novel approach, DistRAG, that enables an LLM to retrieve relevant spatial information not explicitly learned during training. Our method encodes the geodesic distances between cities and towns in a graph and retrieves a context subgraph relevant to the question. Using this technique, our method enables an LLM to answer distance-based reasoning questions that it otherwise cannot answer. Given the vast array of possible places an LLM could be asked about, DistRAG offers a flexible first step towards providing a rudimentary `world model' to complement the linguistic knowledge held in LLMs.

Title: Adaptive Task Vectors for Large Language Models

Authors: Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi Lee, Kyungwoo Song
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03426
Pdf URL: https://arxiv.org/pdf/2506.03426
Copy Paste: [[2506.03426]] Adaptive Task Vectors for Large Language Models(https://arxiv.org/abs/2506.03426)
Keywords: large language model
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors (ATV), a simple and effective framework that dynamically generates task vectors conditioned on each input query. ATV employs a small language model to generate task vectors, which are then transformed to match the target LLM's architecture and applied to guide its output generation. In contrast to ICL and previous vector-based approaches, which rely on fixed demonstration sets and their corresponding vectors, ATV dynamically generates task vectors tailored to each specific input query and task. Consequently, ATV demonstrates strong performance and generalization capabilities, even for unseen tasks. Furthermore, we provide a theoretical analysis indicating that ATV is expressively equivalent to LoRA under equal rank budgets and more expressive than Prefix-Tuning, thereby offering formal support for its representational advantage.

Title: ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Authors: Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong, Liu Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03433
Pdf URL: https://arxiv.org/pdf/2506.03433
Copy Paste: [[2506.03433]] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads(https://arxiv.org/abs/2506.03433)
Keywords: segmentation
Abstract: Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between convolutional neural network (CNN) and VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, task head and prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time up to $4\times$ while achieving comparable or even better results on ADE20K, compared to other VFM adapters.

Title: Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models

Authors: Ahmad Dawar Hakimi, Ali Modarressi, Philipp Wicke, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03434
Pdf URL: https://arxiv.org/pdf/2506.03434
Copy Paste: [[2506.03434]] Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models(https://arxiv.org/abs/2506.03434)
Keywords: interpretability, large language model
Abstract: Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its attention heads and feed forward networks (FFNs) over the course of pre-training. We classify these components into four roles: general, entity, relation-answer, and fact-answer specific, and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, attention heads display the highest turnover. We also present evidence that FFNs remain more stable throughout training. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs.

Title: Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos

Authors: Tanqiu Qiao, Ruochen Li, Frederick W. B. Li, Yoshiki Kubotani, Shigeo Morishima, Hubert P. H. Shum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03440
Pdf URL: https://arxiv.org/pdf/2506.03440
Copy Paste: [[2506.03440]] Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos(https://arxiv.org/abs/2506.03440)
Keywords: robust
Abstract: Human-Object Interaction (HOI) recognition in videos requires understanding both visual patterns and geometric relationships as they evolve over time. Visual and geometric features offer complementary strengths. Visual features capture appearance context, while geometric features provide structural patterns. Effectively fusing these multimodal features without compromising their unique characteristics remains challenging. We observe that establishing robust, entity-specific representations before modeling interactions helps preserve the strengths of each modality. Therefore, we hypothesize that a bottom-up approach is crucial for effective multimodal fusion. Following this insight, we propose the Geometric Visual Fusion Graph Neural Network (GeoVis-GNN), which uses dual-attention feature fusion combined with interdependent entity graph learning. It progressively builds from entity-specific representations toward high-level interaction understanding. To advance HOI recognition to real-world scenarios, we introduce the Concurrent Partial Interaction Dataset (MPHOI-120). It captures dynamic multi-person interactions involving concurrent actions and partial engagement. This dataset helps address challenges like complex human-object dynamics and mutual occlusions. Extensive experiments demonstrate the effectiveness of our method across various HOI scenarios. These scenarios include two-person interactions, single-person activities, bimanual manipulations, and complex concurrent partial interactions. Our method achieves state-of-the-art performance.

Title: RoNFA: Robust Neural Field-based Approach for Few-Shot Image Classification with Noisy Labels

Authors: Nan Xiang, Lifeng Xing, Dequan Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03461
Pdf URL: https://arxiv.org/pdf/2506.03461
Copy Paste: [[2506.03461]] RoNFA: Robust Neural Field-based Approach for Few-Shot Image Classification with Noisy Labels(https://arxiv.org/abs/2506.03461)
Keywords: robust
Abstract: In few-shot learning (FSL), the labeled samples are scarce. Thus, label errors can significantly reduce classification accuracy. Since label errors are inevitable in realistic learning tasks, improving the robustness of the model in the presence of label errors is critical. This paper proposes a new robust neural field-based image approach (RoNFA) for few-shot image classification with noisy labels. RoNFA consists of two neural fields for feature and category representation. They correspond to the feature space and category set. Each neuron in the field for category representation (FCR) has a receptive field (RF) on the field for feature representation (FFR) centered at the representative neuron for its category generated by soft clustering. In the prediction stage, the range of these receptive fields adapts according to the neuronal activation in FCR to ensure prediction accuracy. These learning strategies provide the proposed model with excellent few-shot learning capability and strong robustness against label noises. The experimental results on real-world FSL datasets with three different types of label noise demonstrate that the proposed method significantly outperforms state-of-the-art FSL methods. Its accuracy obtained in the presence of noisy labels even surpasses the results obtained by state-of-the-art FSL methods trained on clean support sets, indicating its strong robustness against noisy labels.

Title: Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer's Disease Detection

Authors: Chuyuan Li, Raymond Li, Thalia S. Field, Giuseppe Carenini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03476
Pdf URL: https://arxiv.org/pdf/2506.03476
Copy Paste: [[2506.03476]] Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer's Disease Detection(https://arxiv.org/abs/2506.03476)
Keywords: generative, large language model
Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that leads to dementia, and early intervention can greatly benefit from analyzing linguistic abnormalities. In this work, we explore the potential of Large Language Models (LLMs) as health assistants for AD diagnosis from patient-generated text using in-context learning (ICL), where tasks are defined through a few input-output examples. Empirical results reveal that conventional ICL methods, such as similarity-based selection, perform poorly for AD diagnosis, likely due to the inherent complexity of this task. To address this, we introduce Delta-KNN, a novel demonstration selection strategy that enhances ICL performance. Our method leverages a delta score to assess the relative gains of each training example, coupled with a KNN-based retriever that dynamically selects optimal "representatives" for a given input. Experiments on two AD detection datasets across three open-source LLMs demonstrate that Delta-KNN consistently outperforms existing ICL baselines. Notably, when using the Llama-3.1 model, our approach achieves new state-of-the-art results, surpassing even supervised classifiers.

Title: APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training

Authors: Jun Rao, Zepeng Lin, Xuebo Liu, Xiaopeng Ke, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03483
Pdf URL: https://arxiv.org/pdf/2506.03483
Copy Paste: [[2506.03483]] APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training(https://arxiv.org/abs/2506.03483)
Keywords: large language model
Abstract: Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance with self-generated dis-preferred weakness data (bad cases and similar cases). APT uniquely focuses on training the model using only those samples where errors occur, alongside a small, similar set of samples retrieved for this purpose. This targeted training minimizes interference with the model's existing knowledge base, effectively retaining generic capabilities. Experimental results on the LLama-2 and Mistral-V0.3 models across various benchmarks demonstrate that APT ensures no reduction in generic capacity and achieves superior performance on downstream tasks compared to various existing methods. This validates our method as an effective strategy for enhancing domain-specific capabilities without sacrificing the model's broader applicability.

Title: Explainable AI: XAI-Guided Context-Aware Data Augmentation

Authors: Melkamu Abay Mersha, Mesay Gemeda Yigezu, Atnafu Lambebo Tonja, Hassan Shakil, Samer Iskander, Olga Kolesnikova, Jugal Kalita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03484
Pdf URL: https://arxiv.org/pdf/2506.03484
Copy Paste: [[2506.03484]] Explainable AI: XAI-Guided Context-Aware Data Augmentation(https://arxiv.org/abs/2506.03484)
Keywords: robust, interpretability, explainability
Abstract: Explainable AI (XAI) has emerged as a powerful tool for improving the performance of AI models, going beyond providing model transparency and interpretability. The scarcity of labeled data remains a fundamental challenge in developing robust and generalizable AI models, particularly for low-resource languages. Conventional data augmentation techniques introduce noise, cause semantic drift, disrupt contextual coherence, lack control, and lead to overfitting. To address these challenges, we propose XAI-Guided Context-Aware Data Augmentation. This novel framework leverages XAI techniques to modify less critical features while selectively preserving most task-relevant features. Our approach integrates an iterative feedback loop, which refines augmented data over multiple augmentation cycles based on explainability-driven insights and the model performance gain. Our experimental results demonstrate that XAI-SR-BT and XAI-PR-BT improve the accuracy of models on hate speech and sentiment analysis tasks by 6.6% and 8.1%, respectively, compared to the baseline, using the Amharic dataset with the XLM-R model. XAI-SR-BT and XAI-PR-BT outperform existing augmentation techniques by 4.8% and 5%, respectively, on the same dataset and model. Overall, XAI-SR-BT and XAI-PR-BT consistently outperform both baseline and conventional augmentation techniques across all tasks and models. This study provides a more controlled, interpretable, and context-aware solution to data augmentation, addressing critical limitations of existing augmentation techniques and offering a new paradigm shift for leveraging XAI techniques to enhance AI model training.

Title: EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding

Authors: Mingxu Tao, Jie Hu, Mingchuan Yang, Yunhuai Liu, Dongyan Zhao, Yansong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03489
Pdf URL: https://arxiv.org/pdf/2506.03489
Copy Paste: [[2506.03489]] EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding(https://arxiv.org/abs/2506.03489)
Keywords: robust, large language model
Abstract: The remarkable performance of Large language models (LLMs) relies heavily on the availability of abundant high-quality training data. However, the high cost of acquiring annotated data often prevents models from obtaining capabilities to tackle downstream tasks. In this paper, we introduce a novel method, EpiCoDe that boosts model performance in data-scarcity scenarios without extra training. We first employ model extrapolation to enhance a finetuned model with its inferior version, and then adopt contrastive decoding to further reduce predicted errors, by comparing the logit scores given by the extrapolated and the vanilla finetuned model. Experiments across three tasks over four different LLMs show that EpiCoDe consistently outperforms existing methods with significant and robust improvement. We also propose a new theoretical framework to reveal the mechanism behind contrastive decoding in data-scarcity scenarios, which further helps us better understand the effectiveness of EpiCoDe.

Title: Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing

Authors: Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang, Shirui Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03490
Pdf URL: https://arxiv.org/pdf/2506.03490
Copy Paste: [[2506.03490]] Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing(https://arxiv.org/abs/2506.03490)
Keywords: large language model
Abstract: Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite the effectiveness in general-domain benchmarks, their applicability to complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.

Title: Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing

Authors: Yuchen Guo, Zhicheng Dou, Huy H. Nguyen, Ching-Chun Chang, Saku Sugawara, Isao Echizen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03501
Pdf URL: https://arxiv.org/pdf/2506.03501
Copy Paste: [[2506.03501]] Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing(https://arxiv.org/abs/2506.03501)
Keywords: robust, generative, large language model
Abstract: Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at this https URL

Title: CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model

Authors: Yuxuan Chen, Haipeng Xie
Subjects: cs.CV, eess.SY
Abstract URL: https://arxiv.org/abs/2506.03502
Pdf URL: https://arxiv.org/pdf/2506.03502
Copy Paste: [[2506.03502]] CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model(https://arxiv.org/abs/2506.03502)
Keywords: diffusion, generative
Abstract: The denoising diffusion probabilistic model has become a mainstream generative model, achieving significant success in various computer vision tasks. Recently, there has been initial exploration of applying diffusion models to time series tasks. However, existing studies still face challenges in multi-scale feature alignment and generative capabilities across different entities and long-time scales. In this paper, we propose CHIME, a conditional hallucination and integrated multi-scale enhancement framework for time series diffusion models. By employing multi-scale decomposition and adaptive integration, CHIME captures the decomposed features of time series, achieving in-domain distribution alignment between generated and original samples. In addition, we introduce a feature hallucination module in the conditional denoising process, enabling the transfer of temporal features through the training of category-independent transformation layers. Experimental results on publicly available real-world datasets demonstrate that CHIME achieves state-of-the-art performance and exhibits excellent generative generalization capabilities in few-shot scenarios.

Title: Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information

Authors: Seungcheol Park, Sojin Lee, Jongjin Kim, Jinsik Lee, Hyunjik Jo, U Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03510
Pdf URL: https://arxiv.org/pdf/2506.03510
Copy Paste: [[2506.03510]] Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information(https://arxiv.org/abs/2506.03510)
Keywords: large language model
Abstract: How can we accelerate large language models(LLMs) without sacrificing accuracy? The slow inference speed of LLMs hinders us to benefit from their remarkable performance in diverse applications. This is mainly because numerous sublayers are stacked together in LLMs. Sublayer pruning compresses and expedites LLMs via removing unnecessary sublayers. However, existing sublayer pruning algorithms are limited in accuracy since they naively select sublayers to prune, overlooking the different characteristics of each sublayer. In this paper, we propose SPRINT (Sublayer PRuning wIth LateNcy and Tunability Information), an accurate sublayer pruning method for LLMs. SPRINT accurately selects a target sublayer to prune by considering 1) the amount of latency reduction after pruning and 2) the tunability of sublayers. SPRINT iteratively prunes redundant sublayers and swiftly tunes the parameters of remaining sublayers. Experiments show that SPRINT achieves the best accuracy-speedup trade-off, exhibiting up to 23.88%p higher accuracy on zero-shot commonsense reasoning benchmarks compared to existing pruning algorithms.

Title: DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03517
Pdf URL: https://arxiv.org/pdf/2506.03517
Copy Paste: [[2506.03517]] DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models(https://arxiv.org/abs/2506.03517)
Keywords: diffusion
Abstract: Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.

Title: Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation

Authors: Weinan He, Zilei Wang, Yixin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03521
Pdf URL: https://arxiv.org/pdf/2506.03521
Copy Paste: [[2506.03521]] Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation(https://arxiv.org/abs/2506.03521)
Keywords: robust
Abstract: Universal Domain Adaptation (UniDA) focuses on transferring source domain knowledge to the target domain under both domain shift and unknown category shift. Its main challenge lies in identifying common class samples and aligning them. Current methods typically obtain target domain semantics centers from an unconstrained continuous image representation space. Due to domain shift and the unknown number of clusters, these centers often result in complex and less robust alignment algorithm. In this paper, based on vision-language models, we search for semantic centers in a semantically meaningful and discrete text representation space. The constrained space ensures almost no domain bias and appropriate semantic granularity for these centers, enabling a simple and robust adaptation algorithm. Specifically, we propose TArget Semantics Clustering (TASC) via Text Representations, which leverages information maximization as a unified objective and involves two stages. First, with the frozen encoders, a greedy search-based framework is used to search for an optimal set of text embeddings to represent target semantics. Second, with the search results fixed, encoders are refined based on gradient descent, simultaneously achieving robust domain alignment and private class clustering. Additionally, we propose Universal Maximum Similarity (UniMS), a scoring function tailored for detecting open-set samples in UniDA. Experimentally, we evaluate the universality of UniDA algorithms under four category shift scenarios. Extensive experiments on four benchmarks demonstrate the effectiveness and robustness of our method, which has achieved state-of-the-art performance.

Title: Path Generation and Evaluation in Video Games: A Nonparametric Statistical Approach

Authors: Daniel Campa, Mehdi Saeedi, Ian Colbert, Srinjoy Das
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03522
Pdf URL: https://arxiv.org/pdf/2506.03522
Copy Paste: [[2506.03522]] Path Generation and Evaluation in Video Games: A Nonparametric Statistical Approach(https://arxiv.org/abs/2506.03522)
Keywords: interpretability, generative
Abstract: Navigation path traces play a crucial role in video game design, serving as a vital resource for both enhancing player engagement and fine-tuning non-playable character behavior. Generating such paths with human-like realism can enrich the overall gaming experience, and evaluating path traces can provide game designers insights into player interactions. Despite the impressive recent advancements in deep learning-based generative modeling, the video game industry hesitates to adopt such models for path generation, often citing their complex training requirements and interpretability challenges. To address these problems, we propose a novel path generation and evaluation approach that is grounded in principled nonparametric statistics and provides precise control while offering interpretable insights. Our path generation method fuses two statistical techniques: (1) nonparametric model-free transformations that capture statistical characteristics of path traces through time; and (2) copula models that capture statistical dependencies in space. For path evaluation, we adapt a nonparametric three-sample hypothesis test designed to determine if the generated paths are overfit (mimicking the original data too closely) or underfit (diverging too far from it). We demonstrate the precision and reliability of our proposed methods with empirical analysis on two existing gaming benchmarks to showcase controlled generation of diverse navigation paths. Notably, our novel path generator can be fine-tuned with user controllable parameters to create navigation paths that exhibit varying levels of human-likeness in contrast to those produced by neural network-based agents. The code is available at this https URL.

Title: TokAlign: Efficient Vocabulary Adaptation via Token Alignment

Authors: Chong Li, Jiajun Zhang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03523
Pdf URL: https://arxiv.org/pdf/2506.03523
Copy Paste: [[2506.03523]] TokAlign: Efficient Vocabulary Adaptation via Token Alignment(https://arxiv.org/abs/2506.03523)
Keywords: large language model
Abstract: Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer will slow down the training and generation of LLM. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs like token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of LLM from the token co-occurrences view, and further transfer the token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from 3.4$\text{e}^2$ of strong baseline methods to 1.2$\text{e}^2$ after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost (+4.4% than sentence-level distillation) the base model, costing only 235M tokens.

Title: Seed-Coder: Let the Code Model Curate Data for Itself

Authors: Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03524
Pdf URL: https://arxiv.org/pdf/2506.03524
Copy Paste: [[2506.03524]] Seed-Coder: Let the Code Model Curate Data for Itself(https://arxiv.org/abs/2506.03524)
Keywords: large language model
Abstract: Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

Title: Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting

Authors: Chengqi Li, Zhihao Shi, Yangdi Lu, Wenbo He, Xiangyu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03538
Pdf URL: https://arxiv.org/pdf/2506.03538
Copy Paste: [[2506.03538]] Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting(https://arxiv.org/abs/2506.03538)
Keywords: robust
Abstract: 3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts. In this work, we propose Asymmetric Dual 3DGS, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. Codes and trained models will be released.

Title: Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

Authors: Xiaofeng Zhou, Heyan Huang, Lizi Liao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03541
Pdf URL: https://arxiv.org/pdf/2506.03541
Copy Paste: [[2506.03541]] Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement(https://arxiv.org/abs/2506.03541)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques--such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection--struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree-structured Direct Preference Optimization (T-DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.

Title: Learning Monotonic Probabilities with a Generative Cost Model

Authors: Yongxiang Tang, Yanhua Cheng, Xiaocheng Liu, Chenchen Jiao, Yanxiang Zeng, Ning Luo, Pengjia Yuan, Xialong Liu, Peng Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03542
Pdf URL: https://arxiv.org/pdf/2506.03542
Copy Paste: [[2506.03542]] Learning Monotonic Probabilities with a Generative Cost Model(https://arxiv.org/abs/2506.03542)
Keywords: generative
Abstract: In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques, whereas this paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, termed the Generative Cost Model (GCM), which inherently addresses the strict monotonic problem, and propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at this https URL.

Title: A Threat Intelligence Event Extraction Conceptual Model for Cyber Threat Intelligence Feeds

Authors: Jamal H. Al-Yasiri, Mohamad Fadli Bin Zolkipli, Nik Fatinah N Mohd Farid, Mohammed Alsamman, Zainab Ali Mohammed
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.03551
Pdf URL: https://arxiv.org/pdf/2506.03551
Copy Paste: [[2506.03551]] A Threat Intelligence Event Extraction Conceptual Model for Cyber Threat Intelligence Feeds(https://arxiv.org/abs/2506.03551)
Keywords: security, robust, extraction
Abstract: In response to the escalating cyber threats, the efficiency of Cyber Threat Intelligence (CTI) data collection has become paramount in ensuring robust cybersecurity. However, existing works encounter significant challenges in preprocessing large volumes of multilingual threat data, leading to inefficiencies in real-time threat analysis. This paper presents a systematic review of current techniques aimed at enhancing CTI data collection efficiency. Additionally, it proposes a conceptual model to further advance the effectiveness of threat intelligence feeds. Following the PRISMA guidelines, the review examines relevant studies from the Scopus database, highlighting the critical role of artificial intelligence (AI) and machine learning models in optimizing CTI data preprocessing. The findings underscore the importance of AI-driven methods, particularly supervised and unsupervised learning, in significantly improving the accuracy of threat detection and event extraction, thereby strengthening cybersecurity. Furthermore, the study identifies a gap in the existing research and introduces XBC conceptual model integrating XLM-RoBERTa, BiGRU, and CRF, specifically developed to address this gap. This paper contributes conceptually to the field by providing a detailed analysis of current CTI data collection techniques and introducing an innovative conceptual model to enhance future threat intelligence capabilities.

Title: WIFE-Fusion:Wavelet-aware Intra-inter Frequency Enhancement for Multi-model Image Fusion

Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03555
Pdf URL: https://arxiv.org/pdf/2506.03555
Copy Paste: [[2506.03555]] WIFE-Fusion:Wavelet-aware Intra-inter Frequency Enhancement for Multi-model Image Fusion(https://arxiv.org/abs/2506.03555)
Keywords: extraction
Abstract: Multimodal image fusion effectively aggregates information from diverse modalities, with fused images playing a crucial role in vision systems. However, existing methods often neglect frequency-domain feature exploration and interactive relationships. In this paper, we propose wavelet-aware Intra-inter Frequency Enhancement Fusion (WIFE-Fusion), a multimodal image fusion framework based on frequency-domain components interactions. Its core innovations include: Intra-Frequency Self-Attention (IFSA) that leverages inherent cross-modal correlations and complementarity through interactive self-attention mechanisms to extract enriched frequency-domain features, and Inter-Frequency Interaction (IFI) that enhances enriched features and filters latent features via combinatorial interactions between heterogeneous frequency-domain components across modalities. These processes achieve precise source feature extraction and unified modeling of feature extraction-aggregation. Extensive experiments on five datasets across three multimodal fusion tasks demonstrate WIFE-Fusion's superiority over current specialized and unified fusion methods. Our code is available at this https URL.

Title: BPO: Revisiting Preference Modeling in Direct Preference Optimization

Authors: Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03557
Pdf URL: https://arxiv.org/pdf/2506.03557
Copy Paste: [[2506.03557]] BPO: Revisiting Preference Modeling in Direct Preference Optimization(https://arxiv.org/abs/2506.03557)
Keywords: large language model
Abstract: Direct Preference Optimization (DPO) have emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR).To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO's DCR issue, without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.

Title: ConsistentChat: Building Skeleton-Guided Consistent Dialogues for Large Language Models from Scratch

Authors: Jiawei Chen, Xinyan Guan, Qianhao Yuan, Guozhao Mo, Weixiang Zhou, Yaojie Lu, Hongyu Lin, Ben He, Le Sun, Xianpei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03558
Pdf URL: https://arxiv.org/pdf/2506.03558
Copy Paste: [[2506.03558]] ConsistentChat: Building Skeleton-Guided Consistent Dialogues for Large Language Models from Scratch(https://arxiv.org/abs/2506.03558)
Keywords: large language model
Abstract: Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.

Title: POSS: Position Specialist Generates Better Draft for Speculative Decoding

Authors: Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03566
Pdf URL: https://arxiv.org/pdf/2506.03566
Copy Paste: [[2506.03566]] POSS: Position Specialist Generates Better Draft for Speculative Decoding(https://arxiv.org/abs/2506.03566)
Keywords: large language model
Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at this https URL.

Title: FreePRM: Training Process Reward Models Without Ground Truth Process Labels

Authors: Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, Ning Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03570
Pdf URL: https://arxiv.org/pdf/2506.03570
Copy Paste: [[2506.03570]] FreePRM: Training Process Reward Models Without Ground Truth Process Labels(https://arxiv.org/abs/2506.03570)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of final outcome, and then employs Buffer Probability to eliminate impact of noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms upon RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.

Title: Exchange of Perspective Prompting Enhances Reasoning in Large Language Models

Authors: Lin Sun, Can Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03573
Pdf URL: https://arxiv.org/pdf/2506.03573
Copy Paste: [[2506.03573]] Exchange of Perspective Prompting Enhances Reasoning in Large Language Models(https://arxiv.org/abs/2506.03573)
Keywords: large language model
Abstract: Large language models (LLMs) have made significant advancements in addressing diverse natural language processing (NLP) tasks. However, their performance is often limited by inherent comprehension of problems. To address this limitation, we propose Exchange-of-Perspective (EoP), a novel framework designed to exchange perspectives across different definitions of problem, so that it can break the fixed mindset from any particular formulation of the question. We conducted extensive and comprehensive experiments on 8 benchmarks. The results show that EoP can significantly improve performance. For instance, compared to the non-commutative baseline PHP, with GPT-3.5-Turbo and EoP, we observe a 3.6% improvement on AQuA (60.6% to 64.2%), while GPT-4-powered EoP demonstrates a 7.7% overall accuracy enhancement on Math (53.9% to 61.6%) and a 3.5% improvement on OlympiadBench Maths (43.5% to 47.0%) when using Qwen-2.5-72b.

Title: KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models

Authors: Zirui Chen, Xin Wang, Zhao Li, Wenbin Guo, Dongxiao He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03576
Pdf URL: https://arxiv.org/pdf/2506.03576
Copy Paste: [[2506.03576]] KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models(https://arxiv.org/abs/2506.03576)
Keywords: transformer, generative
Abstract: Recent advances in knowledge representation learning (KRL) highlight the urgent necessity to unify symbolic knowledge graphs (KGs) with language models (LMs) for richer semantic understanding. However, existing approaches typically prioritize either graph structure or textual semantics, leaving a gap: a unified framework that simultaneously captures global KG connectivity, nuanced linguistic context, and discriminative reasoning semantics. To bridge this gap, we introduce KG-BiLM, a bidirectional LM framework that fuses structural cues from KGs with the semantic expressiveness of generative transformers. KG-BiLM incorporates three key components: (i) Bidirectional Knowledge Attention, which removes the causal mask to enable full interaction among all tokens and entities; (ii) Knowledge-Masked Prediction, which encourages the model to leverage both local semantic contexts and global graph connectivity; and (iii) Contrastive Graph Semantic Aggregation, which preserves KG structure via contrastive alignment of sampled sub-graph representations. Extensive experiments on standard benchmarks demonstrate that KG-BiLM outperforms strong baselines in link prediction, especially on large-scale graphs with complex multi-hop relations - validating its effectiveness in unifying structural information and textual semantics.

Title: Automatically Suggesting Diverse Example Sentences for L2 Japanese Learners Using Pre-Trained Language Models

Authors: Enrico Benedetti, Akiko Aizawa, Florian Boudin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03580
Pdf URL: https://arxiv.org/pdf/2506.03580
Copy Paste: [[2506.03580]] Automatically Suggesting Diverse Example Sentences for L2 Japanese Learners Using Pre-Trained Language Models(https://arxiv.org/abs/2506.03580)
Keywords: generative
Abstract: Providing example sentences that are diverse and aligned with learners' proficiency levels is essential for fostering effective language acquisition. This study examines the use of Pre-trained Language Models (PLMs) to produce example sentences targeting L2 Japanese learners. We utilize PLMs in two ways: as quality scoring components in a retrieval system that draws from a newly curated corpus of Japanese sentences, and as direct sentence generators using zero-shot learning. We evaluate the quality of sentences by considering multiple aspects such as difficulty, diversity, and naturalness, with a panel of raters consisting of learners of Japanese, native speakers -- and GPT-4. Our findings suggest that there is inherent disagreement among participants on the ratings of sentence qualities, except for difficulty. Despite that, the retrieval approach was preferred by all evaluators, especially for beginner and advanced target proficiency, while the generative approaches received lower scores on average. Even so, our experiments highlight the potential for using PLMs to enhance the adaptability of sentence suggestion systems and therefore improve the language learning journey.

Title: ViTSGMM: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels

Authors: Rui Yann, Xianglei Xing
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03582
Pdf URL: https://arxiv.org/pdf/2506.03582
Copy Paste: [[2506.03582]] ViTSGMM: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels(https://arxiv.org/abs/2506.03582)
Keywords: robust
Abstract: We present ViTSGMM, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, while their generalization ability when dealing with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification decision mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on STL-10 and CIFAR-10/100 datasets when using negligible labeled samples. Notably, this paper also reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning tasks and removes duplicates to ensure the reliability of experimental results. Code available at this https URL.

Title: A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark

Authors: Zhigang Yang, Huiguang Yao, Linmao Tian, Xuezhi Zhao, Qiang Li, Qi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03583
Pdf URL: https://arxiv.org/pdf/2506.03583
Copy Paste: [[2506.03583]] A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark(https://arxiv.org/abs/2506.03583)
Keywords: segmentation
Abstract: Referring Remote Sensing Image Segmentation is a complex and challenging task that integrates the paradigms of computer vision and natural language processing. Existing datasets for RRSIS suffer from critical limitations in resolution, scene diversity, and category coverage, which hinders the generalization and real-world applicability of refer segmentation models. To facilitate the development of this field, we introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets supporting single-object, multi-object, and non-object segmentation scenarios. Additionally, we propose the Multi-scale Referring Segmentation Network (MRSNet), a novel framework tailored for the unique demands of RRSIS. MRSNet introduces two key innovations: (1) an Intra-scale Feature Interaction Module (IFIM) that captures fine-grained details within each encoder stage, and (2) a Hierarchical Feature Interaction Module (HFIM) to enable seamless cross-scale feature fusion, preserving spatial integrity while enhancing discriminative power. Extensive experiments conducte on the proposed NWPU-Refer dataset demonstrate that MRSNet achieves state-of-the-art performance across multiple evaluation metrics, validating its effectiveness. The dataset and code are publicly available at this https URL.

Title: A Class Inference Scheme With Dempster-Shafer Theory for Learning Fuzzy-Classifier Systems

Authors: Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2506.03588
Pdf URL: https://arxiv.org/pdf/2506.03588
Copy Paste: [[2506.03588]] A Class Inference Scheme With Dempster-Shafer Theory for Learning Fuzzy-Classifier Systems(https://arxiv.org/abs/2506.03588)
Keywords: robust, interpretability
Abstract: The decision-making process significantly influences the predictions of machine learning models. This is especially important in rule-based systems such as Learning Fuzzy-Classifier Systems (LFCSs) where the selection and application of rules directly determine prediction accuracy and reliability. LFCSs combine evolutionary algorithms with supervised learning to optimize fuzzy classification rules, offering enhanced interpretability and robustness. Despite these advantages, research on improving decision-making mechanisms (i.e., class inference schemes) in LFCSs remains limited. Most LFCSs use voting-based or single-winner-based inference schemes. These schemes rely on classification performance on training data and may not perform well on unseen data, risking overfitting. To address these limitations, this article introduces a novel class inference scheme for LFCSs based on the Dempster-Shafer Theory of Evidence (DS theory). The proposed scheme handles uncertainty well. By using the DS theory, the scheme calculates belief masses (i.e., measures of belief) for each specific class and the ``I don't know'' state from each fuzzy rule and infers a class from these belief masses. Unlike the conventional schemes, the proposed scheme also considers the ``I don't know'' state that reflects uncertainty, thereby improving the transparency and reliability of LFCSs. Applied to a variant of LFCS (i.e., Fuzzy-UCS), the proposed scheme demonstrates statistically significant improvements in terms of test macro F1 scores across 30 real-world datasets compared to conventional voting-based and single-winner-based fuzzy inference schemes. It forms smoother decision boundaries, provides reliable confidence measures, and enhances the robustness and generalizability of LFCSs in real-world applications. Our implementation is available at this https URL.

Title: Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

Authors: Jiaxing Zhang, Xinyi Zeng, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03591
Pdf URL: https://arxiv.org/pdf/2506.03591
Copy Paste: [[2506.03591]] Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts(https://arxiv.org/abs/2506.03591)
Keywords: transformer, large language model
Abstract: Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of AR to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.

Title: From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models

Authors: Viktor Hangya, Fabian Küch, Darina Gold
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03592
Pdf URL: https://arxiv.org/pdf/2506.03592
Copy Paste: [[2506.03592]] From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models(https://arxiv.org/abs/2506.03592)
Keywords: generative
Abstract: Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. We plan to publish our benchmark adaptions.

Title: Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments

Authors: Zetong Tang, Qian Ma, Di Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03598
Pdf URL: https://arxiv.org/pdf/2506.03598
Copy Paste: [[2506.03598]] Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments(https://arxiv.org/abs/2506.03598)
Keywords: large language model
Abstract: Using the best Text-to-SQL methods in resource-constrained environments is challenging due to their reliance on resource-intensive open-source models. This paper introduces Auto Prompt SQL(AP-SQL), a novel architecture designed to bridge the gap between resource-efficient small open-source models and the powerful capabilities of large closed-source models for Text-to-SQL translation. Our method decomposes the task into schema filtering, retrieval-augmented text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation. To improve schema selection accuracy, we fine-tune large language models. Crucially, we also explore the impact of prompt engineering throughout the process, leveraging Chain-of-Thought(CoT) and Graph-of-Thought(GoT) templates to significantly enhance the model's reasoning for accurate SQL generation. Comprehensive evaluations on the Spider benchmarks demonstrate the effectiveness of AP-SQL.

Title: Adapting Rule Representation With Four-Parameter Beta Distribution for Learning Classifier Systems

Authors: Hiroki Shiraishi, Yohei Hayamizu, Tomonori Hashiyama, Keiki Takadama, Hisao Ishibuchi, Masaya Nakata
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2506.03602
Pdf URL: https://arxiv.org/pdf/2506.03602
Copy Paste: [[2506.03602]] Adapting Rule Representation With Four-Parameter Beta Distribution for Learning Classifier Systems(https://arxiv.org/abs/2506.03602)
Keywords: interpretability
Abstract: Rule representations significantly influence the search capabilities and decision boundaries within the search space of Learning Classifier Systems (LCSs), a family of rule-based machine learning systems that evolve interpretable models through evolutionary processes. However, it is very difficult to choose an appropriate rule representation for each problem. Additionally, some problems benefit from using different representations for different subspaces within the input space. Thus, an adaptive mechanism is needed to choose an appropriate rule representation for each rule in LCSs. This article introduces a flexible rule representation using a four-parameter beta distribution and integrates it into a fuzzy-style LCS. The four-parameter beta distribution can form various function shapes, and this flexibility enables our LCS to automatically select appropriate representations for different subspaces. Our rule representation can represent crisp/fuzzy decision boundaries in various boundary shapes, such as rectangles and bells, by controlling four parameters, compared to the standard representations such as trapezoidal ones. Leveraging this flexibility, our LCS is designed to adapt the appropriate rule representation for each subspace. Moreover, our LCS incorporates a generalization bias favoring crisp rules where feasible, enhancing model interpretability without compromising accuracy. Experimental results on real-world classification tasks show that our LCS achieves significantly superior test accuracy and produces more compact rule sets. Our implementation is available at this https URL. An extended abstract related to this work is available at this https URL.

Title: Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI

Authors: Wing Man Casca Kwok, Yip Chiu Tung, Kunal Bhagchandani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03607
Pdf URL: https://arxiv.org/pdf/2506.03607
Copy Paste: [[2506.03607]] Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI(https://arxiv.org/abs/2506.03607)
Keywords: transformer
Abstract: Edge computing decentralizes processing power to network edge, enabling real-time AI-driven decision-making in IoT applications. In industrial automation such as robotics and rugged edge AI, real-time perception and intelligence are critical for autonomous operations. Deploying transformer-based image captioning models at the edge can enhance machine perception, improve scene understanding for autonomous robots, and aid in industrial inspection. However, these edge or IoT devices are often constrained in computational resources for physical agility, yet they have strict response time requirements. Traditional deep learning models can be too large and computationally demanding for these devices. In this research, we present findings of transformer-based models for image captioning that operate effectively on edge devices. By evaluating resource-effective transformer models and applying knowledge distillation techniques, we demonstrate inference can be accelerated on resource-constrained devices while maintaining model performance using these techniques.

Title: VLMs Can Aggregate Scattered Training Patches

Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03614
Pdf URL: https://arxiv.org/pdf/2506.03614
Copy Paste: [[2506.03614]] VLMs Can Aggregate Scattered Training Patches(https://arxiv.org/abs/2506.03614)
Keywords: attack
Abstract: One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the descriptions "safe," VLMs may later describe, the full image or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text reference. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like ``safe'' or ``unsafe'', demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at this https URL.

Title: Isharah: A Large-Scale Multi-Scene Dataset for Continuous Sign Language Recognition

Authors: Sarah Alyami, Hamzah Luqman, Sadam Al-Azani, Maad Alowaifeer, Yazeed Alharbi, Yaser Alonaizan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03615
Pdf URL: https://arxiv.org/pdf/2506.03615
Copy Paste: [[2506.03615]] Isharah: A Large-Scale Multi-Scene Dataset for Continuous Sign Language Recognition(https://arxiv.org/abs/2506.03615)
Keywords: robust
Abstract: Current benchmarks for sign language recognition (SLR) focus mainly on isolated SLR, while there are limited datasets for continuous SLR (CSLR), which recognizes sequences of signs in a video. Additionally, existing CSLR datasets are collected in controlled settings, which restricts their effectiveness in building robust real-world CSLR systems. To address these limitations, we present Isharah, a large multi-scene dataset for CSLR. It is the first dataset of its type and size that has been collected in an unconstrained environment using signers' smartphone cameras. This setup resulted in high variations of recording settings, camera distances, angles, and resolutions. This variation helps with developing sign language understanding models capable of handling the variability and complexity of real-world scenarios. The dataset consists of 30,000 video clips performed by 18 deaf and professional signers. Additionally, the dataset is linguistically rich as it provides a gloss-level annotation for all dataset's videos, making it useful for developing CSLR and sign language translation (SLT) systems. This paper also introduces multiple sign language understanding benchmarks, including signer-independent and unseen-sentence CSLR, along with gloss-based and gloss-free SLT. The Isharah dataset is available on this https URL.

Title: Learning to Insert [PAUSE] Tokens for Better Reasoning

Authors: Eunki Kim, Sangryul Kim, James Thorne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03616
Pdf URL: https://arxiv.org/pdf/2506.03616
Copy Paste: [[2506.03616]] Learning to Insert [PAUSE] Tokens for Better Reasoning(https://arxiv.org/abs/2506.03616)
Keywords: transformer, large language model
Abstract: To enhance reasoning capabilities, previous works have explored incorporating special-purpose tokens into the training process. These strategies strengthen the learning mechanism of transformer-based large language models (LLMs). Building on prior research, in which inserting dummy tokens consecutively just before reasoning steps can enhance effectiveness, we introduce a novel approach termed Dynamic Inserting Tokens Training (DIT). Our method identifies positions within sequences where model confidence is lowest according to token log-likelihood. Strategically inserting [PAUSE] tokens on these positions bolsters the model's predictive capabilities for subsequent tokens. Experimental results across diverse datasets and models, from the 2.7B model to the 8B model, demonstrate that DIT consistently outperforms traditional fine-tuning and previous token insertion methods. With this simple yet effective method, we achieve accuracy gains of up to 4.7%p on GSM8K, 3.23%p on AQUA-RAT, and pass@1 improvements of up to 3.4%p on MBPP datasets. Our work shows a model-based, dynamic approach rather than a heuristic one, thereby broadening the scope of research in reasoning.

Title: GCFL: A Gradient Correction-based Federated Learning Framework for Privacy-preserving CPSS

Authors: Jiayi Wan, Xiang Zhu, Fanzhen Liu, Wei Fan, Xiaolong Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03618
Pdf URL: https://arxiv.org/pdf/2506.03618
Copy Paste: [[2506.03618]] GCFL: A Gradient Correction-based Federated Learning Framework for Privacy-preserving CPSS(https://arxiv.org/abs/2506.03618)
Keywords: privacy, federate
Abstract: Federated learning, as a distributed architecture, shows great promise for applications in Cyber-Physical-Social Systems (CPSS). In order to mitigate the privacy risks inherent in CPSS, the integration of differential privacy with federated learning has attracted considerable attention. Existing research mainly focuses on dynamically adjusting the noise added or discarding certain gradients to mitigate the noise introduced by differential privacy. However, these approaches fail to remove the noise that hinders convergence and correct the gradients affected by the noise, which significantly reduces the accuracy of model classification. To overcome these challenges, this paper proposes a novel framework for differentially private federated learning that balances rigorous privacy guarantees with accuracy by introducing a server-side gradient correction mechanism. Specifically, after clients perform gradient clipping and noise perturbation, our framework detects deviations in the noisy local gradients and employs a projection mechanism to correct them, mitigating the negative impact of noise. Simultaneously, gradient projection promotes the alignment of gradients from different clients and guides the model towards convergence to a global optimum. We evaluate our framework on several benchmark datasets, and the experimental results demonstrate that it achieves state-of-the-art performance under the same privacy budget.

Title: Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales

Authors: Ayuto Tsutsumi, Yuu Jinnai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03619
Pdf URL: https://arxiv.org/pdf/2506.03619
Copy Paste: [[2506.03619]] Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales(https://arxiv.org/abs/2506.03619)
Keywords: large language model
Abstract: Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at this https URL ILab/YokaiEval.

Title: Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation

Authors: Chaehun Shin, Jooyoung Choi, Johan Barthelemy, Jungbeom Lee, Sungroh Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03621
Pdf URL: https://arxiv.org/pdf/2506.03621
Copy Paste: [[2506.03621]] Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation(https://arxiv.org/abs/2506.03621)
Keywords: diffusion
Abstract: We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Beyond supervised fine-tuning methods that rely only on positive targets and use the diffusion loss as in the pre-training stage, SFO introduces synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically generates distinctive and informative negatives by intentionally degrading visual and textual cues without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus finetuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: this https URL

Title: Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Authors: Lin Mu, Guowei Chu, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03627
Pdf URL: https://arxiv.org/pdf/2506.03627
Copy Paste: [[2506.03627]] Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks(https://arxiv.org/abs/2506.03627)
Keywords: attack, robust, large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can substantially degrade their performance. Despite advances in prompting techniques, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy specifically designed to enhance the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are then used to construct prompts that automatically correct input errors. In the Guidance stage, RoP generates an optimal guidance prompting based on the corrected input, steering the model toward more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Notably, it maintains model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.

Title: RewardAnything: Generalizable Principle-Following Reward Models

Authors: Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03637
Pdf URL: https://arxiv.org/pdf/2506.03637
Copy Paste: [[2506.03637]] RewardAnything: Generalizable Principle-Following Reward Models(https://arxiv.org/abs/2506.03637)
Keywords: large language model
Abstract: Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs-from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often producing biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything in traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods and we show by a case study on how to automatically and efficiently align LLMs with only natural language principles.

Title: Images are Worth Variable Length of Representations

Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03643
Pdf URL: https://arxiv.org/pdf/2506.03643
Copy Paste: [[2506.03643]] Images are Worth Variable Length of Representations(https://arxiv.org/abs/2506.03643)
Keywords: extraction
Abstract: Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at this https URL.

Title: YOND: Practical Blind Raw Image Denoising Free from Camera-Specific Data Dependency

Authors: Hansen Feng, Lizhi Wang, Yiqi Huang, Tong Li, Lin Zhu, Hua Huang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.03645
Pdf URL: https://arxiv.org/pdf/2506.03645
Copy Paste: [[2506.03645]] YOND: Practical Blind Raw Image Denoising Free from Camera-Specific Data Dependency(https://arxiv.org/abs/2506.03645)
Keywords: robust
Abstract: The rapid advancement of photography has created a growing demand for a practical blind raw image denoising method. Recently, learning-based methods have become mainstream due to their excellent performance. However, most existing learning-based methods suffer from camera-specific data dependency, resulting in performance drops when applied to data from unknown cameras. To address this challenge, we introduce a novel blind raw image denoising method named YOND, which represents You Only Need a Denoiser. Trained solely on synthetic data, YOND can generalize robustly to noisy raw images captured by diverse unknown cameras. Specifically, we propose three key modules to guarantee the practicality of YOND: coarse-to-fine noise estimation (CNE), expectation-matched variance-stabilizing transform (EM-VST), and SNR-guided denoiser (SNR-Net). Firstly, we propose CNE to identify the camera noise characteristic, refining the estimated noise parameters based on the coarse denoised image. Secondly, we propose EM-VST to eliminate camera-specific data dependency, correcting the bias expectation of VST according to the noisy image. Finally, we propose SNR-Net to offer controllable raw image denoising, supporting adaptive adjustments and manual fine-tuning. Extensive experiments on unknown cameras, along with flexible solutions for challenging cases, demonstrate the superior practicality of our method. The source code will be publicly available at the \href{this https URL}{project homepage}.

Title: Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond

Authors: Zeyu Gao, Junlin Zhou, Bolun Zhang, Yi He, Chao Zhang, Yuxin Cui, Hao Wang
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.03651
Pdf URL: https://arxiv.org/pdf/2506.03651
Copy Paste: [[2506.03651]] Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond(https://arxiv.org/abs/2506.03651)
Keywords: security
Abstract: The quantity and quality of vulnerability datasets are essential for developing deep learning solutions to vulnerability-related tasks. Due to the limited availability of vulnerabilities, a common approach to building such datasets is analyzing security patches in source code. However, existing security patches often suffer from inaccurate labels, insufficient contextual information, and undecidable patches that fail to clearly represent the root causes of vulnerabilities or their fixes. These issues introduce noise into the dataset, which can mislead detection models and undermine their effectiveness. To address these issues, we present mono, a novel LLM-powered framework that simulates human experts' reasoning process to construct reliable vulnerability datasets. mono introduces three key components to improve security patch datasets: (i) semantic-aware patch classification for precise vulnerability labeling, (ii) iterative contextual analysis for comprehensive code understanding, and (iii) systematic root cause analysis to identify and filter undecidable patches. Our comprehensive evaluation on the MegaVul benchmark demonstrates that mono can correct 31.0% of labeling errors, recover 89% of inter-procedural vulnerabilities, and reveals that 16.7% of CVEs contain undecidable patches. Furthermore, mono's enriched context representation improves existing models' vulnerability detection accuracy by 15%. We open source the framework mono and the dataset MonoLens in this https URL.

Title: EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation

Authors: Cheng Zhang, Hongxia xie, Bin Wen, Songhan Zuo, Ruoxuan Zhang, Wen-huang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03652
Pdf URL: https://arxiv.org/pdf/2506.03652
Copy Paste: [[2506.03652]] EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation(https://arxiv.org/abs/2506.03652)
Keywords: diffusion
Abstract: With the rapid advancement of diffusion models, text-to-image generation has achieved significant progress in image resolution, detail fidelity, and semantic alignment, particularly with models like Stable Diffusion 3.5, Stable Diffusion XL, and FLUX 1. However, generating emotionally expressive and abstract artistic images remains a major challenge, largely due to the lack of large-scale, fine-grained emotional datasets. To address this gap, we present the EmoArt Dataset -- one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles (e.g., Impressionism, Expressionism, Abstract Art), offering rich stylistic and cultural diversity. Each image includes structured annotations: objective scene descriptions, five key visual attributes (brushwork, composition, color, line, light), binary arousal-valence labels, twelve emotion categories, and potential art therapy effects. Using EmoArt, we systematically evaluate popular text-to-image diffusion models for their ability to generate emotionally aligned images from text. Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. The dataset and more details can be accessed via our project website.

Title: MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

Authors: Xiaochun Lei, Siqi Wu, Weilin Wu, Zetao Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03654
Pdf URL: https://arxiv.org/pdf/2506.03654
Copy Paste: [[2506.03654]] MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection(https://arxiv.org/abs/2506.03654)
Keywords: transformer
Abstract: Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6\% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.

Title: Client-Side Zero-Shot LLM Inference for Comprehensive In-Browser URL Analysis

Authors: Avihay Cohen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.03656
Pdf URL: https://arxiv.org/pdf/2506.03656
Copy Paste: [[2506.03656]] Client-Side Zero-Shot LLM Inference for Comprehensive In-Browser URL Analysis(https://arxiv.org/abs/2506.03656)
Keywords: security, privacy, attack, large language model
Abstract: Malicious websites and phishing URLs pose an ever-increasing cybersecurity risk, with phishing attacks growing by 40% in a single year. Traditional detection approaches rely on machine learning classifiers or rule-based scanners operating in the cloud, but these face significant challenges in generalization, privacy, and evasion by sophisticated threats. In this paper, we propose a novel client-side framework for comprehensive URL analysis that leverages zero-shot inference by a local large language model (LLM) running entirely in-browser. Our system uses a compact LLM (e.g., 3B/8B parameters) via WebLLM to perform reasoning over rich context collected from the target webpage, including static code analysis (JavaScript abstract syntax trees, structure, and code patterns), dynamic sandbox execution results (DOM changes, API calls, and network requests),and visible content. We detail the architecture and methodology of the system, which combines a real browser sandbox (using iframes) resistant to common anti-analysis techniques, with an LLM-based analyzer that assesses potential vulnerabilities and malicious behaviors without any task-specific training (zero-shot). The LLM aggregates evidence from multiple sources (code, execution trace, page content) to classify the URL as benign or malicious and to provide an explanation of the threats or security issues identified. We evaluate our approach on a diverse set of benign and malicious URLs, demonstrating that even a compact client-side model can achieve high detection accuracy and insightful explanations comparable to cloud-based solutions, while operating privately on end-user devices. The results show that client-side LLM inference is a feasible and effective solution to web threat analysis, eliminating the need to send potentially sensitive data to cloud services.

Title: Trustworthy Medical Question Answering: An Evaluation-Centric Survey

Authors: Yinuo Wang, Robert E. Mercer, Frank Rudzicz, Sudipta Singha Roy, Pengjie Ren, Zhumin Chen, Xindi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03659
Pdf URL: https://arxiv.org/pdf/2506.03659
Copy Paste: [[2506.03659]] Trustworthy Medical Question Answering: An Evaluation-Centric Survey(https://arxiv.org/abs/2506.03659)
Keywords: robust, fair, explainability, large language model
Abstract: Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges-such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies-and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.

Title: BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation

Authors: Jialei Chen, Xu Zheng, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03675
Pdf URL: https://arxiv.org/pdf/2506.03675
Copy Paste: [[2506.03675]] BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation(https://arxiv.org/abs/2506.03675)
Keywords: robust, segmentation
Abstract: Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modalities, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over the prior arts.

Title: Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Authors: Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.03681
Pdf URL: https://arxiv.org/pdf/2506.03681
Copy Paste: [[2506.03681]] Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering(https://arxiv.org/abs/2506.03681)
Keywords: robust
Abstract: Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.

Title: PRJ: Perception-Retrieval-Judgement for Generated Images

Authors: Qiang Fu, Zonglei Jing, Zonghao Ying, Xiaoqian Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03683
Pdf URL: https://arxiv.org/pdf/2506.03683
Copy Paste: [[2506.03683]] PRJ: Perception-Retrieval-Judgement for Generated Images(https://arxiv.org/abs/2506.03683)
Keywords: attack, robust, interpretability, generative
Abstract: The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.

Title: DSSAU-Net:U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation

Authors: Zunhui Xia, Hongxing Li, Libin Lan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03684
Pdf URL: https://arxiv.org/pdf/2506.03684
Copy Paste: [[2506.03684]] DSSAU-Net:U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation(https://arxiv.org/abs/2506.03684)
Keywords: segmentation
Abstract: In the childbirth process, traditional methods involve invasive vaginal examinations, but research has shown that these methods are both subjective and inaccurate. Ultrasound-assisted diagnosis offers an objective yet effective way to assess fetal head position via two key parameters: Angle of Progression (AoP) and Head-Symphysis Distance (HSD), calculated by segmenting the fetal head (FH) and pubic symphysis (PS), which aids clinicians in ensuring a smooth delivery process. Therefore, accurate segmentation of FH and PS is crucial. In this work, we propose a sparse self-attention network architecture with good performance and high computational efficiency, named DSSAU-Net, for the segmentation of FH and PS. Specifically, we stack varying numbers of Dual Sparse Selection Attention (DSSA) blocks at each stage to form a symmetric U-shaped encoder-decoder network architecture. For a given query, DSSA is designed to explicitly perform one sparse token selection at both the region and pixel levels, respectively, which is beneficial for further reducing computational complexity while extracting the most relevant features. To compensate for the information loss during the upsampling process, skip connections with convolutions are designed. Additionally, multiscale feature fusion is employed to enrich the model's global and local information. The performance of DSSAU-Net has been validated using the Intrapartum Ultrasound Grand Challenge (IUGC) 2024 \textit{test set} provided by the organizer in the MICCAI IUGC 2024 competition\footnote{\href{this https URL\#learn\_the\_details}{this https URL\#learn\_the\_details}}, where we win the fourth place on the tasks of classification and segmentation, demonstrating its effectiveness. The codes will be available at this https URL.

Title: Robust Preference Optimization via Dynamic Target Margins

Authors: Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03690
Pdf URL: https://arxiv.org/pdf/2506.03690
Copy Paste: [[2506.03690]] Robust Preference Optimization via Dynamic Target Margins(https://arxiv.org/abs/2506.03690)
Keywords: robust, large language model
Abstract: The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose $\gamma$-PO, a dynamic target margin preference optimization algorithm that adjust reward margins at the pairwise level. By introducing instance-specific margin calibration, $\gamma$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $\gamma$-PO is a plug-and-play method, compatible with variants of DPO that rely on reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $\gamma$-PO achieves an average 4.4\% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, $\gamma$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLMs alignment. Our codes are available at \href{this https URL}{this https URL}.

Title: Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research

Authors: Yuanlin Mo, Haishan Huang, Bocheng Liang, Weibo Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03698
Pdf URL: https://arxiv.org/pdf/2506.03698
Copy Paste: [[2506.03698]] Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research(https://arxiv.org/abs/2506.03698)
Keywords: robust, generative
Abstract: Recent advancements in artificial intelligence (AI) have revolutionized cardiovascular medicine, particularly through integration with computed tomography (CT), magnetic resonance imaging (MRI), electrocardiography (ECG) and ultrasound (US). Deep learning architectures, including convolutional neural networks and generative adversarial networks, enable automated analysis of medical imaging and physiological signals, surpassing human capabilities in diagnostic accuracy and workflow efficiency. However, critical challenges persist, including the inability to validate input data accuracy, which may propagate diagnostic errors. This review highlights AI's transformative potential in precision diagnostics while underscoring the need for robust validation protocols to ensure clinical reliability. Future directions emphasize hybrid models integrating multimodal data and adaptive algorithms to refine personalized cardiovascular care.

Title: AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Authors: Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03700
Pdf URL: https://arxiv.org/pdf/2506.03700
Copy Paste: [[2506.03700]] AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism(https://arxiv.org/abs/2506.03700)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens can accurately be generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token's computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early predictions match the results of standard autoregressive decoding, preserving output parity. Experiments across diverse generation tasks shows that AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup, while guaranteeing output parity with standard autoregressive decoding.

Title: Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond

Authors: Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, Kun Chen
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, cond-mat.str-el, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.03703
Pdf URL: https://arxiv.org/pdf/2506.03703
Copy Paste: [[2506.03703]] Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond(https://arxiv.org/abs/2506.03703)
Keywords: extraction, large language model
Abstract: Fundamental physics often confronts complex symbolic problems with few guiding exemplars or established principles. While artificial intelligence (AI) offers promise, its typical need for vast datasets to learn from hinders its use in these information-scarce frontiers. We introduce learning at criticality (LaC), a reinforcement learning (RL) scheme that tunes Large Language Models (LLMs) to a sharp learning transition, addressing this information scarcity. At this transition, LLMs achieve peak generalization from minimal data, exemplified by 7-digit base-7 addition -- a test of nontrivial arithmetic reasoning. To elucidate this peak, we analyze a minimal concept-network model (CoNet) designed to capture the essence of how LLMs might link tokens. Trained on a single exemplar, this model also undergoes a sharp learning transition. This transition exhibits hallmarks of a second-order phase transition, notably power-law distributed solution path lengths. At this critical point, the system maximizes a ``critical thinking pattern" crucial for generalization, enabled by the underlying scale-free exploration. This suggests LLMs reach peak performance by operating at criticality, where such explorative dynamics enable the extraction of underlying operational rules. We demonstrate LaC in quantum field theory: an 8B-parameter LLM, tuned to its critical point by LaC using a few exemplars of symbolic Matsubara sums, solves unseen, higher-order problems, significantly outperforming far larger models. LaC thus leverages critical phenomena, a physical principle, to empower AI for complex, data-sparse challenges in fundamental physics.

Title: ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation

Authors: Pei-Yun Lin, Yen-lung Tsai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03704
Pdf URL: https://arxiv.org/pdf/2506.03704
Copy Paste: [[2506.03704]] ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation(https://arxiv.org/abs/2506.03704)
Keywords: large language model
Abstract: This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: this https URL.

Title: OV-COAST: Cost Aggregation with Optimal Transport for Open-Vocabulary Semantic Segmentation

Authors: Aditya Gandhamal, Aniruddh Sikdar, Suresh Sundaram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03706
Pdf URL: https://arxiv.org/pdf/2506.03706
Copy Paste: [[2506.03706]] OV-COAST: Cost Aggregation with Optimal Transport for Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2506.03706)
Keywords: segmentation
Abstract: Open-vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out-of-domain generalization, we propose Cost Aggregation with Optimal Transport (OV-COAST) for open-vocabulary semantic segmentation. To align visual-language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two-stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT-Seg model. We evaluate state-of-the-art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost-aggregation model CAT-Seg with ViT-B backbone, achieving superior results, surpassing CAT-Seg by 1.72 % and SAN-B by 4.9 % mIoU. The code is available at this https URL}{this https URL .

Title: AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives

Authors: Aniruddh Sikdar, Aditya Gandhamal, Suresh Sundaram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03709
Pdf URL: https://arxiv.org/pdf/2506.03709
Copy Paste: [[2506.03709]] AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives(https://arxiv.org/abs/2506.03709)
Keywords: robust, segmentation
Abstract: Open-vocabulary semantic segmentation (OVSS) involves assigning labels to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, they encounter significant challenges in cross-domain generalization, hindering their practical efficacy in real-world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities, and in this study, we present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial, and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.

Title: FSHNet: Fully Sparse Hybrid Network for 3D Object Detection

Authors: Shuai Liu, Mingyue Cui, Boyang Li, Quanmin Liang, Tinghe Hong, Kai Huang, Yunxiao Shan, Kai Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03714
Pdf URL: https://arxiv.org/pdf/2506.03714
Copy Paste: [[2506.03714]] FSHNet: Fully Sparse Hybrid Network for 3D Object Detection(https://arxiv.org/abs/2506.03714)
Keywords: extraction
Abstract: Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and causes the center feature missing. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code is available at this https URL.

Title: On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity

Authors: Quentin Bertrand, Anne Gagneux, Mathurin Massias, Rémi Emonet
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03719
Pdf URL: https://arxiv.org/pdf/2506.03719
Copy Paste: [[2506.03719]] On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity(https://arxiv.org/abs/2506.03719)
Keywords: diffusion, generative
Abstract: Modern deep generative models can now produce high-quality synthetic samples that are often indistinguishable from real training data. A growing body of research aims to understand why recent methods -- such as diffusion and flow matching techniques -- generalize so effectively. Among the proposed explanations are the inductive biases of deep learning architectures and the stochastic nature of the conditional flow matching loss. In this work, we rule out the latter -- the noisy nature of the loss -- as a primary contributor to generalization in flow matching. First, we empirically show that in high-dimensional settings, the stochastic and closed-form versions of the flow matching loss yield nearly equivalent losses. Then, using state-of-the-art flow matching models on standard image datasets, we demonstrate that both variants achieve comparable statistical performance, with the surprising observation that using the closed-form can even improve performance.

Title: Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision

Authors: Chaeyun Jang, Moonseok Choi, Yegon Kim, Hyungi Lee, Juho Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03723
Pdf URL: https://arxiv.org/pdf/2506.03723
Copy Paste: [[2506.03723]] Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision(https://arxiv.org/abs/2506.03723)
Keywords: interpretability, large language model
Abstract: Uncertainty calibration is essential for the safe deployment of large language models (LLMs), particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty. Experiments on GSM8K and held-out reasoning tasks such as MATH-500 and ARC-Challenge show that our confidence-aware fine-tuning improves both calibration and accuracy, while also enhancing interpretability by aligning the model's reasoning path with its confidence.

Title: Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization

Authors: Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2506.03725
Pdf URL: https://arxiv.org/pdf/2506.03725
Copy Paste: [[2506.03725]] Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization(https://arxiv.org/abs/2506.03725)
Keywords: large language model
Abstract: Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in the distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.

Title: ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

Authors: Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, Huaisong Zhang, Chun Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03737
Pdf URL: https://arxiv.org/pdf/2506.03737
Copy Paste: [[2506.03737]] ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices(https://arxiv.org/abs/2506.03737)
Keywords: robust, transformer
Abstract: The Transformer architecture has revolutionized various regions since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to lack of robustness and flexibility of position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model's capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at this https URL

Title: SAAT: Synergistic Alternating Aggregation Transformer for Image Super-Resolution

Authors: Jianfeng Wu, Nannan Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03740
Pdf URL: https://arxiv.org/pdf/2506.03740
Copy Paste: [[2506.03740]] SAAT: Synergistic Alternating Aggregation Transformer for Image Super-Resolution(https://arxiv.org/abs/2506.03740)
Keywords: transformer
Abstract: Single image super-resolution is a well-known downstream task which aims to restore low-resolution images into high-resolution images. At present, models based on Transformers have shone brightly in the field of super-resolution due to their ability to capture long-term dependencies in information. However, current methods typically compute self-attention in nonoverlapping windows to save computational costs, and the standard self-attention computation only focuses on its results, thereby neglecting the useful information across channels and the rich spatial structural information generated in the intermediate process. Channel attention and spatial attention have, respectively, brought significant improvements to various downstream visual tasks in terms of extracting feature dependency and spatial structure relationships, but the synergistic relationship between channel and spatial attention has not been fully explored this http URL address these issues, we propose a novel model. Synergistic Alternating Aggregation Transformer (SAAT), which can better utilize the potential information of features. In SAAT, we introduce the Efficient Channel & Window Synergistic Attention Group (CWSAG) and the Spatial & Window Synergistic Attention Group (SWSAG). On the one hand, CWSAG combines efficient channel attention with shifted window attention, enhancing non-local feature fusion, and producing more visually appealing results. On the other hand, SWSAG leverages spatial attention to capture rich structured feature information, thereby enabling SAAT to more effectively extract structural this http URL experimental results and ablation studies demonstrate the effectiveness of SAAT in the field of super-resolution. SAAT achieves performance comparable to that of the state-of-the-art (SOTA) under the same quantity of parameters.

Title: Dropout-Robust Mechanisms for Differentially Private and Fully Decentralized Mean Estimation

Authors: César Sabater, Sonia Ben Mokhtar, Jan Ramon
Subjects: cs.CR, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03746
Pdf URL: https://arxiv.org/pdf/2506.03746
Copy Paste: [[2506.03746]] Dropout-Robust Mechanisms for Differentially Private and Fully Decentralized Mean Estimation(https://arxiv.org/abs/2506.03746)
Keywords: privacy, robust
Abstract: Achieving differentially private computations in decentralized settings poses significant challenges, particularly regarding accuracy, communication cost, and robustness against information leakage. While cryptographic solutions offer promise, they often suffer from high communication overhead or require centralization in the presence of network failures. Conversely, existing fully decentralized approaches typically rely on relaxed adversarial models or pairwise noise cancellation, the latter suffering from substantial accuracy degradation if parties unexpectedly disconnect. In this work, we propose IncA, a new protocol for fully decentralized mean estimation, a widely used primitive in data-intensive processing. Our protocol, which enforces differential privacy, requires no central orchestration and employs low-variance correlated noise, achieved by incrementally injecting sensitive information into the computation. First, we theoretically demonstrate that, when no parties permanently disconnect, our protocol achieves accuracy comparable to that of a centralized setting-already an improvement over most existing decentralized differentially private techniques. Second, we empirically show that our use of low-variance correlated noise significantly mitigates the accuracy loss experienced by existing techniques in the presence of dropouts.

Title: Scaling CrossQ with Weight Normalization

Authors: Daniel Palenicek, Florian Vogt, Jan Peters
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03758
Pdf URL: https://arxiv.org/pdf/2506.03758
Copy Paste: [[2506.03758]] Scaling CrossQ with Weight Normalization(https://arxiv.org/abs/2506.03758)
Keywords: robust
Abstract: Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ's scaling behavior with higher UTD ratios. We identify challenges in the training dynamics which are emphasized by higher UTDs, particularly Q-bias explosion and the growing magnitude of critic network weights. To address this, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, prevents potential loss of plasticity and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive or superior performance across a range of challenging tasks on the DeepMind control benchmark, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.

Title: Act-as-Pet: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services

Authors: Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Zheyu Ye, Zhoujun Li, Zuozhu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03761
Pdf URL: https://arxiv.org/pdf/2506.03761
Copy Paste: [[2506.03761]] Act-as-Pet: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services(https://arxiv.org/abs/2506.03761)
Keywords: large language model
Abstract: As interest in using Large Language Models (LLMs) for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate complex pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.

Title: AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models

Authors: Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, Xiangmin Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03762
Pdf URL: https://arxiv.org/pdf/2506.03762
Copy Paste: [[2506.03762]] AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models(https://arxiv.org/abs/2506.03762)
Keywords: large language model
Abstract: Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only due to the large number of model parameters but also because the (Key-Value) KV cache consumes a lot of memory during inference. While several works propose reducing the KV cache by evicting the unnecessary tokens, these approaches rely on accumulated attention score as eviction score to quantify the importance of the token. We identify the accumulated attention score is biased and it decreases with the position of the tokens in the mathematical expectation. As a result, the retained tokens concentrate on the initial positions, limiting model's access to global contextual information. To address this issue, we propose Adaptive holistic attention KV (AhaKV), it addresses the bias of the accumulated attention score by adaptively tuning the scale of softmax according the expectation of information entropy of attention scores. To make use of the holistic attention information in self-attention mechanism, AhaKV utilize the information of value vectors, which is overlooked in previous works, to refine the adaptive score. We show theoretically that our method is well suited for bias reduction. We deployed AhaKV on different models with a fixed cache budget. Experiments show that AhaKV successfully mitigates bias and retains crucial tokens across global context and achieve state-of-the-art results against other related work on several benchmark tasks.

Title: ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

Authors: Quang Hieu Pham, Thuy Duong Nguyen, Tung Pham, Anh Tuan Luu, Dat Quoc Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03763
Pdf URL: https://arxiv.org/pdf/2506.03763
Copy Paste: [[2506.03763]] ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations(https://arxiv.org/abs/2506.03763)
Keywords: robust, large language model
Abstract: The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.

Title: Prediction Inconsistency Helps Achieve Generalizable Detection of Adversarial Examples

Authors: Sicong Han, Chenhao Lin, Zhengyu Zhao, Xiyuan Wang, Xinlei He, Qian Li, Cong Wang, Qian Wang, Chao Shen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.03765
Pdf URL: https://arxiv.org/pdf/2506.03765
Copy Paste: [[2506.03765]] Prediction Inconsistency Helps Achieve Generalizable Detection of Adversarial Examples(https://arxiv.org/abs/2506.03765)
Keywords: protect, attack
Abstract: Adversarial detection protects models from adversarial attacks by refusing suspicious test samples. However, current detection methods often suffer from weak generalization: their effectiveness tends to degrade significantly when applied to adversarially trained models rather than naturally trained ones, and they generally struggle to achieve consistent effectiveness across both white-box and black-box attack settings. In this work, we observe that an auxiliary model, differing from the primary model in training strategy or model architecture, tends to assign low confidence to the primary model's predictions on adversarial examples (AEs), while preserving high confidence on normal examples (NEs). Based on this discovery, we propose Prediction Inconsistency Detector (PID), a lightweight and generalizable detection framework to distinguish AEs from NEs by capturing the prediction inconsistency between the primal and auxiliary models. PID is compatible with both naturally and adversarially trained primal models and outperforms four detection methods across 3 white-box, 3 black-box, and 1 mixed adversarial attacks. Specifically, PID achieves average AUC scores of 99.29\% and 99.30\% on CIFAR-10 when the primal model is naturally and adversarially trained, respectively, and 98.31% and 96.81% on ImageNet under the same conditions, outperforming existing SOTAs by 4.70%$\sim$25.46%.

Title: FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning

Authors: Li Zhang, Zhongxuan Han, Chaochao chen, Xiaohua Feng, Jiaming Zhang, Yuyuan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03777
Pdf URL: https://arxiv.org/pdf/2506.03777
Copy Paste: [[2506.03777]] FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning(https://arxiv.org/abs/2506.03777)
Keywords: federate, fair
Abstract: With emerging application of Federated Learning (FL) in decision-making scenarios, it is imperative to regulate model fairness to prevent disparities across sensitive groups (e.g., female, male). Current research predominantly focuses on two concepts of group fairness within FL: Global Fairness (overall model disparity across all clients) and Local Fairness (the disparity within each client). However, the non-decomposable, non-differentiable nature of fairness criteria pose two fundamental, unresolved challenges for fair FL: (i) Harmonizing global and local fairness in multi-class classification; (ii) Enabling a controllable, optimal accuracy-fairness trade-off. To tackle the aforementioned challenges, we propose a novel controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints in multi-class case, yielding models with minimal performance decline while guaranteeing fairness. To effectively realize an adjustable, optimal accuracy-fairness balance, we derive specific characterizations of the Bayes-optimal fair classifiers for reformulating fair FL as personalized cost-sensitive learning problem for in-processing, and bi-level optimization for post-processing. Theoretically, we provide convergence and generalization guarantees for FedFACT to approach the near-optimal accuracy under given fairness levels. Extensive experiments on multiple datasets across various data heterogeneity demonstrate that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.

Title: Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models

Authors: Seungcheol Park, Jeongin Bae, Beomseok Kwon, Minjun Kim, Byeongwook Kim, Se Jung Kwon, U Kang, Dongsoo Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03781
Pdf URL: https://arxiv.org/pdf/2506.03781
Copy Paste: [[2506.03781]] Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models(https://arxiv.org/abs/2506.03781)
Keywords: large language model
Abstract: How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and non-uniform quantization levels of BCQ. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark.

Title: Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Authors: Isik Baran Sandan, Tu Anh Dinh, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03785
Pdf URL: https://arxiv.org/pdf/2506.03785
Copy Paste: [[2506.03785]] Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons(https://arxiv.org/abs/2506.03785)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown to be effective evaluators across various domains such as machine translations or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-asa Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.

Title: Attention-Only Transformers via Unrolled Subspace Denoising

Authors: Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03790
Pdf URL: https://arxiv.org/pdf/2506.03790
Copy Paste: [[2506.03790]] Attention-Only Transformers via Unrolled Subspace Denoising(https://arxiv.org/abs/2506.03790)
Keywords: transformer
Abstract: Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of \textit{only} self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations \textit{at a linear rate} with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.

Title: Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts

Authors: Sidharth Pulipaka, Sparsh Jain, Ashwin Sankar, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03793
Pdf URL: https://arxiv.org/pdf/2506.03793
Copy Paste: [[2506.03793]] Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts(https://arxiv.org/abs/2506.03793)
Keywords: robust, large language model
Abstract: Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks like translation, text to speech, summarization, etc. where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence practical value for low resource NLP pipelines at scale.

Title: ConText: Driving In-context Learning for Text Removal and Segmentation

Authors: Fei Zhang, Pei Zhang, Baosong Yang, Fei Huang, Yanfeng Wang, Ya Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03799
Pdf URL: https://arxiv.org/pdf/2506.03799
Copy Paste: [[2506.03799]] ConText: Driving In-context Learning for Text Removal and Segmentation(https://arxiv.org/abs/2506.03799)
Keywords: segmentation
Abstract: This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they turn to using a straightforward image-label compositor as the prompt and query input, and then masking the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at this https URL.

Title: Automatic Correction of Writing Anomalies in Hausa Texts

Authors: Ahmad Mustapha Wali, Sergiu Nisioi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03820
Pdf URL: https://arxiv.org/pdf/2506.03820
Copy Paste: [[2506.03820]] Automatic Correction of Writing Anomalies in Hausa Texts(https://arxiv.org/abs/2506.03820)
Keywords: robust, transformer
Abstract: Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct the anomalies by finetuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise, fine-tuned to mimic realistic writing errors. Moreover, we adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants for this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.

Title: CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Authors: Fabian Karl, Ansgar Scherp
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.03822
Pdf URL: https://arxiv.org/pdf/2506.03822
Copy Paste: [[2506.03822]] CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents(https://arxiv.org/abs/2506.03822)
Keywords: robust, extraction
Abstract: Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at this https URL.

Title: Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising

Authors: Zhenhui Liu, Chunyuan Yuan, Ming Pang, Zheng Fang, Li Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo, Jingping Shao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.03827
Pdf URL: https://arxiv.org/pdf/2506.03827
Copy Paste: [[2506.03827]] Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising(https://arxiv.org/abs/2506.03827)
Keywords: robust
Abstract: Retrieval systems primarily address the challenge of matching user queries with the most relevant advertisements, playing a crucial role in e-commerce search advertising. The diversity of user needs and expressions often produces massive long-tail queries that cannot be matched with merchant bidwords or product titles, which results in some advertisements not being recalled, ultimately harming user experience and search efficiency. Existing query rewriting research focuses on various methods such as query log mining, query-bidword vector matching, or generation-based rewriting. However, these methods often fail to simultaneously optimize the relevance and authenticity of the user's original query and rewrite and maximize the revenue potential of recalled ads. In this paper, we propose a Multi-objective aligned Bidword Generation Model (MoBGM), which is composed of a discriminator, generator, and preference alignment module, to address these challenges. To simultaneously improve the relevance and authenticity of the query and rewrite and maximize the platform revenue, we design a discriminator to optimize these key objectives. Using the feedback signal of the discriminator, we train a multi-objective aligned bidword generator that aims to maximize the combined effect of the three objectives. Extensive offline and online experiments show that our proposed algorithm significantly outperforms the state of the art. After deployment, the algorithm has created huge commercial value for the platform, further verifying its feasibility and robustness.

Title: Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning

Authors: Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03850
Pdf URL: https://arxiv.org/pdf/2506.03850
Copy Paste: [[2506.03850]] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning(https://arxiv.org/abs/2506.03850)
Keywords: robust
Abstract: Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.

Title: Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

Authors: Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03857
Pdf URL: https://arxiv.org/pdf/2506.03857
Copy Paste: [[2506.03857]] Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation(https://arxiv.org/abs/2506.03857)
Keywords: large language model
Abstract: Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at this https URL.

Title: PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

Authors: Qiuhan Han, Qian Wang, Atsushi Yoshikawa, Masayuki Yamamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03861
Pdf URL: https://arxiv.org/pdf/2506.03861
Copy Paste: [[2506.03861]] PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading(https://arxiv.org/abs/2506.03861)
Keywords: robust, large language model
Abstract: High-Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision-making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high-frequency, short-term trading. This paper introduces \textbf{PulseReddit}, a novel dataset that is the first to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics for short-term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)-based Multi-Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance-efficiency trade-offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.

Title: EuroGEST: Investigating gender stereotypes in multilingual language models

Authors: Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03867
Pdf URL: https://arxiv.org/pdf/2506.03867
Copy Paste: [[2506.03867]] EuroGEST: Investigating gender stereotypes in multilingual language models(https://arxiv.org/abs/2506.03867)
Keywords: fair, large language model
Abstract: Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are \textit{beautiful,} \textit{empathetic} and \textit{neat} and men are \textit{leaders}, \textit{strong, tough} and \textit{professional}. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.

Title: Evaluating Apple Intelligence's Writing Tools for Privacy Against Large Language Model-Based Inference Attacks: Insights from Early Datasets

Authors: Mohd. Farhan Israk Soumik, Syed Mhamudul Hasan, Abdur R. Shahid
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.03870
Pdf URL: https://arxiv.org/pdf/2506.03870
Copy Paste: [[2506.03870]] Evaluating Apple Intelligence's Writing Tools for Privacy Against Large Language Model-Based Inference Attacks: Insights from Early Datasets(https://arxiv.org/abs/2506.03870)
Keywords: privacy, protect, attack, large language model
Abstract: The misuse of Large Language Models (LLMs) to infer emotions from text for malicious purposes, known as emotion inference attacks, poses a significant threat to user privacy. In this paper, we investigate the potential of Apple Intelligence's writing tools, integrated across iPhone, iPad, and MacBook, to mitigate these risks through text modifications such as rewriting and tone adjustment. By developing early novel datasets specifically for this purpose, we empirically assess how different text modifications influence LLM-based detection. This capability suggests strong potential for Apple Intelligence's writing tools as privacy-preserving mechanisms. Our findings lay the groundwork for future adaptive rewriting systems capable of dynamically neutralizing sensitive emotional content to enhance user privacy. To the best of our knowledge, this research provides the first empirical analysis of Apple Intelligence's text-modification tools within a privacy-preservation context with the broader goal of developing on-device, user-centric privacy-preserving mechanisms to protect against LLMs-based advanced inference attacks on deployed systems.

Title: JointSplat: Probabilistic Joint Flow-Depth Optimization for Sparse-View Gaussian Splatting

Authors: Yang Xiao, Guoan Xu, Qiang Wu, Wenjing Jia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03872
Pdf URL: https://arxiv.org/pdf/2506.03872
Copy Paste: [[2506.03872]] JointSplat: Probabilistic Joint Flow-Depth Optimization for Sparse-View Gaussian Splatting(https://arxiv.org/abs/2506.03872)
Keywords: robust
Abstract: Reconstructing 3D scenes from sparse viewpoints is a long-standing challenge with wide applications. Recent advances in feed-forward 3D Gaussian sparse-view reconstruction methods provide an efficient solution for real-time novel view synthesis by leveraging geometric priors learned from large-scale multi-view datasets and computing 3D Gaussian centers via back-projection. Despite offering strong geometric cues, both feed-forward multi-view depth estimation and flow-depth joint estimation face key limitations: the former suffers from mislocation and artifact issues in low-texture or repetitive regions, while the latter is prone to local noise and global inconsistency due to unreliable matches when ground-truth flow supervision is unavailable. To overcome this, we propose JointSplat, a unified framework that leverages the complementarity between optical flow and depth via a novel probabilistic optimization mechanism. Specifically, this pixel-level mechanism scales the information fusion between depth and flow based on the matching probability of optical flow during training. Building upon the above mechanism, we further propose a novel multi-view depth-consistency loss to leverage the reliability of supervision while suppressing misleading gradients in uncertain areas. Evaluated on RealEstate10K and ACID, JointSplat consistently outperforms state-of-the-art (SOTA) methods, demonstrating the effectiveness and robustness of our proposed probabilistic joint flow-depth optimization approach for high-fidelity sparse-view 3D reconstruction.

Title: RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03880
Pdf URL: https://arxiv.org/pdf/2506.03880
Copy Paste: [[2506.03880]] RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing(https://arxiv.org/abs/2506.03880)
Keywords: robust, transformer, large language model
Abstract: The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2\% and 5.8\% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.

Title: Video, How Do Your Tokens Merge?

Authors: Sam Pollard, Michael Wray
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03885
Pdf URL: https://arxiv.org/pdf/2506.03885
Copy Paste: [[2506.03885]] Video, How Do Your Tokens Merge?(https://arxiv.org/abs/2506.03885)
Keywords: transformer
Abstract: Video transformer models require huge amounts of compute resources due to the spatio-temporal scaling of the input. Tackling this, recent methods have proposed to drop or merge tokens for image models, whether randomly or via learned methods. Merging tokens has many benefits: it can be plugged into any vision transformer, does not require model re-training, and it propagates information that would otherwise be dropped through the model. Before now, video token merging has not been evaluated on temporally complex datasets for video understanding. In this work, we explore training-free token merging for video to provide comprehensive experiments and find best practices across four video transformers on three datasets that exhibit coarse and fine-grained action recognition. Our results showcase the benefits of video token merging with a speedup of around $2.5$X while maintaining accuracy (avg. $-0.55\%$ for ViViT). Code available at this https URL.

Title: Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems

Authors: Yuxin Zhang, Yan Wang, Yongrui Chen, Shenyu Zhang, Xinbang Dai, Sheng Bi, Guilin Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03901
Pdf URL: https://arxiv.org/pdf/2506.03901
Copy Paste: [[2506.03901]] Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems(https://arxiv.org/abs/2506.03901)
Keywords: robust, large language model
Abstract: Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating "magic mushroom" noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at the this https URL.

Title: Learning Fair And Effective Points-Based Rewards Programs

Authors: Chamsi Hssaine, Yichun Hu, Ciara Pike-Burke
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2506.03911
Pdf URL: https://arxiv.org/pdf/2506.03911
Copy Paste: [[2506.03911]] Learning Fair And Effective Points-Based Rewards Programs(https://arxiv.org/abs/2506.03911)
Keywords: fair
Abstract: Points-based rewards programs are a prevalent way to incentivize customer loyalty; in these programs, customers who make repeated purchases from a seller accumulate points, working toward eventual redemption of a free reward. These programs have recently come under scrutiny due to accusations of unfair practices in their implementation. Motivated by these concerns, we study the problem of fairly designing points-based rewards programs, with a focus on two obstacles that put fairness at odds with their effectiveness. First, due to customer heterogeneity, the seller should set different redemption thresholds for different customers to generate high revenue. Second, the relationship between customer behavior and the number of accumulated points is typically unknown; this requires experimentation which may unfairly devalue customers' previously earned points. We first show that an individually fair rewards program that uses the same redemption threshold for all customers suffers a loss in revenue of at most a factor of $1+\ln 2$, compared to the optimal personalized strategy that differentiates between customers. We then tackle the problem of designing temporally fair learning algorithms in the presence of demand uncertainty. Toward this goal, we design a learning algorithm that limits the risk of point devaluation due to experimentation by only changing the redemption threshold $O(\log T)$ times, over a horizon of length $T$. This algorithm achieves the optimal (up to polylogarithmic factors) $\widetilde{O}(\sqrt{T})$ regret in expectation. We then modify this algorithm to only ever decrease redemption thresholds, leading to improved fairness at a cost of only a constant factor in regret. Extensive numerical experiments show the limited value of personalization in average-case settings, in addition to demonstrating the strong practical performance of our proposed learning algorithms.

Title: When Fairness Isn't Statistical: The Limits of Machine Learning in Evaluating Legal Reasoning

Authors: Claire Barale, Michael Rovatsos, Nehal Bhuta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03913
Pdf URL: https://arxiv.org/pdf/2506.03913
Copy Paste: [[2506.03913]] When Fairness Isn't Statistical: The Limits of Machine Learning in Evaluating Legal Reasoning(https://arxiv.org/abs/2506.03913)
Keywords: fair
Abstract: Legal decisions are increasingly evaluated for fairness, consistency, and bias using machine learning (ML) techniques. In high-stakes domains like refugee adjudication, such methods are often applied to detect disparities in outcomes. Yet it remains unclear whether statistical methods can meaningfully assess fairness in legal contexts shaped by discretion, normative complexity, and limited ground truth. In this paper, we empirically evaluate three common ML approaches (feature-based analysis, semantic clustering, and predictive modeling) on a large, real-world dataset of 59,000+ Canadian refugee decisions (AsyLex). Our experiments show that these methods produce divergent and sometimes contradictory signals, that predictive modeling often depends on contextual and procedural features rather than legal features, and that semantic clustering fails to capture substantive legal reasoning. We show limitations of statistical fairness evaluation, challenge the assumption that statistical regularity equates to fairness, and argue that current computational approaches fall short of evaluating fairness in legally discretionary domains. We argue that evaluating fairness in law requires methods grounded not only in data, but in legal reasoning and institutional context.

Title: Learning equivariant models by discovering symmetries with learnable augmentations

Authors: Eduardo Santos Escriche, Stefanie Jegelka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03914
Pdf URL: https://arxiv.org/pdf/2506.03914
Copy Paste: [[2506.03914]] Learning equivariant models by discovering symmetries with learnable augmentations(https://arxiv.org/abs/2506.03914)
Keywords: robust, interpretability
Abstract: Recently, a trend has emerged that favors learning relevant symmetries from data in geometric domains instead of designing constrained architectures. To do so, two popular options are (1) to modify the training protocol, e.g., with a specific loss and data augmentations (soft equivariance), or (2) to ignore equivariance and infer it only implicitly. However, both options have limitations: soft equivariance requires a priori knowledge about relevant symmetries, while inferring symmetries merely via the task and larger data lacks interpretability. To address both limitations, we propose SEMoLA, an end-to-end approach that jointly (1) discovers a priori unknown symmetries in the data via learnable data augmentations, and (2) softly encodes the respective approximate equivariance into an arbitrary unconstrained model. Hence, it does not need prior knowledge about symmetries, it offers interpretability, and it maintains robustness to distribution shifts. Empirically, we demonstrate the ability of SEMoLA to robustly discover relevant symmetries while achieving high prediction accuracy across various datasets, encompassing multiple data modalities and underlying symmetry groups.

Title: Learning from Noise: Enhancing DNNs for Event-Based Vision through Controlled Noise Injection

Authors: Marcin Kowalczyk, Kamil Jeziorek, Tomasz Kryjak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03918
Pdf URL: https://arxiv.org/pdf/2506.03918
Copy Paste: [[2506.03918]] Learning from Noise: Enhancing DNNs for Event-Based Vision through Controlled Noise Injection(https://arxiv.org/abs/2506.03918)
Keywords: robust, transformer
Abstract: Event-based sensors offer significant advantages over traditional frame-based cameras, especially in scenarios involving rapid motion or challenging lighting conditions. However, event data frequently suffers from considerable noise, negatively impacting the performance and robustness of deep learning models. Traditionally, this problem has been addressed by applying filtering algorithms to the event stream, but this may also remove some of relevant data. In this paper, we propose a novel noise-injection training methodology designed to enhance the neural networks robustness against varying levels of event noise. Our approach introduces controlled noise directly into the training data, enabling models to learn noise-resilient representations. We have conducted extensive evaluations of the proposed method using multiple benchmark datasets (N-Caltech101, N-Cars, and Mini N-ImageNet) and various network architectures, including Convolutional Neural Networks, Vision Transformers, Spiking Neural Networks, and Graph Convolutional Networks. Experimental results show that our noise-injection training strategy achieves stable performance over a range of noise intensities, consistently outperforms event-filtering techniques, and achieves the highest average classification accuracy, making it a viable alternative to traditional event-data filtering methods in an object classification system. Code: this https URL

Title: HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Authors: Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03922
Pdf URL: https://arxiv.org/pdf/2506.03922
Copy Paste: [[2506.03922]] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models(https://arxiv.org/abs/2506.03922)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

Title: More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

Authors: Mohammadamin Shafiei, Hamidreza Saffari, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03923
Pdf URL: https://arxiv.org/pdf/2506.03923
Copy Paste: [[2506.03923]] More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning(https://arxiv.org/abs/2506.03923)
Keywords: robust, fair, large language model
Abstract: Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words ``more'', ``less'', or ``equal'' systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., ``a woman'', ``a Black person'') in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.

Title: Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample

Authors: Ze Feng, Jiang-Jiang Liu, Sen Yang, Lingyu Xiao, Xiaofan Li, Wankou Yang, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03928
Pdf URL: https://arxiv.org/pdf/2506.03928
Copy Paste: [[2506.03928]] Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample(https://arxiv.org/abs/2506.03928)
Keywords: large language model
Abstract: In this work, we study the Efficient Multimodal Large Language Model. Redundant vision tokens consume a significant amount of computational memory and resources. Therefore, many previous works compress them in the Vision Projector to reduce the number of vision tokens. However, simply compressing in the Vision Projector can lead to the loss of visual information, especially for tasks that rely on fine-grained spatial relationships, such as OCR and Chart \& Table Understanding. To address this problem, we propose Vision Remember, which is inserted between the LLM decoder layers to allow vision tokens to re-memorize vision features. Specifically, we retain multi-level vision features and resample them with the vision tokens that have interacted with the text token. During the resampling process, each vision token only attends to a local region in vision features, which is referred to as saliency-enhancing local attention. Saliency-enhancing local attention not only improves computational efficiency but also captures more fine-grained contextual information and spatial relationships within the region. Comprehensive experiments on multiple visual understanding benchmarks validate the effectiveness of our method when combined with various Efficient Vision Projectors, showing performance gains without sacrificing efficiency. Based on Vision Remember, LLaVA-VR with only 2B parameters is also superior to previous representative MLLMs such as Tokenpacker-HD-7B and DeepSeek-VL-7B.

Title: DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Authors: Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03933
Pdf URL: https://arxiv.org/pdf/2506.03933
Copy Paste: [[2506.03933]] DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models(https://arxiv.org/abs/2506.03933)
Keywords: secure, defense, attack, robust, diffusion
Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We observe that adding minimal noise to an adversarially corrupted image significantly alters its latent embedding with respect to VLMs. Building on this insight, DiffCAP cumulatively injects random Gaussian noise into adversarially perturbed input data. This process continues until the embeddings of two consecutive noisy images reach a predefined similarity threshold, indicating a potential approach to neutralize the adversarial effect. Subsequently, a pretrained diffusion model is employed to denoise the stabilized image, recovering a clean representation suitable for the VLMs to produce an output. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP consistently outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with strong theoretical and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments.

Title: Depermissioning Web3: a Permissionless Accountable RPC Protocol for Blockchain Networks

Authors: Weihong Wang, Tom Van Cutsem
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2506.03940
Pdf URL: https://arxiv.org/pdf/2506.03940
Copy Paste: [[2506.03940]] Depermissioning Web3: a Permissionless Accountable RPC Protocol for Blockchain Networks(https://arxiv.org/abs/2506.03940)
Keywords: privacy
Abstract: In blockchain networks, so-called "full nodes" serve data to and relay transactions from clients through an RPC interface. This serving layer enables integration of "Web3" data, stored on blockchains, with "Web2" mobile or web applications that cannot directly participate as peers in a blockchain network. In practice, the serving layer is dominated by a small number of centralized services ("node providers") that offer permissioned access to RPC endpoints. Clients register with these providers because they offer reliable and convenient access to blockchain data: operating a full node themselves requires significant computational and storage resources, and public (permissionless) RPC nodes lack financial incentives to serve large numbers of clients with consistent performance. Permissioned access to an otherwise permissionless blockchain network raises concerns regarding the privacy, integrity, and availability of data access. To address this, we propose a Permissionless Accountable RPC Protocol (PARP). It enables clients and full nodes to interact pseudonymously while keeping both parties accountable. PARP leverages "light client" schemes for essential data integrity checks, combined with fraud proofs, to keep full nodes honest and accountable. It integrates payment channels to facilitate micro-payments, holding clients accountable for the resources they consume and providing an economic incentive for full nodes to serve. Our prototype implementation for Ethereum demonstrates the feasibility of PARP, and we quantify its overhead compared to the base RPC protocol.

Title: Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation

Authors: Theodore Barfoot, Luis C. Garcia-Peraza-Herrera, Samet Akcay, Ben Glocker, Tom Vercauteren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03942
Pdf URL: https://arxiv.org/pdf/2506.03942
Copy Paste: [[2506.03942]] Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation(https://arxiv.org/abs/2506.03942)
Keywords: segmentation
Abstract: Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: this https URL

Title: Lower Ricci Curvature for Hypergraphs

Authors: Shiyi Yang, Can Chen, Didong Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03943
Pdf URL: https://arxiv.org/pdf/2506.03943
Copy Paste: [[2506.03943]] Lower Ricci Curvature for Hypergraphs(https://arxiv.org/abs/2506.03943)
Keywords: robust, interpretability, generative
Abstract: Networks with higher-order interactions, prevalent in biological, social, and information systems, are naturally represented as hypergraphs, yet their structural complexity poses fundamental challenges for geometric characterization. While curvature-based methods offer powerful insights in graph analysis, existing extensions to hypergraphs suffer from critical trade-offs: combinatorial approaches such as Forman-Ricci curvature capture only coarse features, whereas geometric methods like Ollivier-Ricci curvature offer richer expressivity but demand costly optimal transport computations. To address these challenges, we introduce hypergraph lower Ricci curvature (HLRC), a novel curvature metric defined in closed form that achieves a principled balance between interpretability and efficiency. Evaluated across diverse synthetic and real-world hypergraph datasets, HLRC consistently reveals meaningful higher-order organization, distinguishing intra- from inter-community hyperedges, uncovering latent semantic labels, tracking temporal dynamics, and supporting robust clustering of hypergraphs based on global structure. By unifying geometric sensitivity with algorithmic simplicity, HLRC provides a versatile foundation for hypergraph analytics, with broad implications for tasks including node classification, anomaly detection, and generative modeling in complex systems.

Title: HtFLlib: A Comprehensive Heterogeneous Federated Learning Library and Benchmark

Authors: Jianqing Zhang, Xinghao Wu, Yanbing Zhou, Xiaoting Sun, Qiqi Cai, Yang Liu, Yang Hua, Zhenzhe Zheng, Jian Cao, Qiang Yang
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2506.03954
Pdf URL: https://arxiv.org/pdf/2506.03954
Copy Paste: [[2506.03954]] HtFLlib: A Comprehensive Heterogeneous Federated Learning Library and Benchmark(https://arxiv.org/abs/2506.03954)
Keywords: robust, federate, fair
Abstract: As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at this https URL.

Title: Causality-Aware Contrastive Learning for Robust Multivariate Time-Series Anomaly Detection

Authors: HyunGi Kim, Jisoo Mok, Dongjun Lee, Jaihyun Lew, Sungjae Kim, Sungroh Yoon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03964
Pdf URL: https://arxiv.org/pdf/2506.03964
Copy Paste: [[2506.03964]] Causality-Aware Contrastive Learning for Robust Multivariate Time-Series Anomaly Detection(https://arxiv.org/abs/2506.03964)
Keywords: robust
Abstract: Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities. The code is available at this https URL.

Title: From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Authors: Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03968
Pdf URL: https://arxiv.org/pdf/2506.03968
Copy Paste: [[2506.03968]] From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding(https://arxiv.org/abs/2506.03968)
Keywords: large language model
Abstract: The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at this https URL.

Title: MS-YOLO: A Multi-Scale Model for Accurate and Efficient Blood Cell Detection

Authors: Guohua Wu, Shengqi Chen, Pengchao Deng, Wenting Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03972
Pdf URL: https://arxiv.org/pdf/2506.03972
Copy Paste: [[2506.03972]] MS-YOLO: A Multi-Scale Model for Accurate and Efficient Blood Cell Detection(https://arxiv.org/abs/2506.03972)
Keywords: robust
Abstract: Complete blood cell detection holds significant value in clinical diagnostics. Conventional manual microscopy methods suffer from time inefficiency and diagnostic inaccuracies. Existing automated detection approaches remain constrained by high deployment costs and suboptimal accuracy. While deep learning has introduced powerful paradigms to this field, persistent challenges in detecting overlapping cells and multi-scale objects hinder practical deployment. This study proposes the multi-scale YOLO (MS-YOLO), a blood cell detection model based on the YOLOv11 framework, incorporating three key architectural innovations to enhance detection performance. Specifically, the multi-scale dilated residual module (MS-DRM) replaces the original C3K2 modules to improve multi-scale discriminability; the dynamic cross-path feature enhancement module (DCFEM) enables the fusion of hierarchical features from the backbone with aggregated features from the neck to enhance feature representations; and the light adaptive-weight downsampling module (LADS) improves feature downsampling through adaptive spatial weighting while reducing computational complexity. Experimental results on the CBC benchmark demonstrate that MS-YOLO achieves precise detection of overlapping cells and multi-scale objects, particularly small targets such as platelets, achieving an mAP@50 of 97.4% that outperforms existing models. Further validation on the supplementary WBCDD dataset confirms its robust generalization capability. Additionally, with a lightweight architecture and real-time inference efficiency, MS-YOLO meets clinical deployment requirements, providing reliable technical support for standardized blood pathology assessment.

Title: Structured Pruning for Diverse Best-of-N Reasoning Optimization

Authors: Hieu Trung Nguyen, Bao Nguyen, Viet Anh Nguyen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03978
Pdf URL: https://arxiv.org/pdf/2506.03978
Copy Paste: [[2506.03978]] Structured Pruning for Diverse Best-of-N Reasoning Optimization(https://arxiv.org/abs/2506.03978)
Keywords: transformer
Abstract: Model pruning in transformer-based language models, traditionally viewed as a means of achieving computational savings, can enhance the model's reasoning capabilities. In this work, we uncover a surprising phenomenon: the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-$N$ and random head selection strategies on the MATH500 and GSM8K datasets.

Title: Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach

Authors: Haoxuan Chen, Yinuo Ren, Martin Renqiang Min, Lexing Ying, Zachary Izzo
Subjects: cs.LG, cs.CV, eess.IV, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03979
Pdf URL: https://arxiv.org/pdf/2506.03979
Copy Paste: [[2506.03979]] Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach(https://arxiv.org/abs/2506.03979)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have proven to be effective in modeling high-dimensional distributions, leading to their widespread adoption for representing complex priors in Bayesian inverse problems (BIPs). However, current DM-based posterior sampling methods proposed for solving common BIPs rely on heuristic approximations to the generative process. To exploit the generative capability of DMs and avoid the usage of such approximations, we propose an ensemble-based algorithm that performs posterior sampling without the use of heuristic approximations. Our algorithm is motivated by existing works that combine DM-based methods with the sequential Monte Carlo (SMC) method. By examining how the prior evolves through the diffusion process encoded by the pre-trained score function, we derive a modified partial differential equation (PDE) governing the evolution of the corresponding posterior distribution. This PDE includes a modified diffusion term and a reweighting term, which can be simulated via stochastic weighted particle methods. Theoretically, we prove that the error between the true posterior distribution can be bounded in terms of the training error of the pre-trained score function and the number of particles in the ensemble. Empirically, we validate our algorithm on several inverse problems in imaging to show that our method gives more accurate reconstructions compared to existing DM-based methods.

Title: RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors

Authors: Hicham Eddoubi, Jonas Ricker, Federico Cocchi, Lorenzo Baraldi, Angelo Sotgiu, Maura Pintor, Marcella Cornia, Lorenzo Baraldi, Asja Fischer, Rita Cucchiara, Battista Biggio
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03988
Pdf URL: https://arxiv.org/pdf/2506.03988
Copy Paste: [[2506.03988]] RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors(https://arxiv.org/abs/2506.03988)
Keywords: attack, robust
Abstract: AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector's adversarial robustnessOur findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at this https URL and evaluation code at this https URL.

Title: CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor

Authors: Han Ji, Yuqi Feng, Jiahao Fan, Yanan Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04001
Pdf URL: https://arxiv.org/pdf/2506.04001
Copy Paste: [[2506.04001]] CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor(https://arxiv.org/abs/2506.04001)
Keywords: interpretability
Abstract: Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67% top-1 accuracy on CIFAR-10 using DARTS.

Title: Vocabulary-free few-shot learning for Vision-Language Models

Authors: Maxime Zanella, Clément Fuchs, Ismail Ben Ayed, Christophe De Vleeschouwer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04005
Pdf URL: https://arxiv.org/pdf/2506.04005
Copy Paste: [[2506.04005]] Vocabulary-free few-shot learning for Vision-Language Models(https://arxiv.org/abs/2506.04005)
Keywords: interpretability
Abstract: Recent advances in few-shot adaptation for Vision-Language Models (VLMs) have greatly expanded their ability to generalize across tasks using only a few labeled examples. However, existing approaches primarily build upon the strong zero-shot priors of these models by leveraging carefully designed, task-specific prompts. This dependence on predefined class names can restrict their applicability, especially in scenarios where exact class names are unavailable or difficult to specify. To address this limitation, we introduce vocabulary-free few-shot learning for VLMs, a setting where target class instances - that is, images - are available but their corresponding names are not. We propose Similarity Mapping (SiM), a simple yet effective baseline that classifies target instances solely based on similarity scores with a set of generic prompts (textual or visual), eliminating the need for carefully handcrafted prompts. Although conceptually straightforward, SiM demonstrates strong performance, operates with high computational efficiency (learning the mapping typically takes less than one second), and provides interpretability by linking target classes to generic prompts. We believe that our approach could serve as an important baseline for future research in vocabulary-free few-shot learning. Code is available at this https URL.

Title: Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Authors: Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04034
Pdf URL: https://arxiv.org/pdf/2506.04034
Copy Paste: [[2506.04034]] Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning(https://arxiv.org/abs/2506.04034)
Keywords: robust, interpretability
Abstract: Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based RL learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.

Title: Privacy and Security Threat for OpenAI GPTs

Authors: Wei Wenying, Zhao Kaifa, Xue Lei, Fan Ming
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04036
Pdf URL: https://arxiv.org/pdf/2506.04036
Copy Paste: [[2506.04036]] Privacy and Security Threat for OpenAI GPTs(https://arxiv.org/abs/2506.04036)
Keywords: security, privacy, defense, attack, large language model
Abstract: Large language models (LLMs) demonstrate powerful information handling capabilities and are widely integrated into chatbot applications. OpenAI provides a platform for developers to construct custom GPTs, extending ChatGPT's functions and integrating external services. Since its release in November 2023, over 3 million custom GPTs have been created. However, such a vast ecosystem also conceals security and privacy threats. For developers, instruction leaking attacks threaten the intellectual property of instructions in custom GPTs through carefully crafted adversarial prompts. For users, unwanted data access behavior by custom GPTs or integrated third-party services raises significant privacy concerns. To systematically evaluate the scope of threats in real-world LLM applications, we develop three phases instruction leaking attacks target GPTs with different defense level. Our widespread experiments on 10,000 real-world custom GPTs reveal that over 98.8% of GPTs are vulnerable to instruction leaking attacks via one or more adversarial prompts, and half of the remaining GPTs can also be attacked through multiround conversations. We also developed a framework to assess the effectiveness of defensive strategies and identify unwanted behaviors in custom GPTs. Our findings show that 77.5% of custom GPTs with defense strategies are vulnerable to basic instruction leaking attacks. Additionally, we reveal that 738 custom GPTs collect user conversational information, and identified 8 GPTs exhibiting data access behaviors that are unnecessary for their intended functionalities. Our findings raise awareness among GPT developers about the importance of integrating specific defensive strategies in their instructions and highlight users' concerns about data privacy when using LLM-based applications.

Title: Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization

Authors: Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.04039
Pdf URL: https://arxiv.org/pdf/2506.04039
Copy Paste: [[2506.04039]] Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization(https://arxiv.org/abs/2506.04039)
Keywords: large language model
Abstract: Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to the modality misalignment and the inherent hallucinations of their underlying Large Language Models (LLMs) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment than existing human preference alignment methods. Besides, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.

Title: Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness

Authors: Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Ji Xiang, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04042
Pdf URL: https://arxiv.org/pdf/2506.04042
Copy Paste: [[2506.04042]] Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness(https://arxiv.org/abs/2506.04042)
Keywords: large language model
Abstract: Knowledge editing aims to alternate the target knowledge predicted by large language models while ensuring the least side effects on unrelated knowledge. An effective way to achieve knowledge editing is to identify pivotal parameters for predicting factual associations and modify them with an optimization process to update the predictions. However, these locate-then-edit methods are uncontrollable since they tend to modify most unrelated relations connected to the subject of target editing. We unveil that this failure of controllable editing is due to a shortcut learning issue during the optimization process. Specifically, we discover two crucial features that are the subject feature and the relation feature for models to learn during optimization, but the current optimization process tends to over-learning the subject feature while neglecting the relation feature. To eliminate this shortcut learning of the subject feature, we propose a novel two-stage optimization process that balances the learning of the subject feature and the relation feature. Experimental results demonstrate that our approach successfully prevents knowledge editing from shortcut learning and achieves the optimal overall performance, contributing to controllable knowledge editing.

Title: Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate

Authors: Mikel K. Ngueajio, Flor Miriam Plaza-del-Arco, Yi-Ling Chung, Danda B. Rawat, Amanda Cercas Curry
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04043
Pdf URL: https://arxiv.org/pdf/2506.04043
Copy Paste: [[2506.04043]] Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate(https://arxiv.org/abs/2506.04043)
Keywords: robust, large language model
Abstract: Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.

Title: Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs

Authors: Aleksey Kudelya, Alexander Shirnin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04044
Pdf URL: https://arxiv.org/pdf/2506.04044
Copy Paste: [[2506.04044]] Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs(https://arxiv.org/abs/2506.04044)
Keywords: large language model
Abstract: This paper describes LIBU (LoRA enhanced influence-based unlearning), an algorithm to solve the task of unlearning - removing specific knowledge from a large language model without retraining from scratch and compromising its overall utility (SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models). The algorithm combines classical \textit{influence functions} to remove the influence of the data from the model and \textit{second-order optimization} to stabilize the overall utility. Our experiments show that this lightweight approach is well applicable for unlearning LLMs in different kinds of task.

Title: On Support Samples of Next Word Prediction

Authors: Yuqian Li, Yupei Du, Yufang Liu, Feifei Feng, Mou Xiao Feng, Yuanbin Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04047
Pdf URL: https://arxiv.org/pdf/2506.04047
Copy Paste: [[2506.04047]] On Support Samples of Next Word Prediction(https://arxiv.org/abs/2506.04047)
Keywords: interpretability
Abstract: Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates \emph{data-centric interpretability} in language models, focusing on the next-word prediction task. Using representer theorem, we identify two types of \emph{support samples}-those that either promote or deter specific predictions. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation this http URL insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.

Title: EV-Flying: an Event-based Dataset for In-The-Wild Recognition of Flying Objects

Authors: Gabriele Magrini, Federico Becattini, Giovanni Colombo, Pietro Pala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04048
Pdf URL: https://arxiv.org/pdf/2506.04048
Copy Paste: [[2506.04048]] EV-Flying: an Event-based Dataset for In-The-Wild Recognition of Flying Objects(https://arxiv.org/abs/2506.04048)
Keywords: security, robust
Abstract: Monitoring aerial objects is crucial for security, wildlife conservation, and environmental studies. Traditional RGB-based approaches struggle with challenges such as scale variations, motion blur, and high-speed object movements, especially for small flying entities like insects and drones. In this work, we explore the potential of event-based vision for detecting and recognizing flying objects, in particular animals that may not follow short and long-term predictable patters. Event cameras offer high temporal resolution, low latency, and robustness to motion blur, making them well-suited for this task. We introduce EV-Flying, an event-based dataset of flying objects, comprising manually annotated birds, insects and drones with spatio-temporal bounding boxes and track identities. To effectively process the asynchronous event streams, we employ a point-based approach leveraging lightweight architectures inspired by PointNet. Our study investigates the classification of flying objects using point cloud-based event representations. The proposed dataset and methodology pave the way for more efficient and reliable aerial object recognition in real-world scenarios.

Title: Explainability-Based Token Replacement on LLM-Generated Text

Authors: Hadi Mohammadi, Anastasia Giachanou, Daniel L. Oberski, Ayoub Bagheri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04050
Pdf URL: https://arxiv.org/pdf/2506.04050
Copy Paste: [[2506.04050]] Explainability-Based Token Replacement on LLM-Generated Text(https://arxiv.org/abs/2506.04050)
Keywords: robust, explainability, generative, large language model
Abstract: Generative models, especially large language models (LLMs), have shown remarkable progress in producing text that appears human-like. However, they often exhibit patterns that make their output easier to detect than text written by humans. In this paper, we investigate how explainable AI (XAI) methods can be used to reduce the detectability of AI-generated text (AIGT) while also introducing a robust ensemble-based detection approach. We begin by training an ensemble classifier to distinguish AIGT from human-written text, then apply SHAP and LIME to identify tokens that most strongly influence its predictions. We propose four explainability-based token replacement strategies to modify these influential tokens. Our findings show that these token replacement approaches can significantly diminish a single classifier's ability to detect AIGT. However, our ensemble classifier maintains strong performance across multiple languages and domains, showing that a multi-model approach can mitigate the impact of token-level manipulations. These results show that XAI methods can make AIGT harder to detect by focusing on the most influential tokens. At the same time, they highlight the need for robust, ensemble-based detection strategies that can adapt to evolving approaches for hiding AIGT.

Title: High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

Authors: Tim Franzmeyer, Archie Sravankumar, Lijuan Liu, Yuning Mao, Rui Hou, Sinong Wang, Jakob N. Foerster, Luke Zettlemoyer, Madian Khabsa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04051
Pdf URL: https://arxiv.org/pdf/2506.04051
Copy Paste: [[2506.04051]] High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning(https://arxiv.org/abs/2506.04051)
Keywords: large language model
Abstract: Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.

Title: Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence

Authors: Alexander Semenenko, Ivan Butakov, Alexey Frolov, Ivan Oseledets
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2506.04053
Pdf URL: https://arxiv.org/pdf/2506.04053
Copy Paste: [[2506.04053]] Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence(https://arxiv.org/abs/2506.04053)
Keywords: robust, extraction
Abstract: Sliced Mutual Information (SMI) is widely used as a scalable alternative to mutual information for measuring non-linear statistical dependence. Despite its advantages, such as faster convergence, robustness to high dimensionality, and nullification only under statistical independence, we demonstrate that SMI is highly susceptible to data manipulation and exhibits counterintuitive behavior. Through extensive benchmarking and theoretical analysis, we show that SMI saturates easily, fails to detect increases in statistical dependence (even under linear transformations designed to enhance the extraction of information), prioritizes redundancy over informative content, and in some cases, performs worse than simpler dependence measures like the correlation coefficient.

Title: Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning

Authors: Muling Wu, Qi Qian, Wenhao Liu, Xiaohua Wang, Zisu Huang, Di Liang, LI Miao, Shihan Dou, Changze Lv, Zhenghua Wang, Zhibo Xu, Lina Chen, Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04065
Pdf URL: https://arxiv.org/pdf/2506.04065
Copy Paste: [[2506.04065]] Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning(https://arxiv.org/abs/2506.04065)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible difficulty samples processing. To address these limitations, we propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce model-adaptive difficulty definition that customizes curriculum datasets based on each model's individual capabilities rather than using predefined difficulty metrics. Second, we develop "Guided Prompting," which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance. Comprehensive experiments on supervised fine-tuning and reinforcement learning demonstrate that CCL significantly outperforms uniform training approaches across five mathematical reasoning benchmarks, confirming its effectiveness across both paradigms in enhancing sample utilization and model performance.

Title: Optimal Transport-based Domain Alignment as a Preprocessing Step for Federated Learning

Authors: Luiz Manella Pereira, M. Hadi Amini
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04071
Pdf URL: https://arxiv.org/pdf/2506.04071
Copy Paste: [[2506.04071]] Optimal Transport-based Domain Alignment as a Preprocessing Step for Federated Learning(https://arxiv.org/abs/2506.04071)
Keywords: privacy, federate
Abstract: Federated learning (FL) is a subfield of machine learning that avoids sharing local data with a central server, which can enhance privacy and scalability. The inability to consolidate data leads to a unique problem called dataset imbalance, where agents in a network do not have equal representation of the labels one is trying to learn to predict. In FL, fusing locally-trained models with unbalanced datasets may deteriorate the performance of global model aggregation, and reduce the quality of updated local models and the accuracy of the distributed agents' decisions. In this work, we introduce an Optimal Transport-based preprocessing algorithm that aligns the datasets by minimizing the distributional discrepancy of data along the edge devices. We accomplish this by leveraging Wasserstein barycenters when computing channel-wise averages. These barycenters are collected in a trusted central server where they collectively generate a target RGB space. By projecting our dataset towards this target space, we minimize the distributional discrepancy on a global level, which facilitates the learning process due to a minimization of variance across the samples. We demonstrate the capabilities of the proposed approach over the CIFAR-10 dataset, where we show its capability of reaching higher degrees of generalization in fewer communication rounds.

Title: Controlling Difficulty of Generated Text for AI-Assisted Language Learning

Authors: Meiqing Jin, Liam Dugan, Chris Callison-Burch
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.04072
Pdf URL: https://arxiv.org/pdf/2506.04072
Copy Paste: [[2506.04072]] Controlling Difficulty of Generated Text for AI-Assisted Language Learning(https://arxiv.org/abs/2506.04072)
Keywords: large language model
Abstract: Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques -- specifically modular methods that do not require model fine-tuning -- can adapt LLM outputs to better support absolute beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails to control output difficulty, the use of future discriminators (Yang and Klein, 2021) significantly improves output comprehensibility (from 40.4\% to 84.3\%). We further introduce a novel token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.

Title: A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions

Authors: Chung-Chun Wang, Jhen-Ke Lin, Hao-Chien Lu, Hong-Yun Lin, Berlin Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04077
Pdf URL: https://arxiv.org/pdf/2506.04077
Copy Paste: [[2506.04077]] A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions(https://arxiv.org/abs/2506.04077)
Keywords: large language model
Abstract: Automated speaking assessment (ASA) on opinion expressions is often hampered by the scarcity of labeled recordings, which restricts prompt diversity and undermines scoring reliability. To address this challenge, we propose a novel training paradigm that leverages a large language models (LLM) to generate diverse responses of a given proficiency level, converts responses into synthesized speech via speaker-aware text-to-speech synthesis, and employs a dynamic importance loss to adaptively reweight training instances based on feature distribution differences between synthesized and real speech. Subsequently, a multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly. Experiments conducted on the LTTC dataset show that our approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.

Title: LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Authors: Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04078
Pdf URL: https://arxiv.org/pdf/2506.04078
Copy Paste: [[2506.04078]] LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation(https://arxiv.org/abs/2506.04078)
Keywords: large language model
Abstract: Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in this https URL.

Title: EuroLLM-9B: Technical Report

Authors: Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04079
Pdf URL: https://arxiv.org/pdf/2506.04079
Copy Paste: [[2506.04079]] EuroLLM-9B: Technical Report(https://arxiv.org/abs/2506.04079)
Keywords: large language model
Abstract: This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.

Title: Multimodal Tabular Reasoning with Privileged Structured Information

Authors: Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04088
Pdf URL: https://arxiv.org/pdf/2506.04088
Copy Paste: [[2506.04088]] Multimodal Tabular Reasoning with Privileged Structured Information(https://arxiv.org/abs/2506.04088)
Keywords: extraction, large language model
Abstract: Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation ({\sc Turbo}), a new framework for multimodal tabular reasoning with privileged structured tables. {\sc Turbo} benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, {\sc Turbo} repeatedly generates and selects the advantageous reasoning paths, further enhancing the model's tabular reasoning ability. Experimental results demonstrate that, with limited ($9$k) data, {\sc Turbo} achieves state-of-the-art performance ($+7.2\%$ vs. previous SOTA) across multiple datasets.

Title: AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

Authors: Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Subjects: cs.LG, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2506.04089
Pdf URL: https://arxiv.org/pdf/2506.04089
Copy Paste: [[2506.04089]] AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment(https://arxiv.org/abs/2506.04089)
Keywords: large language model
Abstract: As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), the fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at this https URL.

Title: TextAtari: 100K Frames Game Playing with Language Agents

Authors: Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04098
Pdf URL: https://arxiv.org/pdf/2506.04098
Copy Paste: [[2506.04098]] TextAtari: 100K Frames Game Playing with Language Agents(https://arxiv.org/abs/2506.04098)
Keywords: large language model
Abstract: We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios-Basic, Obscured, Manual Augmentation, and Reference-based-investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.

Title: Rectified Sparse Attention

Authors: Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04108
Pdf URL: https://arxiv.org/pdf/2506.04108
Copy Paste: [[2506.04108]] Rectified Sparse Attention(https://arxiv.org/abs/2506.04108)
Keywords: large language model
Abstract: Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at this https URL.

Title: Multi-view Surface Reconstruction Using Normal and Reflectance Cues

Authors: Robin Bruneau, Baptiste Brument, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis Durou, Lilian Calvet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04115
Pdf URL: https://arxiv.org/pdf/2506.04115
Copy Paste: [[2506.04115]] Multi-view Surface Reconstruction Using Normal and Reflectance Cues(https://arxiv.org/abs/2506.04115)
Keywords: robust
Abstract: Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data relative to this article is available at this https URL.

Title: Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Authors: Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04118
Pdf URL: https://arxiv.org/pdf/2506.04118
Copy Paste: [[2506.04118]] Guided Speculative Inference for Efficient Test-Time Alignment of LLMs(https://arxiv.org/abs/2506.04118)
Keywords: large language model
Abstract: We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate the optimal tilted policy $\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the primary model $\pi_B$. We derive a theoretical bound on the KL divergence between our induced distribution and the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math), our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$. The code is available at this https URL .

Title: CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues

Authors: Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04131
Pdf URL: https://arxiv.org/pdf/2506.04131
Copy Paste: [[2506.04131]] CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues(https://arxiv.org/abs/2506.04131)
Keywords: robust, fair
Abstract: Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at this https URL.

Title: MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.04141
Pdf URL: https://arxiv.org/pdf/2506.04141
Copy Paste: [[2506.04141]] MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos(https://arxiv.org/abs/2506.04141)
Keywords: large language model
Abstract: The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded for multi-modal reasoning differs from it in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.

Title: Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04142
Pdf URL: https://arxiv.org/pdf/2506.04142
Copy Paste: [[2506.04142]] Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis(https://arxiv.org/abs/2506.04142)
Keywords: fair, large language model
Abstract: The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous researches have focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient ($\rho$) exceeding 0.95. This high correlation indicates that our method closely reveals true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: this https URL

Title: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization

Authors: Sarvesh Soni, Dina Demner-Fushman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04156
Pdf URL: https://arxiv.org/pdf/2506.04156
Copy Paste: [[2506.04156]] A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization(https://arxiv.org/abs/2506.04156)
Keywords: robust, large language model
Abstract: Patients have distinct information needs about their hospitalization that can be addressed using clinical evidence from electronic health records (EHRs). While artificial intelligence (AI) systems show promise in meeting these needs, robust datasets are needed to evaluate the factual accuracy and relevance of AI-generated responses. To our knowledge, no existing dataset captures patient information needs in the context of their EHRs. We introduce ArchEHR-QA, an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. The cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. To establish benchmarks for grounded EHR question answering (QA), we evaluated three open-weight large language models (LLMs)--Llama 4, Llama 3, and Mixtral--across three prompting strategies: generating (1) answers with citations to clinical note sentences, (2) answers before citations, and (3) answers from filtered citations. We assessed performance on two dimensions: Factuality (overlap between cited note sentences and ground truth) and Relevance (textual and semantic similarity between system and reference answers). The final dataset contains 134 patient cases. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores. Manual error analysis supported these findings and revealed common issues such as omitted key clinical evidence and contradictory or hallucinated content. Overall, ArchEHR-QA provides a strong benchmark for developing and evaluating patient-centered EHR QA systems, underscoring the need for further progress toward generating factual and relevant responses in clinical contexts.

Title: Image Editing As Programs with Diffusion Models

Authors: Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04158
Pdf URL: https://arxiv.org/pdf/2506.04158
Copy Paste: [[2506.04158]] Image Editing As Programs with Diffusion Models(https://arxiv.org/abs/2506.04158)
Keywords: robust, diffusion, transformer
Abstract: While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at this https URL.

Title: N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion

Authors: Caleb Chin, Aashish Khubchandani, Harshvardhan Maskara, Kyuseong Choi, Jacob Feitelberg, Albert Gong, Manit Paul, Tathagata Sadhukhan, Anish Agarwal, Raaz Dwivedi
Subjects: cs.LG, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04166
Pdf URL: https://arxiv.org/pdf/2506.04166
Copy Paste: [[2506.04166]] N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion(https://arxiv.org/abs/2506.04166)
Keywords: robust
Abstract: Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.

Title: Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints

Authors: Utkarsh Utkarsh, Pengfei Cai, Alan Edelman, Rafael Gomez-Bombarelli, Christopher Vincent Rackauckas
Subjects: cs.LG, cs.AI, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2506.04171
Pdf URL: https://arxiv.org/pdf/2506.04171
Copy Paste: [[2506.04171]] Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints(https://arxiv.org/abs/2506.04171)
Keywords: generative
Abstract: Deep generative models have recently been applied to physical systems governed by partial differential equations (PDEs), offering scalable simulation and uncertainty-aware inference. However, enforcing physical constraints, such as conservation laws (linear and nonlinear) and physical consistencies, remains challenging. Existing methods often rely on soft penalties or architectural biases that fail to guarantee hard constraints. In this work, we propose Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear constraints in pretrained flow-based generative models. PCFM continuously guides the sampling process through physics-based corrections applied to intermediate solution states, while remaining aligned with the learned flow and satisfying physical constraints. Empirically, PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution. Our method provides a general framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.

Title: Does Prompt Design Impact Quality of Data Imputation by LLMs?

Authors: Shreenidhi Srinivasan, Lydia Manikonda
Subjects: cs.LG, cs.ET
Abstract URL: https://arxiv.org/abs/2506.04172
Pdf URL: https://arxiv.org/pdf/2506.04172
Copy Paste: [[2506.04172]] Does Prompt Design Impact Quality of Data Imputation by LLMs?(https://arxiv.org/abs/2506.04172)
Keywords: large language model
Abstract: Generating realistic synthetic tabular data presents a critical challenge in machine learning. It adds another layer of complexity when this data contain class imbalance problems. This paper presents a novel token-aware data imputation method that leverages the in-context learning capabilities of large language models. This is achieved through the combination of a structured group-wise CSV-style prompting technique and the elimination of irrelevant contextual information in the input prompt. We test this approach with two class-imbalanced binary classification datasets and evaluate the effectiveness of imputation using classification-based evaluation metrics. The experimental results demonstrate that our approach significantly reduces the input prompt size while maintaining or improving imputation quality compared to our baseline prompt, especially for datasets that are of relatively smaller in size. The contributions of this presented work is two-fold -- 1) it sheds light on the importance of prompt design when leveraging LLMs for synthetic data generation and 2) it addresses a critical gap in LLM-based data imputation for class-imbalanced datasets with missing data by providing a practical solution within computational constraints. We hope that our work will foster further research and discussions about leveraging the incredible potential of LLMs and prompt engineering techniques for synthetic data generation.

Title: SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling

Authors: Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04179
Pdf URL: https://arxiv.org/pdf/2506.04179
Copy Paste: [[2506.04179]] SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling(https://arxiv.org/abs/2506.04179)
Keywords: large language model
Abstract: Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: this https URL.

Title: SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04180
Pdf URL: https://arxiv.org/pdf/2506.04180
Copy Paste: [[2506.04180]] SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models(https://arxiv.org/abs/2506.04180)
Keywords: large language model
Abstract: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking-through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.

Title: R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

Authors: Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.04185
Pdf URL: https://arxiv.org/pdf/2506.04185
Copy Paste: [[2506.04185]] R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning(https://arxiv.org/abs/2506.04185)
Keywords: large language model
Abstract: Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at this https URL.

Title: TracLLM: A Generic Framework for Attributing Long Context LLMs

Authors: Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04202
Pdf URL: https://arxiv.org/pdf/2506.04202
Copy Paste: [[2506.04202]] TracLLM: A Generic Framework for Attributing Long Context LLMs(https://arxiv.org/abs/2506.04202)
Keywords: attack, large language model
Abstract: Long context large language models (LLMs) are deployed in many real-world applications such as RAG, agent, and broad LLM-integrated applications. Given an instruction and a long context (e.g., documents, PDF files, webpages), a long context LLM can generate an output grounded in the provided context, aiming to provide more accurate, up-to-date, and verifiable outputs while reducing hallucinations and unsupported claims. This raises a research question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs) in the context that contribute most to or are responsible for the generated output by an LLM? This process, which we call context traceback, has various real-world applications, such as 1) debugging LLM-based systems, 2) conducting post-attack forensic analysis for attacks (e.g., prompt injection attack, knowledge corruption attacks) to an LLM, and 3) highlighting knowledge sources to enhance the trust of users towards outputs generated by LLMs. When applied to context traceback for long context LLMs, existing feature attribution methods such as Shapley have sub-optimal performance and/or incur a large computational cost. In this work, we develop TracLLM, the first generic context traceback framework tailored to long context LLMs. Our framework can improve the effectiveness and efficiency of existing feature attribution methods. To improve the efficiency, we develop an informed search based algorithm in TracLLM. We also develop contribution score ensemble/denoising techniques to improve the accuracy of TracLLM. Our evaluation results show TracLLM can effectively identify texts in a long context that lead to the output of an LLM. Our code and data are at: this https URL.

Title: EPiC: Towards Lossless Speedup for Reasoning Training through Edge-Preserving CoT Condensation

Authors: Jinghan Jia, Hadi Reisizadeh, Chongyu Fan, Nathalie Baracaldo, Mingyi Hong, Sijia Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04205
Pdf URL: https://arxiv.org/pdf/2506.04205
Copy Paste: [[2506.04205]] EPiC: Towards Lossless Speedup for Reasoning Training through Edge-Preserving CoT Condensation(https://arxiv.org/abs/2506.04205)
Keywords: large language model
Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities when trained with chain-of-thought (CoT) supervision. However, the long and verbose CoT traces, especially those distilled from large reasoning models (LRMs) such as DeepSeek-R1, significantly increase training costs during the distillation process, where a non-reasoning base model is taught to replicate the reasoning behavior of an LRM. In this work, we study the problem of CoT condensation for resource-efficient reasoning training, aimed at pruning intermediate reasoning steps (i.e., thoughts) in CoT traces, enabling supervised model training on length-reduced CoT data while preserving both answer accuracy and the model's ability to generate coherent reasoning. Our rationale is that CoT traces typically follow a three-stage structure: problem understanding, exploration, and solution convergence. Through empirical analysis, we find that retaining the structure of the reasoning trace, especially the early stage of problem understanding (rich in reflective cues) and the final stage of solution convergence, is sufficient to achieve lossless reasoning supervision. To this end, we propose an Edge-Preserving Condensation method, EPiC, which selectively retains only the initial and final segments of each CoT trace while discarding the middle portion. This design draws an analogy to preserving the "edge" of a reasoning trajectory, capturing both the initial problem framing and the final answer synthesis, to maintain logical continuity. Experiments across multiple model families (Qwen and LLaMA) and benchmarks show that EPiC reduces training time by over 34% while achieving lossless reasoning accuracy on MATH500, comparable to full CoT supervision. To the best of our knowledge, this is the first study to explore thought-level CoT condensation for efficient reasoning model distillation.

Title: Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Authors: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04207
Pdf URL: https://arxiv.org/pdf/2506.04207
Copy Paste: [[2506.04207]] Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning(https://arxiv.org/abs/2506.04207)
Keywords: large language model
Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.

Title: Language-Image Alignment with Fixed Text Encoders

Authors: Jingfeng Yang, Ziyang Wu, Yue Zhao, Yi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04209
Pdf URL: https://arxiv.org/pdf/2506.04209
Copy Paste: [[2506.04209]] Language-Image Alignment with Fixed Text Encoders(https://arxiv.org/abs/2506.04209)
Keywords: large language model
Abstract: Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. In particular, we investigate if a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework LIFT is highly effective and it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.

Title: Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector

Authors: Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04211
Pdf URL: https://arxiv.org/pdf/2506.04211
Copy Paste: [[2506.04211]] Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector(https://arxiv.org/abs/2506.04211)
Keywords: diffusion, generative
Abstract: Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable feature from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic}, surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting broadly applicable and effective domain adaptation capability of our DDT. The code is available at this https URL.

Title: FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Authors: Xuanhua He, Quande Liu, Zixuan Ye, Wecai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04213
Pdf URL: https://arxiv.org/pdf/2506.04213
Copy Paste: [[2506.04213]] FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers(https://arxiv.org/abs/2506.04213)
Keywords: diffusion, transformer
Abstract: Fine-grained and efficient controllability on video diffusion transformers has raised increasing desires for the applicability. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in original in-context conditioning video generation framework. We begin with systematic analysis to identify two key sources of the computation inefficiencies: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and 2-3 times speedup in averaged time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at \href{this https URL}{this https URL}.

Title: Sounding that Object: Interactive Object-Aware Image to Audio Generation

Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
Subjects: cs.CV, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04214
Pdf URL: https://arxiv.org/pdf/2506.04214
Copy Paste: [[2506.04214]] Sounding that Object: Interactive Object-Aware Image to Audio Generation(https://arxiv.org/abs/2506.04214)
Keywords: diffusion, segmentation
Abstract: Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an {\em interactive object-aware audio generation} model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the {\em object} level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: this https URL

Title: UNIC: Unified In-Context Video Editing

Authors: Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04216
Pdf URL: https://arxiv.org/pdf/2506.04216
Copy Paste: [[2506.04216]] UNIC: Unified In-Context Video Editing(https://arxiv.org/abs/2506.04216)
Keywords: generative
Abstract: Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks. This allows our approach to adaptively perform different video editing tasks by referring the source video and varying condition tokens "in context", and support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.

Title: Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

Authors: Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson W.H. Lau, Wangmeng Zuo, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04225
Pdf URL: https://arxiv.org/pdf/2506.04225
Copy Paste: [[2506.04225]] Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation(https://arxiv.org/abs/2506.04225)
Keywords: diffusion
Abstract: Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: A unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observation to ensure global coherence 2) Long-Range World Exploration: An efficient world cache with point culling and an auto-regressive inference with smooth video sampling for iterative scene extension with context-aware consistency, and 3) Scalable Data Engine: A video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Collectively, these designs result in a clear improvement over existing methods in visual quality and geometric accuracy, with versatile applications.

Title: LayerFlow: A Unified Model for Layer-aware Video Generation

Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04228
Pdf URL: https://arxiv.org/pdf/2506.04228
Copy Paste: [[2506.04228]] LayerFlow: A Unified Model for Layer-aware Video Generation(https://arxiv.org/abs/2506.04228)
Keywords: diffusion, transformer
Abstract: We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. For the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on the mixture of image data with high-quality layered images along with copy-pasted video data. During inference, we remove the motion LoRA thus generating smooth videos with desired layers.