2025-06-04

Title: Research on Medical Named Entity Identification Based On Prompt-Biomrc Model and Its Application in Intelligent Consultation System

Authors: Jinzhu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01961
Pdf URL: https://arxiv.org/pdf/2506.01961
Copy Paste: [[2506.01961]] Research on Medical Named Entity Identification Based On Prompt-Biomrc Model and Its Application in Intelligent Consultation System(https://arxiv.org/abs/2506.01961)
Keywords: extraction
Abstract: This study is dedicated to exploring the application of prompt learning methods to advance Named Entity Recognition (NER) within the medical domain. In recent years, the emergence of large-scale models has driven significant progress in NER tasks, particularly with the introduction of the BioBERT language model, which has greatly enhanced NER capabilities in medical texts. Our research introduces the Prompt-bioMRC model, which integrates both hard template and soft prompt designs aimed at refining the precision and efficiency of medical entity recognition. Through extensive experimentation across diverse medical datasets, our findings consistently demonstrate that our approach surpasses traditional models. This enhancement not only validates the efficacy of our methodology but also highlights its potential to provide reliable technological support for applications like intelligent diagnosis systems. By leveraging advanced NER techniques, this study contributes to advancing automated medical data processing, facilitating more accurate medical information extraction, and supporting efficient healthcare decision-making processes.

Title: Graph-Based Adversarial Domain Generalization with Anatomical Correlation Knowledge for Cross-User Human Activity Recognition

Authors: Xiaozhou Ye, Kevin I-Kai Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01962
Pdf URL: https://arxiv.org/pdf/2506.01962
Copy Paste: [[2506.01962]] Graph-Based Adversarial Domain Generalization with Anatomical Correlation Knowledge for Cross-User Human Activity Recognition(https://arxiv.org/abs/2506.01962)
Keywords: robust
Abstract: Cross-user variability poses a significant challenge in sensor-based Human Activity Recognition (HAR) systems, as traditional models struggle to generalize across users due to differences in behavior, sensor placement, and data distribution. To address this, we propose GNN-ADG (Graph Neural Network with Adversarial Domain Generalization), a novel method that leverages both the strength from both the Graph Neural Networks (GNNs) and adversarial learning to achieve robust cross-user generalization. GNN-ADG models spatial relationships between sensors on different anatomical body parts, extracting three types of Anatomical Units: (1) Interconnected Units, capturing inter-relations between neighboring sensors; (2) Analogous Units, grouping sensors on symmetrical or functionally similar body parts; and (3) Lateral Units, connecting sensors based on their position to capture region-specific coordination. These units information are fused into an unified graph structure with a cyclic training strategy, dynamically integrating spatial, functional, and lateral correlations to facilitate a holistic, user-invariant representation. Information fusion mechanism of GNN-ADG occurs by iteratively cycling through edge topologies during training, allowing the model to refine its understanding of inter-sensor relationships across diverse perspectives. By representing the spatial configuration of sensors as an unified graph and incorporating adversarial learning, Information Fusion GNN-ADG effectively learns features that generalize well to unseen users without requiring target user data during training, making it practical for real-world applications.

Title: Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

Authors: Andrew Kiruluta, Preethi Raju, Priscilla Burity
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.01963
Pdf URL: https://arxiv.org/pdf/2506.01963
Copy Paste: [[2506.01963]] Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons(https://arxiv.org/abs/2506.01963)
Keywords: transformer, large language model
Abstract: We present a novel non attention based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely. Instead, it combines the following complementary components: State Space blocks (inspired by S4) that learn continuous time convolution kernels and scale near linearly with sequence length, Multi Resolution Convolution layers that capture local context at different dilation levels, a lightweight Recurrent Supervisor to maintain a global hidden state across sequential chunks, and Retrieval Augmented External Memory that stores and retrieves high-level chunk embeddings without reintroducing quadratic operations.

Title: TaskVAE: Task-Specific Variational Autoencoders for Exemplar Generation in Continual Learning for Human Activity Recognition

Authors: Bonpagna Kann, Sandra Castellanos-Paez, Romain Rombourg, Philippe Lalanda
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01965
Pdf URL: https://arxiv.org/pdf/2506.01965
Copy Paste: [[2506.01965]] TaskVAE: Task-Specific Variational Autoencoders for Exemplar Generation in Continual Learning for Human Activity Recognition(https://arxiv.org/abs/2506.01965)
Keywords: robust
Abstract: As machine learning based systems become more integrated into daily life, they unlock new opportunities but face the challenge of adapting to dynamic data environments. Various forms of data shift-gradual, abrupt, or cyclic-threaten model accuracy, making continual adaptation essential. Continual Learning (CL) enables models to learn from evolving data streams while minimizing forgetting of prior knowledge. Among CL strategies, replay-based methods have proven effective, but their success relies on balancing memory constraints and retaining old class accuracy while learning new classes. This paper presents TaskVAE, a framework for replay-based CL in class-incremental settings. TaskVAE employs task-specific Variational Autoencoders (VAEs) to generate synthetic exemplars from previous tasks, which are then used to train the classifier alongside new task data. In contrast to traditional methods that require prior knowledge of the total class count or rely on a single VAE for all tasks, TaskVAE adapts flexibly to increasing tasks without such constraints. We focus on Human Activity Recognition (HAR) using IMU sensor-equipped devices. Unlike previous HAR studies that combine data across all users, our approach focuses on individual user data, better reflecting real-world scenarios where a person progressively learns new activities. Extensive experiments on 5 different HAR datasets show that TaskVAE outperforms experience replay methods, particularly with limited data, and exhibits robust performance as dataset size increases. Additionally, memory footprint of TaskVAE is minimal, being equivalent to only 60 samples per task, while still being able to generate an unlimited number of synthetic samples. The contributions lie in balancing memory constraints, task-specific generation, and long-term stability, making it a reliable solution for real-world applications in domains like HAR.

Title: Matrix Is All You Need

Authors: Yuzhou Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01966
Pdf URL: https://arxiv.org/pdf/2506.01966
Copy Paste: [[2506.01966]] Matrix Is All You Need(https://arxiv.org/abs/2506.01966)
Keywords: transformer
Abstract: Deep neural networks employ specialized architectures for vision, sequential and language tasks, yet this proliferation obscures their underlying commonalities. We introduce a unified matrix-order framework that casts convolutional, recurrent and self-attention operations as sparse matrix multiplications. Convolution is realized via an upper-triangular weight matrix performing first-order transformations; recurrence emerges from a lower-triangular matrix encoding stepwise updates; attention arises naturally as a third-order tensor factorization. We prove algebraic isomorphism with standard CNN, RNN and Transformer layers under mild assumptions. Empirical evaluations on image classification (MNIST, CIFAR-10/100, Tiny ImageNet), time-series forecasting (ETTh1, Electricity Load Diagrams) and language modeling/classification (AG News, WikiText-2, Penn Treebank) confirm that sparse-matrix formulations match or exceed native model performance while converging in comparable or fewer epochs. By reducing architecture design to sparse pattern selection, our matrix perspective aligns with GPU parallelism and leverages mature algebraic optimization tools. This work establishes a mathematically rigorous substrate for diverse neural architectures and opens avenues for principled, hardware-aware network design.

Title: Turning LLM Activations Quantization-Friendly

Authors: Patrik Czakó, Gábor Kertész, Sándor Szénási
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.01967
Pdf URL: https://arxiv.org/pdf/2506.01967
Copy Paste: [[2506.01967]] Turning LLM Activations Quantization-Friendly(https://arxiv.org/abs/2506.01967)
Keywords: large language model
Abstract: Quantization effectively reduces the serving costs of Large Language Models (LLMs) by speeding up data movement through compressed parameters and enabling faster operations via integer arithmetic. However, activating integer arithmetic requires quantizing both weights and activations, which poses challenges due to the significant outliers in LLMs that increase quantization error. In this work, we investigate these outliers with an emphasis on their effect on layer-wise quantization error, then examine how smoothing and rotation transform the observed values. Our primary contributions include introducing a new metric to measure and visualize quantization difficulty based on channel magnitudes, as well as proposing a hybrid approach that applies channel-wise scaling before rotation, supported by a mathematical formulation of its benefits.

Title: Johnny: Structuring Representation Space to Enhance Machine Abstract Reasoning Ability

Authors: Ruizhuo Song, Beiming Yuan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.01970
Pdf URL: https://arxiv.org/pdf/2506.01970
Copy Paste: [[2506.01970]] Johnny: Structuring Representation Space to Enhance Machine Abstract Reasoning Ability(https://arxiv.org/abs/2506.01970)
Keywords: extraction, transformer
Abstract: This paper thoroughly investigates the challenges of enhancing AI's abstract reasoning capabilities, with a particular focus on Raven's Progressive Matrices (RPM) tasks involving complex human-like concepts. Firstly, it dissects the empirical reality that traditional end-to-end RPM-solving models heavily rely on option pool configurations, highlighting that this dependency constrains the model's reasoning capabilities. To address this limitation, the paper proposes the Johnny architecture - a novel representation space-based framework for RPM-solving. Through the synergistic operation of its Representation Extraction Module and Reasoning Module, Johnny significantly enhances reasoning performance by supplementing primitive negative option configurations with a learned representation space. Furthermore, to strengthen the model's capacity for capturing positional relationships among local features, the paper introduces the Spin-Transformer network architecture, accompanied by a lightweight Straw Spin-Transformer variant that reduces computational overhead through parameter sharing and attention mechanism optimization. Experimental evaluations demonstrate that both Johnny and Spin-Transformer achieve superior performance on RPM tasks, offering innovative methodologies for advancing AI's abstract reasoning capabilities.

Title: Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN

Authors: Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, Xuemin Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01977
Pdf URL: https://arxiv.org/pdf/2506.01977
Copy Paste: [[2506.01977]] Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN(https://arxiv.org/abs/2506.01977)
Keywords: diffusion, generative
Abstract: Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. Recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem, then leveraging a generative diffusion model to predict node matching between two graphs, from which both the GED and its corresponding edit path can be extracted using a traditional algorithm. However, such methods typically rely heavily on ground-truth supervision, where the ground-truth labels are often costly to obtain in real-world scenarios. In this paper, we propose GEDRanker, a novel unsupervised GAN-based framework for GED computation. Specifically, GEDRanker consists of a matching-based GED solver and introduces an interpretable preference-aware discriminator with an effective training strategy to guide the matching-based GED solver toward generating high-quality node matching without the need for ground-truth labels. Extensive experiments on benchmark datasets demonstrate that our GEDRanker enables the matching-based GED solver to achieve near-optimal solution quality without any ground-truth supervision.

Title: Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification

Authors: Reyhaneh Keshavarzpour, Eghbal Mansoori
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01983
Pdf URL: https://arxiv.org/pdf/2506.01983
Copy Paste: [[2506.01983]] Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification(https://arxiv.org/abs/2506.01983)
Keywords: generative
Abstract: Identification of antimicrobial peptides is an important and necessary issue in today's era. Antimicrobial peptides are essential as an alternative to antibiotics for biomedical applications and many other practical applications. These oligopeptides are useful in drug design and cause innate immunity against microorganisms. Artificial intelligence algorithms have played a significant role in the ease of identifying these this http URL research is improved by improving proposed method in the field of antimicrobial peptides prediction. Suggested method is improved by combining the best coding method from different perspectives, In the following a deep neural network to balance the imbalanced combined datasets. The results of this research show that the proposed method have a significant improvement in the accuracy and efficiency of the prediction of antimicrobial peptides and are able to provide the best results compared to the existing methods. These development in the field of prediction and classification of antimicrobial peptides, basically in the fields of medicine and pharmaceutical industries, have high effectiveness and application.

Title: SpecMemo: Speculative Decoding is in Your Pocket

Authors: Selin Yildirim, Deming Chen
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2506.01986
Pdf URL: https://arxiv.org/pdf/2506.01986
Copy Paste: [[2506.01986]] SpecMemo: Speculative Decoding is in Your Pocket(https://arxiv.org/abs/2506.01986)
Keywords: robust, large language model
Abstract: Recent advancements in speculative decoding have demonstrated considerable speedup across a wide array of large language model (LLM) tasks. Speculative decoding inherently relies on sacrificing extra memory allocations to generate several candidate tokens, of which acceptance rate drives the speedup. However, deploying speculative decoding on memory-constrained devices, such as mobile GPUs, remains as a significant challenge in real-world scenarios. In this work, we present a device-aware inference engine named SpecMemo that can smartly control memory allocations at finer levels to enable multi-turn chatbots with speculative decoding on such limited memory devices. Our methodology stems from theoretically modeling memory footprint of speculative decoding to determine a lower bound on the required memory budget while retaining speedup. SpecMemo empirically acquires a careful balance between minimizing redundant memory allocations for rejected candidate tokens and maintaining competitive performance gains from speculation. Notably, with SpecMemo's memory management, we maintain 96% of overall throughput from speculative decoding on MT-Bench, with reduced generation-memory by 65% on single Nvidia Titan RTX. Given multiple constrained GPUs, we build on top of previous speculative decoding architectures to facilitate big-model inference by distributing Llama-2-70B-Chat model, on which we provide novel batched speculative decoding to increase usability of multiple small server GPUs. This novel framework demonstrates 2x speedup over distributed and batched vanilla decoding with the base model on eight AMD MI250 GPUs. Moreover, inference throughput increases remarkably 8x with batch size 10. Our work contributes to democratized LLM applications in resource-constrained environments, providing a pathway for faster and cheaper deployment of real-world LLM applications with robust performance.

Title: Surrogate Interpretable Graph for Random Decision Forests

Authors: Akshat Dubey, Aleksandar Anžel, Georges Hattab
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01988
Pdf URL: https://arxiv.org/pdf/2506.01988
Copy Paste: [[2506.01988]] Surrogate Interpretable Graph for Random Decision Forests(https://arxiv.org/abs/2506.01988)
Keywords: robust, interpretability
Abstract: The field of health informatics has been profoundly influenced by the development of random forest models, which have led to significant advances in the interpretability of feature interactions. These models are characterized by their robustness to overfitting and parallelization, making them particularly useful in this domain. However, the increasing number of features and estimators in random forests can prevent domain experts from accurately interpreting global feature interactions, thereby compromising trust and regulatory compliance. A method called the surrogate interpretability graph has been developed to address this issue. It uses graphs and mixed-integer linear programming to analyze and visualize feature interactions. This improves their interpretability by visualizing the feature usage per decision-feature-interaction table and the most dominant hierarchical decision feature interactions for predictions. The implementation of a surrogate interpretable graph enhances global interpretability, which is critical for such a high-stakes domain.

Title: Coded Robust Aggregation for Distributed Learning under Byzantine Attacks

Authors: Chengxi Li, Ming Xiao, Mikael Skoglund
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.01989
Pdf URL: https://arxiv.org/pdf/2506.01989
Copy Paste: [[2506.01989]] Coded Robust Aggregation for Distributed Learning under Byzantine Attacks(https://arxiv.org/abs/2506.01989)
Keywords: attack, robust
Abstract: In this paper, we investigate the problem of distributed learning (DL) in the presence of Byzantine attacks. For this problem, various robust bounded aggregation (RBA) rules have been proposed at the central server to mitigate the impact of Byzantine attacks. However, current DL methods apply RBA rules for the local gradients from the honest devices and the disruptive information from Byzantine devices, and the learning performance degrades significantly when the local gradients of different devices vary considerably from each other. To overcome this limitation, we propose a new DL method to cope with Byzantine attacks based on coded robust aggregation (CRA-DL). Before training begins, the training data are allocated to the devices redundantly. During training, in each iteration, the honest devices transmit coded gradients to the server computed from the allocated training data, and the server then aggregates the information received from both honest and Byzantine devices using RBA rules. In this way, the global gradient can be approximately recovered at the server to update the global model. Compared with current DL methods applying RBA rules, the improvement of CRA-DL is attributed to the fact that the coded gradients sent by the honest devices are closer to each other. This closeness enhances the robustness of the aggregation against Byzantine attacks, since Byzantine messages tend to be significantly different from those of honest devices in this case. We theoretically analyze the convergence performance of CRA-DL. Finally, we present numerical results to verify the superiority of the proposed method over existing baselines, showing its enhanced learning performance under Byzantine attacks.

Title: No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success

Authors: Lukas Rauch, Moritz Wirth, Denis Huseljic, Marek Herde, Bernhard Sick, Matthias Aßenmacher
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.01992
Pdf URL: https://arxiv.org/pdf/2506.01992
Copy Paste: [[2506.01992]] No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success(https://arxiv.org/abs/2506.01992)
Keywords: robust, large language model
Abstract: The advent of large language models (LLMs) capable of producing general-purpose representations lets us revisit the practicality of deep active learning (AL): By leveraging frozen LLM embeddings, we can mitigate the computational costs of iteratively fine-tuning large backbones. This study establishes a benchmark and systematically investigates the influence of LLM embedding quality on query strategies in deep AL. We employ five top-performing models from the massive text embedding benchmark (MTEB) leaderboard and two baselines for ten diverse text classification tasks. Our findings reveal key insights: First, initializing the labeled pool using diversity-based sampling synergizes with high-quality embeddings, boosting performance in early AL iterations. Second, the choice of the optimal query strategy is sensitive to embedding quality. While the computationally inexpensive Margin sampling can achieve performance spikes on specific datasets, we find that strategies like Badge exhibit greater robustness across tasks. Importantly, their effectiveness is often enhanced when paired with higher-quality embeddings. Our results emphasize the need for context-specific evaluation of AL strategies, as performance heavily depends on embedding quality and the target task.

Title: NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

Authors: Abhay Gupta, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02000
Pdf URL: https://arxiv.org/pdf/2506.02000
Copy Paste: [[2506.02000]] NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts(https://arxiv.org/abs/2506.02000)
Keywords: robust, large language model
Abstract: Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate k1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate six state-of-the-art (SOTA) models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We noticed consistent accuracy drops with increased hops and context length, even in frontier models-revealing that sheer scale does not guarantee robust reasoning. Our failure mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to stress-test multi-hop reasoning at scale.

Title: CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge

Authors: Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02010
Pdf URL: https://arxiv.org/pdf/2506.02010
Copy Paste: [[2506.02010]] CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge(https://arxiv.org/abs/2506.02010)
Keywords: extraction
Abstract: This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor CNVSRC 2023, which involves CN-CVS for training and CNVSRC-Single/Multi for development and evaluation. However, CNVSRC 2024 introduced two key improvements: (1) a stronger baseline system, and (2) an additional dataset, CN-CVS2-P1, for open tracks to improve data volume and diversity. The new challenge has demonstrated several important innovations in data preprocessing, feature extraction, model design, and training strategies, further pushing the state-of-the-art in Chinese LVC-VSR. More details and resources are available at the official website.

Title: Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

Authors: Zehua Liu, Xiaolou Li, Li Guo, Lantian Li, Dong Wang
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02012
Pdf URL: https://arxiv.org/pdf/2506.02012
Copy Paste: [[2506.02012]] Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing(https://arxiv.org/abs/2506.02012)
Keywords: large language model
Abstract: Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and provides three key contributions: (1) Scaling Test: We study how the LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide the LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that by these designs, the great potential of LLMs can be largely harnessed, leading to significant VSR performance improvement.

Title: Object-centric Self-improving Preference Optimization for Text-to-Image Generation

Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02015
Pdf URL: https://arxiv.org/pdf/2506.02015
Copy Paste: [[2506.02015]] Object-centric Self-improving Preference Optimization for Text-to-Image Generation(https://arxiv.org/abs/2506.02015)
Keywords: large language model
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly improved both image understanding and generation capabilities. Despite these improvements, MLLMs still struggle with fine-grained visual comprehension, particularly in text-to-image generation tasks. While preference optimization methods have been explored to address these limitations in image understanding tasks, their application to image generation remains largely underexplored. To address this gap, we propose an Object-centric Self-improving Preference Optimization (OSPO) framework designed for text-to-image generation by MLLMs. OSPO leverages the intrinsic reasoning abilities of MLLMs without requiring any external datasets or models. OSPO emphasizes the importance of high-quality preference pair data, which is critical for effective preference optimization. To achieve this, it introduces a self-improving mechanism that autonomously constructs object-level contrastive preference pairs through object-centric prompt perturbation, densification and VQA scoring. This process eliminates ambiguous or disproportionate variations commonly found in naively generated preference pairs, thereby enhancing the effectiveness of preference optimization. We validate OSPO on three representative compositional text-to-image benchmarks, demonstrating substantial performance gains over baseline models.

Title: Are classical deep neural networks weakly adversarially robust?

Authors: Nuolin Sun, Linyuan Wang, Dongyang Li, Bin Yan, Lei Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02016
Pdf URL: https://arxiv.org/pdf/2506.02016
Copy Paste: [[2506.02016]] Are classical deep neural networks weakly adversarially robust?(https://arxiv.org/abs/2506.02016)
Keywords: defense, attack, robust
Abstract: Adversarial attacks have received increasing attention and it has been widely recognized that classical DNNs have weak adversarial robustness. The most commonly used adversarial defense method, adversarial training, improves the adversarial accuracy of DNNs by generating adversarial examples and retraining the model. However, adversarial training requires a significant computational overhead. In this paper, inspired by existing studies focusing on the clustering properties of DNN output features at each layer and the Progressive Feedforward Collapse phenomenon, we propose a method for adversarial example detection and image recognition that uses layer-wise features to construct feature paths and computes the correlation between the examples feature paths and the class-centered feature paths. Experimental results show that the recognition method achieves 82.77% clean accuracy and 44.17% adversarial accuracy on the ResNet-20 with PFC. Compared to the adversarial training method with 77.64% clean accuracy and 52.94% adversarial accuracy, our method exhibits a trade-off without relying on computationally expensive defense strategies. Furthermore, on the standard ResNet-18, our method maintains this advantage with respective metrics of 80.01% and 46.1%. This result reveals inherent adversarial robustness in DNNs, challenging the conventional understanding of the weak adversarial robustness in DNNs.

Title: Fairness through Feedback: Addressing Algorithmic Misgendering in Automatic Gender Recognition

Authors: Camilla Quaresmini, Giacomo Zanotti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02017
Pdf URL: https://arxiv.org/pdf/2506.02017
Copy Paste: [[2506.02017]] Fairness through Feedback: Addressing Algorithmic Misgendering in Automatic Gender Recognition(https://arxiv.org/abs/2506.02017)
Keywords: fair
Abstract: Automatic Gender Recognition (AGR) systems are an increasingly widespread application in the Machine Learning (ML) landscape. While these systems are typically understood as detecting gender, they often classify datapoints based on observable features correlated at best with either male or female sex. In addition to questionable binary assumptions, from an epistemological point of view, this is problematic for two reasons. First, there exists a gap between the categories the system is meant to predict (woman versus man) and those onto which their output reasonably maps (female versus male). What is more, gender cannot be inferred on the basis of such observable features. This makes AGR tools often unreliable, especially in the case of non-binary and gender non-conforming people. We suggest a theoretical and practical rethinking of AGR systems. To begin, distinctions are made between sex, gender, and gender expression. Then, we build upon the observation that, unlike algorithmic misgendering, human-human misgendering is open to the possibility of re-evaluation and correction. We suggest that analogous dynamics should be recreated in AGR, giving users the possibility to correct the system's output. While implementing such a feedback mechanism could be regarded as diminishing the system's autonomy, it represents a way to significantly increase fairness levels in AGR. This is consistent with the conceptual change of paradigm that we advocate for AGR systems, which should be understood as tools respecting individuals' rights and capabilities of self-expression and determination.

Title: Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

Authors: Christopher Lee Lübbers
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02018
Pdf URL: https://arxiv.org/pdf/2506.02018
Copy Paste: [[2506.02018]] Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data(https://arxiv.org/abs/2506.02018)
Keywords: robust
Abstract: Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.

Title: ChatCFD: an End-to-End CFD Agent with Domain-specific Structured Thinking

Authors: E Fan, Weizong Wang, Tianhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02019
Pdf URL: https://arxiv.org/pdf/2506.02019
Copy Paste: [[2506.02019]] ChatCFD: an End-to-End CFD Agent with Domain-specific Structured Thinking(https://arxiv.org/abs/2506.02019)
Keywords: large language model
Abstract: Computational Fluid Dynamics (CFD) is essential for scientific and engineering advancements but is limited by operational complexity and the need for extensive expertise. This paper presents ChatCFD, a large language model-driven pipeline that automates CFD workflows within the OpenFOAM framework. It enables users to configure and execute complex simulations from natural language prompts or published literature with minimal expertise. The innovation is its structured approach to database construction, configuration validation, and error reflection, integrating CFD and OpenFOAM knowledge with general language models to improve accuracy and adaptability. Validation shows ChatCFD can autonomously reproduce published CFD results, handling complex, unseen configurations beyond basic examples, a task challenging for general language models.

Title: Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying

Authors: Youze Xue, Dian Li, Gang Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02020
Pdf URL: https://arxiv.org/pdf/2506.02020
Copy Paste: [[2506.02020]] Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying(https://arxiv.org/abs/2506.02020)
Keywords: large language model
Abstract: With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on this https URL.

Title: Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Authors: Aditya Kanade, Tanuja Ganu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02022
Pdf URL: https://arxiv.org/pdf/2506.02022
Copy Paste: [[2506.02022]] Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs(https://arxiv.org/abs/2506.02022)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at this https URL.

Title: The End Of Universal Lifelong Identifiers: Identity Systems For The AI Era

Authors: Shriphani Palakodety
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.02027
Pdf URL: https://arxiv.org/pdf/2506.02027
Copy Paste: [[2506.02027]] The End Of Universal Lifelong Identifiers: Identity Systems For The AI Era(https://arxiv.org/abs/2506.02027)
Keywords: privacy
Abstract: Many identity systems assign a single, static identifier to an individual for life, reused across domains like healthcare, finance, and education. These Universal Lifelong Identifiers (ULIs) underpin critical workflows but now pose systemic privacy risks. We take the position that ULIs are fundamentally incompatible with the AI era and must be phased out. We articulate a threat model grounded in modern AI capabilities and show that traditional safeguards such as redaction, consent, and access controls are no longer sufficient. We define core properties for identity systems in the AI era and present a cryptographic framework that satisfies them while retaining compatibility with existing identifier workflows. Our design preserves institutional workflows, supports essential functions such as auditability and delegation, and offers a practical migration path beyond ULIs.

Title: A tertiary review on quantum cryptography

Authors: Luiz Filipi Anderson de Sousa Moura, Carlos Becker Westphall
Subjects: cs.CR, cs.NI, physics.optics
Abstract URL: https://arxiv.org/abs/2506.02028
Pdf URL: https://arxiv.org/pdf/2506.02028
Copy Paste: [[2506.02028]] A tertiary review on quantum cryptography(https://arxiv.org/abs/2506.02028)
Keywords: security, attack
Abstract: Quantum computers impose an immense threat to system security. As a countermeasure, new cryptographic classes have been created to prevent these attacks. Technologies such as post-quantum cryptography and quantum cryptography. Quantum cryptography uses the principle of quantum physics to produce theoretically unbreakable security. This tertiary review selected 51 secondary studies from the Scopus database and presented bibliometric analysis, a list of the main techniques used in the field, and existing open challenges and future directions in quantum cryptography research. The results showed a prevalence of QKD over other techniques among the selected papers and stated that the field still faces many problems related to implementation cost, error correction, decoherence, key rates, communication distance, and quantum hacking.

Title: Adaptive Privacy-Preserving SSD

Authors: Na Young Ahn, Dong Hoon Lee
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02030
Pdf URL: https://arxiv.org/pdf/2506.02030
Copy Paste: [[2506.02030]] Adaptive Privacy-Preserving SSD(https://arxiv.org/abs/2506.02030)
Keywords: privacy
Abstract: Data remanence in NAND flash complicates complete deletion on IoT SSDs. We design an adaptive architecture offering four privacy levels (PL0-PL3) that select among address, data, and parity deletion techniques. Quantitative analysis balances efficacy, latency, endurance, and cost. Machine-learning adjusts levels contextually, boosting privacy with negligible performance overhead and complexity.

Title: Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges

Authors: Raj Patel, Himanshu Tripathi, Jasper Stone, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi, Vini Chaudhary
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02032
Pdf URL: https://arxiv.org/pdf/2506.02032
Copy Paste: [[2506.02032]] Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges(https://arxiv.org/abs/2506.02032)
Keywords: secure, security, defense, attack, robust
Abstract: The rapid adoption of machine learning (ML) technologies has driven organizations across diverse sectors to seek efficient and reliable methods to accelerate model development-to-deployment. Machine Learning Operations (MLOps) has emerged as an integrative approach addressing these requirements by unifying relevant roles and streamlining ML workflows. As the MLOps market continues to grow, securing these pipelines has become increasingly critical. However, the unified nature of MLOps ecosystem introduces vulnerabilities, making them susceptible to adversarial attacks where a single misconfiguration can lead to compromised credentials, severe financial losses, damaged public trust, and the poisoning of training data. Our paper presents a systematic application of the MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework, a comprehensive and continuously updated catalog of AI-focused attacks, to systematically assess attacks across different phases of the MLOps ecosystem. We begin by examining the preparatory phases during which adversaries acquire the essential intelligence required to initiate their attacks. We then present a structured taxonomy of attack techniques explicitly mapped to corresponding phases of the MLOps ecosystem, supported by examples drawn from red-teaming exercises and real-world incidents. This is followed by a taxonomy of mitigation strategies aligned with these attack categories, offering actionable early-stage defenses to strengthen the security of MLOps ecosystem. Given the rapid evolution and adoption of MLOps, we further highlight key research gaps that require immediate attention. Our work emphasizes the importance of implementing robust security protocols from the outset, empowering practitioners to safeguard MLOps ecosystem against evolving cyber attacks.

Title: Asymmetry by Design: Boosting Cyber Defenders with Differential Access to AI

Authors: Shaun Ee, Chris Covino, Cara Labrador, Christina Krawec, Jam Kraprayoon, Joe O'Brien
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2506.02035
Pdf URL: https://arxiv.org/pdf/2506.02035
Copy Paste: [[2506.02035]] Asymmetry by Design: Boosting Cyber Defenders with Differential Access to AI(https://arxiv.org/abs/2506.02035)
Keywords: security, defense
Abstract: As AI-enabled cyber capabilities become more advanced, we propose "differential access" as a strategy to tilt the cybersecurity balance toward defense by shaping access to these capabilities. We introduce three possible approaches that form a continuum, becoming progressively more restrictive for higher-risk capabilities: Promote Access, Manage Access, and Deny by Default. However, a key principle across all approaches is the need to prioritize defender access, even in the most restrictive scenarios, so that defenders can prepare for adversaries gaining access to similar capabilities. This report provides a process to help frontier AI developers choose and implement one of the three differential access approaches, including considerations based on a model's cyber capabilities, a defender's maturity and role, and strategic and technical implementation details. We also present four example schemes for defenders to reference, demonstrating how differential access provides value across various capability and defender levels, and suggest directions for further research.

Title: FinS-Pilot: A Benchmark for Online Financial System

Authors: Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02037
Pdf URL: https://arxiv.org/pdf/2506.02037
Copy Paste: [[2506.02037]] FinS-Pilot: A Benchmark for Online Financial System(https://arxiv.org/abs/2506.02037)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various professional domains, with their performance typically evaluated through standardized benchmarks. However, the development of financial RAG benchmarks has been constrained by data confidentiality issues and the lack of dynamic data integration. To address this issue, we introduces FinS-Pilot, a novel benchmark for evaluating RAG systems in online financial applications. Constructed from real-world financial assistant interactions, our benchmark incorporates both real-time API data and structured text sources, organized through an intent classification framework covering critical financial domains such as equity analysis and macroeconomic forecasting. The benchmark enables comprehensive evaluation of financial assistants' capabilities in handling both static knowledge and time-sensitive market information. Through systematic experiments with multiple Chinese leading LLMs, we demonstrate FinS-Pilot's effectiveness in identifying models suitable for financial applications while addressing the current gap in specialized evaluation tools for the financial domain. Our work contributes both a practical evaluation framework and a curated dataset to advance research in financial NLP systems. The code and dataset are accessible on GitHub\footnote{this https URL\_rag\_benchmark}.

Title: Blockchain Powered Edge Intelligence for U-Healthcare in Privacy Critical and Time Sensitive Environment

Authors: Anum Nawaz, Hafiz Humza Mahmood Ramzan, Xianjia Yu, Zhuo Zou, Tomi Westerlund
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02038
Pdf URL: https://arxiv.org/pdf/2506.02038
Copy Paste: [[2506.02038]] Blockchain Powered Edge Intelligence for U-Healthcare in Privacy Critical and Time Sensitive Environment(https://arxiv.org/abs/2506.02038)
Keywords: secure, security, privacy, robust
Abstract: Edge Intelligence (EI) serves as a critical enabler for privacy-preserving systems by providing AI-empowered computation and distributed caching services at the edge, thereby minimizing latency and enhancing data privacy. The integration of blockchain technology further augments EI frameworks by ensuring transactional transparency, auditability, and system-wide reliability through a decentralized network model. However, the operational architecture of such systems introduces inherent vulnerabilities, particularly due to the extensive data interactions between edge gateways (EGs) and the distributed nature of information storage during service provisioning. To address these challenges, we propose an autonomous computing model along with its interaction topologies tailored for privacy-critical and time-sensitive health applications. The system supports continuous monitoring, real-time alert notifications, disease detection, and robust data processing and aggregation. It also includes a data transaction handler and mechanisms for ensuring privacy at the EGs. Moreover, a resource-efficient one-dimensional convolutional neural network (1D-CNN) is proposed for the multiclass classification of arrhythmia, enabling accurate and real-time analysis of constrained EGs. Furthermore, a secure access scheme is defined to manage both off-chain and on-chain data sharing and storage. To validate the proposed model, comprehensive security, performance, and cost analyses are conducted, demonstrating the efficiency and reliability of the fine-grained access control scheme.

Title: Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

Authors: Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, Jiachi Chen
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.02040
Pdf URL: https://arxiv.org/pdf/2506.02040
Copy Paste: [[2506.02040]] Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem(https://arxiv.org/abs/2506.02040)
Keywords: security, attack, robust, large language model
Abstract: The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources. Within a short period, thousands of MCP services have already been developed and deployed. However, the client-server integration architecture inherent in MCP may expand the attack surface against LLM Agent systems, introducing new vulnerabilities that allow attackers to exploit by designing malicious MCP servers. In this paper, we present the first systematic study of attack vectors targeting the MCP ecosystem. Our analysis identifies four categories of attacks, i.e., Tool Poisoning Attacks, Puppet Attacks, Rug Pull Attacks, and Exploitation via Malicious External Resources. To evaluate the feasibility of these attacks, we conduct experiments following the typical steps of launching an attack through malicious MCP servers: upload-download-attack. Specifically, we first construct malicious MCP servers and successfully upload them to three widely used MCP aggregation platforms. The results indicate that current audit mechanisms are insufficient to identify and prevent the proposed attack methods. Next, through a user study and interview with 20 participants, we demonstrate that users struggle to identify malicious MCP servers and often unknowingly install them from aggregator platforms. Finally, we demonstrate that these attacks can trigger harmful behaviors within the user's local environment-such as accessing private files or controlling devices to transfer digital assets-by deploying a proof-of-concept (PoC) framework against five leading LLMs. Additionally, based on interview results, we discuss four key challenges faced by the current security ecosystem surrounding MCP servers. These findings underscore the urgent need for robust security mechanisms to defend against malicious MCP servers.

Title: Enhancing Multimodal Continual Instruction Tuning with BranchLoRA

Authors: Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, Jinfeng Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02041
Pdf URL: https://arxiv.org/pdf/2506.02041
Copy Paste: [[2506.02041]] Enhancing Multimodal Continual Instruction Tuning with BranchLoRA(https://arxiv.org/abs/2506.02041)
Keywords: large language model
Abstract: Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.

Title: Docker under Siege: Securing Containers in the Modern Era

Authors: Gogulakrishnan Thiyagarajan, Prabhudarshi Nayak
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02043
Pdf URL: https://arxiv.org/pdf/2506.02043
Copy Paste: [[2506.02043]] Docker under Siege: Securing Containers in the Modern Era(https://arxiv.org/abs/2506.02043)
Keywords: security, protect
Abstract: Containerization, driven by Docker, has transformed application development and deployment by enhancing efficiency and scalability. However, the rapid adoption of container technologies introduces significant security challenges that require careful management. This paper investigates key areas of container security, including runtime protection, network safeguards, configuration best practices, supply chain security, and comprehensive monitoring and logging solutions. We identify common vulnerabilities within these domains and provide actionable recommendations to address and mitigate these risks. By integrating security throughout the Software Development Lifecycle (SDLC), organizations can reinforce their security posture, creating a resilient and reliable containerized application infrastructure that withstands evolving threats.

Title: Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

Authors: Lajos Muzsai, David Imolai, András Lukács
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02048
Pdf URL: https://arxiv.org/pdf/2506.02048
Copy Paste: [[2506.02048]] Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges(https://arxiv.org/abs/2506.02048)
Keywords: security, large language model
Abstract: Large Language Models (LLMs) still struggle with the structured reasoning and tool-assisted computation needed for problem solving in cybersecurity applications. In this work, we introduce "random-crypto", a cryptographic Capture-the-Flag (CTF) challenge generator framework that we use to fine-tune a tool-augmented Llama-3.1-8B with Guided Reinforcement Prompt Optimisation (GRPO), allowing the agent to iteratively write and execute Python inside an isolated REPL. GRPO yields a +53% absolute jump in Pass@8 on unseen "random-crypto" tasks (0.35 -> 0.88) and raises Majority@8 to 0.41. The fine-tuned agent also generalizes to an external dataset. On a subset of picoCTF cryptography problems, it improves Pass@8 by +13 pp. Ablations show the gains stem from more reliable tool invocation and code synthesis, rather than superficial prompt adaptation.

Title: Generalization Performance of Ensemble Clustering: From Theory to Algorithm

Authors: Xu Zhang, Haoye Qiu, Weixuan Liang, Hui Liu, Junhui Hou, Yuheng Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02053
Pdf URL: https://arxiv.org/pdf/2506.02053
Copy Paste: [[2506.02053]] Generalization Performance of Ensemble Clustering: From Theory to Algorithm(https://arxiv.org/abs/2506.02053)
Keywords: robust
Abstract: Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive a convergence rate of generalization error bound and excess risk bound both of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than log $n$, i.e., $m,n\to \infty, m\gg \log n$, ensemble clustering is consistent. Furthermore, recognizing that $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to finite clusterings, we minimize the error between the empirical average clusterings and their expectation. From this, we theoretically demonstrate that to achieve better clustering performance, we should minimize the deviation (bias) of base clustering from its expectation and maximize the differences (diversity) among various base clusterings. Additionally, we derive that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at this https URL.

Title: Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?

Authors: Xiang Li, Jiayi Xin, Qi Long, Weijie J. Su
Subjects: cs.CL, cs.IR, cs.LG, stat.AP, stat.ME
Abstract URL: https://arxiv.org/abs/2506.02058
Pdf URL: https://arxiv.org/pdf/2506.02058
Copy Paste: [[2506.02058]] Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?(https://arxiv.org/abs/2506.02058)
Keywords: large language model
Abstract: Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often inconsistently reflect the actual capacities of these models. In this paper, we demonstrate that one of many contributing factors to this \textit{evaluation crisis} is the oversight of unseen knowledge -- information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information retrieval effectiveness, and measuring output diversity. Our experiments reveal that a substantial volume of knowledge is omitted when relying solely on observed LLM performance. Importantly, KnowSum yields significantly different comparative rankings for several common LLMs based on their internal knowledge.

Title: Predicting Blood Type: Assessing Model Performance with ROC Analysis

Authors: Malik A. Altayar, Muhyeeddin Alqaraleh, Mowafaq Salem Alzboon, Wesam T. Almagharbeh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02062
Pdf URL: https://arxiv.org/pdf/2506.02062
Copy Paste: [[2506.02062]] Predicting Blood Type: Assessing Model Performance with ROC Analysis(https://arxiv.org/abs/2506.02062)
Keywords: security, biometric
Abstract: Introduction: Personal identification is a critical aspect of forensic sciences, security, and healthcare. While conventional biometrics systems such as DNA profiling and iris scanning offer high accuracy, they are time-consuming and costly. Objectives: This study investigates the relationship between fingerprint patterns and ABO blood group classification to explore potential correlations between these two traits. Methods: The study analyzed 200 individuals, categorizing their fingerprints into three types: loops, whorls, and arches. Blood group classification was also recorded. Statistical analysis, including chi-square and Pearson correlation tests, was used to assess associations between fingerprint patterns and blood groups. Results: Loops were the most common fingerprint pattern, while blood group O+ was the most prevalent among the participants. Statistical analysis revealed no significant correlation between fingerprint patterns and blood groups (p > 0.05), suggesting that these traits are independent. Conclusions: Although the study showed limited correlation between fingerprint patterns and ABO blood groups, it highlights the importance of future research using larger and more diverse populations, incorporating machine learning approaches, and integrating multiple biometric signals. This study contributes to forensic science by emphasizing the need for rigorous protocols and comprehensive investigations in personal identification.

Title: Privacy-Aware, Public-Aligned: Embedding Risk Detection and Public Values into Scalable Clinical Text De-Identification for Trusted Research Environments

Authors: Arlene Casey, Stuart Dunbar, Franz Gruber, Samuel McInerney, Matúš Falis, Pamela Linksted, Katie Wilde, Kathy Harrison, Alison Hamilton, Christian Cole
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02063
Pdf URL: https://arxiv.org/pdf/2506.02063
Copy Paste: [[2506.02063]] Privacy-Aware, Public-Aligned: Embedding Risk Detection and Public Values into Scalable Clinical Text De-Identification for Trusted Research Environments(https://arxiv.org/abs/2506.02063)
Keywords: privacy
Abstract: Clinical free-text data offers immense potential to improve population health research such as richer phenotyping, symptom tracking, and contextual understanding of patient care. However, these data present significant privacy risks due to the presence of directly or indirectly identifying information embedded in unstructured narratives. While numerous de-identification tools have been developed, few have been tested on real-world, heterogeneous datasets at scale or assessed for governance readiness. In this paper, we synthesise our findings from previous studies examining the privacy-risk landscape across multiple document types and NHS data providers in Scotland. We characterise how direct and indirect identifiers vary by record type, clinical setting, and data flow, and show how changes in documentation practice can degrade model performance over time. Through public engagement, we explore societal expectations around the safe use of clinical free text and reflect these in the design of a prototype privacy-risk management tool to support transparent, auditable decision-making. Our findings highlight that privacy risk is context-dependent and cumulative, underscoring the need for adaptable, hybrid de-identification approaches that combine rule-based precision with contextual understanding. We offer a comprehensive view of the challenges and opportunities for safe, scalable reuse of clinical free-text within Trusted Research Environments and beyond, grounded in both technical evidence and public perspectives on responsible data use.

Title: EWGN: Elastic Weight Generation and Context Switching in Deep Learning

Authors: Shriraj P. Sawant, Krishna P. Miyapuram
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02065
Pdf URL: https://arxiv.org/pdf/2506.02065
Copy Paste: [[2506.02065]] EWGN: Elastic Weight Generation and Context Switching in Deep Learning(https://arxiv.org/abs/2506.02065)
Keywords: generative
Abstract: The ability to learn and retain a wide variety of tasks is a hallmark of human intelligence that has inspired research in artificial general intelligence. Continual learning approaches provide a significant step towards achieving this goal. It has been known that task variability and context switching are challenging for learning in neural networks. Catastrophic forgetting refers to the poor performance on retention of a previously learned task when a new task is being learned. Switching between different task contexts can be a useful approach to mitigate the same by preventing the interference between the varying task weights of the network. This paper introduces Elastic Weight Generative Networks (EWGN) as an idea for context switching between two different tasks. The proposed EWGN architecture uses an additional network that generates the weights of the primary network dynamically while consolidating the weights learned. The weight generation is input-dependent and thus enables context switching. Using standard computer vision datasets, namely MNIST and fashion-MNIST, we analyse the retention of previously learned task representations in Fully Connected Networks, Convolutional Neural Networks, and EWGN architectures with Stochastic Gradient Descent and Elastic Weight Consolidation learning algorithms. Understanding dynamic weight generation and context-switching ability can be useful in enabling continual learning for improved performance.

Title: An Introduction to Flow Matching and Diffusion Models

Authors: Peter Holderrieth, Ezra Erives
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02070
Pdf URL: https://arxiv.org/pdf/2506.02070
Copy Paste: [[2506.02070]] An Introduction to Flow Matching and Diffusion Models(https://arxiv.org/abs/2506.02070)
Keywords: diffusion, generative
Abstract: Diffusion and flow-based models have become the state of the art for generative AI across a wide range of data modalities, including images, videos, shapes, molecules, music, and more! These notes are originally from this https URL, as taught at MIT over the 2025 IAP (winter) term, and are intended to accompany other course content, including lectures and labs. Overall, they function as a self-contained introduction to both flow matching and diffusion models, starting with ordinary and stochastic differential equations, and culminating in flow matching, score matching, classifier-free guidance, and the inner workings of modern, state-of-the-art models for image and video. These notes, and the accompanying course, are ideal for students and practitioners alike who want to develop a principled understanding of the theory and practice of generative AI.

Title: Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition

Authors: Yoonjun Cho, Soeun Kim, Dongjae Jeon, Kyelim Lee, Beomsoo Lee, Albert No
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.02077
Pdf URL: https://arxiv.org/pdf/2506.02077
Copy Paste: [[2506.02077]] Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition(https://arxiv.org/abs/2506.02077)
Keywords: large language model
Abstract: Decomposing weight matrices into quantization and low-rank components ($\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component's unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers' negative impact on quantization, enabling more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.

Title: Robust Federated Learning against Noisy Clients via Masked Optimization

Authors: Xuefeng Jiang, Tian Wen, Zhiqin Yang, Lvhua Wu, Yufeng Chen, Sheng Sun, Yuwei Wang, Min Liu
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02079
Pdf URL: https://arxiv.org/pdf/2506.02079
Copy Paste: [[2506.02079]] Robust Federated Learning against Noisy Clients via Masked Optimization(https://arxiv.org/abs/2506.02079)
Keywords: privacy, robust, federate
Abstract: In recent years, federated learning (FL) has made significant advance in privacy-sensitive applications. However, it can be hard to ensure that FL participants provide well-annotated data for training. The corresponding annotations from different clients often contain complex label noise at varying levels. This label noise issue has a substantial impact on the performance of the trained models, and clients with greater noise levels can be largely attributed for this degradation. To this end, it is necessary to develop an effective optimization strategy to alleviate the adverse effects of these noisy this http URL this study, we present a two-stage optimization framework, MaskedOptim, to address this intricate label noise problem. The first stage is designed to facilitate the detection of noisy clients with higher label noise rates. The second stage focuses on rectifying the labels of the noisy clients' data through an end-to-end label correction mechanism, aiming to mitigate the negative impacts caused by misinformation within datasets. This is achieved by learning the potential ground-truth labels of the noisy clients' datasets via backpropagation. To further enhance the training robustness, we apply the geometric median based model aggregation instead of the commonly-used vanilla averaged model aggregation. We implement sixteen related methods and conduct evaluations on three image datasets and one text dataset with diverse label noise patterns for a comprehensive comparison. Extensive experimental results indicate that our proposed framework shows its robustness in different scenarios. Additionally, our label correction framework effectively enhances the data quality of the detected noisy clients' local datasets. % Our codes will be open-sourced to facilitate related research communities. Our codes are available via this https URL .

Title: RATFM: Retrieval-augmented Time Series Foundation Model for Anomaly Detection

Authors: Chihiro Maru, Shoetsu Sato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02081
Pdf URL: https://arxiv.org/pdf/2506.02081
Copy Paste: [[2506.02081]] RATFM: Retrieval-augmented Time Series Foundation Model for Anomaly Detection(https://arxiv.org/abs/2506.02081)
Keywords: large language model
Abstract: Inspired by the success of large language models (LLMs) in natural language processing, recent research has explored the building of time series foundation models and applied them to tasks such as forecasting, classification, and anomaly detection. However, their performances vary between different domains and tasks. In LLM-based approaches, test-time adaptation using example-based prompting has become common, owing to the high cost of retraining. In the context of anomaly detection, which is the focus of this study, providing normal examples from the target domain can also be effective. However, time series foundation models do not naturally acquire the ability to interpret or utilize examples or instructions, because the nature of time series data used during training does not encourage such capabilities. To address this limitation, we propose a retrieval augmented time series foundation model (RATFM), which enables pretrained time series foundation models to incorporate examples of test-time adaptation. We show that RATFM achieves a performance comparable to that of in-domain fine-tuning while avoiding domain-dependent fine-tuning. Experiments on the UCR Anomaly Archive, a multi-domain dataset including nine domains, confirms the effectiveness of the proposed approach.

Title: Temporal Causal-based Simulation for Realistic Time-series Generation

Authors: Nikolaos Gkorgkolis, Nikolaos Kougioulis, MingXue Wang, Bora Caglayan, Andrea Tonon, Dario Simionato, Ioannis Tsamardinos
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02084
Pdf URL: https://arxiv.org/pdf/2506.02084
Copy Paste: [[2506.02084]] Temporal Causal-based Simulation for Realistic Time-series Generation(https://arxiv.org/abs/2506.02084)
Keywords: robust
Abstract: Causal Discovery plays a pivotal role in revealing relationships among observed variables, particularly in the temporal setup. While the majority of CD methods rely on synthetic data for evaluation, and recently for training, these fall short in accurately mirroring real-world scenarios; an effect even more evident in temporal data. Generation techniques depending on simplified assumptions on causal structure, effects and time, limit the quality and diversity of the simulated data. In this work, we introduce Temporal Causal-based Simulation (TCS), a robust framework for generating realistic time-series data and their associated temporal causal graphs. The approach is structured in three phases: estimating the true lagged causal structure of the data, approximating the functional dependencies between variables and learning the noise distribution of the corresponding causal model, each part of which can be explicitly tailored based on data assumptions and characteristics. Through an extensive evaluation process, we highlight that single detection methods for generated data discrimination prove inadequate, accentuating it as a multifaceted challenge. For this, we detail a Min-max optimization phase that draws on AutoML techniques. Our contributions include a flexible, model-agnostic pipeline for generating realistic temporal causal data, a thorough evaluation setup which enhances the validity of the generated datasets and insights into the challenges posed by realistic data generation. Through experiments involving not only real but also semi-synthetic and purely synthetic datasets, we demonstrate that while sampling realistic causal data remains a complex task, our method enriches the domain of generating sensible causal-based temporal data.

Title: SALAD: Systematic Assessment of Machine Unlearing on LLM-Aided Hardware Design

Authors: Zeng Wang, Minghao Shao, Rupesh Karn, Jitendra Bhandari, Likhitha Mankali, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.02089
Pdf URL: https://arxiv.org/pdf/2506.02089
Copy Paste: [[2506.02089]] SALAD: Systematic Assessment of Machine Unlearing on LLM-Aided Hardware Design(https://arxiv.org/abs/2506.02089)
Keywords: security, large language model
Abstract: Large Language Models (LLMs) offer transformative capabilities for hardware design automation, particularly in Verilog code generation. However, they also pose significant data security challenges, including Verilog evaluation data contamination, intellectual property (IP) design leakage, and the risk of malicious Verilog generation. We introduce SALAD, a comprehensive assessment that leverages machine unlearning to mitigate these threats. Our approach enables the selective removal of contaminated benchmarks, sensitive IP and design artifacts, or malicious code patterns from pre-trained LLMs, all without requiring full retraining. Through detailed case studies, we demonstrate how machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design.

Title: Towards Better Generalization and Interpretability in Unsupervised Concept-Based Models

Authors: Francesco De Santis, Philippe Bich, Gabriele Ciravegna, Pietro Barbiero, Danilo Giordano, Tania Cerquitelli
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02092
Pdf URL: https://arxiv.org/pdf/2506.02092
Copy Paste: [[2506.02092]] Towards Better Generalization and Interpretability in Unsupervised Concept-Based Models(https://arxiv.org/abs/2506.02092)
Keywords: interpretability
Abstract: To increase the trustworthiness of deep neural networks, it is critical to improve the understanding of how they make decisions. This paper introduces a novel unsupervised concept-based model for image classification, named Learnable Concept-Based Model (LCBM) which models concepts as random variables within a Bernoulli latent space. Unlike traditional methods that either require extensive human supervision or suffer from limited scalability, our approach employs a reduced number of concepts without sacrificing performance. We demonstrate that LCBM surpasses existing unsupervised concept-based models in generalization capability and nearly matches the performance of black-box models. The proposed concept representation enhances information retention and aligns more closely with human understanding. A user study demonstrates the discovered concepts are also more intuitive for humans to interpret. Finally, despite the use of concept embeddings, we maintain model interpretability by means of a local linear combination of concepts.

Title: Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Authors: Hyojin Bahng, Caroline Chan, Fredo Durand, Phillip Isola
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02095
Pdf URL: https://arxiv.org/pdf/2506.02095
Copy Paste: [[2506.02095]] Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences(https://arxiv.org/abs/2506.02095)
Keywords: diffusion
Abstract: Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are at this https URL

Title: SAB3R: Semantic-Augmented Backbone in 3D Reconstruction

Authors: Xuweiyi Chen, Tian Xia, Sihan Xu, Jianing Yang, Joyce Chai, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02112
Pdf URL: https://arxiv.org/pdf/2506.02112
Copy Paste: [[2506.02112]] SAB3R: Semantic-Augmented Backbone in 3D Reconstruction(https://arxiv.org/abs/2506.02112)
Keywords: segmentation
Abstract: We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open-vocabulary segmentation - detecting and segmenting object instances based on natural language queries - and 3D reconstruction, the process of estimating a scene's 3D structure from visual inputs. Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries. This task serves as a critical step toward real-world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization. To tackle this task, we introduce a simple yet effective baseline, which we denote as SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy. This method transfers dense, per-pixel semantic features from 2D vision backbones (eg, CLIP and DINOv2) to enhance MASt3R's capabilities. Without introducing any auxiliary frozen networks, our model generates per-pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to separately deploying MASt3R and CLIP, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. Furthermore, we evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.

Title: Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

Authors: Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02126
Pdf URL: https://arxiv.org/pdf/2506.02126
Copy Paste: [[2506.02126]] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains(https://arxiv.org/abs/2506.02126)
Keywords: large language model
Abstract: Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.

Title: Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

Authors: Michael Li, Nishant Subramani
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02132
Pdf URL: https://arxiv.org/pdf/2506.02132
Copy Paste: [[2506.02132]] Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models(https://arxiv.org/abs/2506.02132)
Keywords: transformer, large language model
Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how both classical architectures (BERT, DeBERTa, GPT-2)and contemporary large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) represent lexical identity and inflectional morphology. We train linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. We discover that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout the layers. Further analysis reveals that these models encode inflectional morphology through generalizable abstractions, but rely predominantly on memorization to encode lexical identity. Remarkably, these patterns emerge across all 16 models we test, despite differences in architecture, size, and training regime (including pretrained and instruction-tuned variants). This consistency suggests that, despite substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties could be fundamental for next token prediction and are learned early during pretraining. Our code is available at this https URL.

Title: ReconXF: Graph Reconstruction Attack via Public Feature Explanations on Privatized Node Features and Labels

Authors: Rishi Raj Sahoo, Rucha Bhalchandra Joshi, Subhankar Mishra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02134
Pdf URL: https://arxiv.org/pdf/2506.02134
Copy Paste: [[2506.02134]] ReconXF: Graph Reconstruction Attack via Public Feature Explanations on Privatized Node Features and Labels(https://arxiv.org/abs/2506.02134)
Keywords: privacy, protect, attack, explainability
Abstract: Graph Neural Networks (GNNs) achieve high performance across many applications but function as black-box models, limiting their use in critical domains like healthcare and criminal justice. Explainability methods address this by providing feature-level explanations that identify important node attributes for predictions. These explanations create privacy risks. Combined with auxiliary information, feature explanations can enable adversaries to reconstruct graph structure, exposing sensitive relationships. Existing graph reconstruction attacks assume access to original auxiliary data, but practical systems use differential privacy to protect node features and labels while providing explanations for transparency. We study a threat model where adversaries access public feature explanations along with privatized node features and labels. We show that existing explanation-based attacks like GSEF perform poorly with privatized data due to noise from differential privacy mechanisms. We propose ReconXF, a graph reconstruction attack for scenarios with public explanations and privatized auxiliary data. Our method adapts explanation-based frameworks by incorporating denoising mechanisms that handle differential privacy noise while exploiting structural signals in explanations. Experiments across multiple datasets show ReconXF outperforms SoTA methods in privatized settings, with improvements in AUC and average precision. Results indicate that public explanations combined with denoising enable graph structure recovery even under the privacy protection of auxiliary data. Code is available at (link to be made public after acceptance).

Title: Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

Authors: Yarden Bakish, Itamar Zimerman, Hila Chefer, Lior Wolf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02138
Pdf URL: https://arxiv.org/pdf/2506.02138
Copy Paste: [[2506.02138]] Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability(https://arxiv.org/abs/2506.02138)
Keywords: explainability, transformer
Abstract: The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values based on predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the Transformer architecture: its positional encoding (PE), resulting in violation of the conservation property, and the loss of an important and unique type of relevance, which is also associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This allows us to propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks. Our code is publicly available.

Title: Z-Error Loss for Training Neural Networks

Authors: Guillaume Godin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02154
Pdf URL: https://arxiv.org/pdf/2506.02154
Copy Paste: [[2506.02154]] Z-Error Loss for Training Neural Networks(https://arxiv.org/abs/2506.02154)
Keywords: robust
Abstract: Outliers introduce significant training challenges in neural networks by propagating erroneous gradients, which can degrade model performance and generalization. We propose the Z-Error Loss, a statistically principled approach that minimizes outlier influence during training by masking the contribution of data points identified as out-of-distribution within each batch. This method leverages batch-level statistics to automatically detect and exclude anomalous samples, allowing the model to focus its learning on the true underlying data structure. Our approach is robust, adaptive to data quality, and provides valuable diagnostics for data curation and cleaning.

Title: Mitigating Data Poisoning Attacks to Local Differential Privacy

Authors: Xiaolin Li, Ninghui Li, Boyang Wang, Wenhai Sun
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02156
Pdf URL: https://arxiv.org/pdf/2506.02156
Copy Paste: [[2506.02156]] Mitigating Data Poisoning Attacks to Local Differential Privacy(https://arxiv.org/abs/2506.02156)
Keywords: privacy, defense, attack, steal
Abstract: The distributed nature of local differential privacy (LDP) invites data poisoning attacks and poses unforeseen threats to the underlying LDP-supported applications. In this paper, we propose a comprehensive mitigation framework for popular frequency estimation, which contains a suite of novel defenses, including malicious user detection, attack pattern recognition, and damaged utility recovery. In addition to existing attacks, we explore new adaptive adversarial activities for our mitigation design. For detection, we present a new method to precisely identify bogus reports and thus LDP aggregation can be performed over the ``clean'' data. When the attack behavior becomes stealthy and direct filtering out malicious users is difficult, we further propose a detection that can effectively recognize hidden adversarial patterns, thus facilitating the decision-making of service providers. These detection methods require no additional data and attack information and incur minimal computational cost. Our experiment demonstrates their excellent performance and substantial improvement over previous work in various settings. In addition, we conduct an empirical analysis of LDP post-processing for corrupted data recovery and propose a new post-processing method, through which we reveal new insights into protocol recommendations in practice and key design principles for future research.

Title: HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation

Authors: Amir Hussein, Cihan Xiao, Matthew Wiesner, Dan Povey, Leibny Paola Garcia, Sanjeev Khudanpur
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02157
Pdf URL: https://arxiv.org/pdf/2506.02157
Copy Paste: [[2506.02157]] HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation(https://arxiv.org/abs/2506.02157)
Keywords: robust
Abstract: Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing approaches struggle with word reordering and performance degradation when jointly modeling ASR and ST, resulting in a gap with attention-based encoder-decoder (AED) models. Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. To ensure robust ST while preserving ASR performance, we use self-distillation with CTC consistency regularization. Moreover, we improve computational efficiency by incorporating best practices from ASR transducers, including a down-sampled hierarchical encoder, a stateless predictor, and a pruned transducer loss to reduce training complexity. Finally, we introduce a blank penalty during decoding, reducing deletions and improving translation quality. Our approach is evaluated on three conversational datasets Arabic, Spanish, and Mandarin achieving new state-of-the-art performance among NT models and substantially narrowing the gap with AED-based systems.

Title: TIIF-Bench: How Does Your T2I Model Follow Your Instructions?

Authors: Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02161
Pdf URL: https://arxiv.org/pdf/2506.02161
Copy Paste: [[2506.02161]] TIIF-Bench: How Does Your T2I Model Follow Your Instructions?(https://arxiv.org/abs/2506.02161)
Keywords: robust
Abstract: The rapid advancements of Text-to-Image (T2I) models have ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I model evaluation benchmarks fall short in limited prompt diversity and complexity, as well as coarse evaluation metrics, making it difficult to evaluate the fine-grained alignment performance between textual instructions and generated images. In this paper, we present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability in interpreting and following intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, which are categorized into three levels of difficulties and complexities. To rigorously evaluate model robustness to varying prompt lengths, we provide a short and a long version for each prompt with identical core semantics. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models. In addition, we collect 100 high-quality designer level prompts that encompass various scenarios to comprehensively assess model performance. Leveraging the world knowledge encoded in large vision language models, we propose a novel computable framework to discern subtle variations in T2I model outputs. Through meticulous benchmarking of mainstream T2I models on TIIF-Bench, we analyze the pros and cons of current T2I models and reveal the limitations of current T2I benchmarks. Project Page: this https URL.

Title: Quantifying task-relevant representational similarity using decision variable correlation

Authors: Yu (Eric)Qian, Wilson S. Geisler, Xue-Xin Wei
Subjects: cs.CV, cs.LG, q-bio.NC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.02164
Pdf URL: https://arxiv.org/pdf/2506.02164
Copy Paste: [[2506.02164]] Quantifying task-relevant representational similarity using decision variable correlation(https://arxiv.org/abs/2506.02164)
Keywords: robust
Abstract: Previous studies have compared the brain and deep neural networks trained on image classification. Intriguingly, while some suggest that their representations are highly similar, others argued the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the correlation between decoded decisions on individual samples in a classification task and thus can capture task-relevant information rather than general representational alignment. We evaluate this method using monkey V4/IT recordings and models trained on image classification tasks. We find that model--model similarity is comparable to monkey--monkey similarity, whereas model--monkey similarity is consistently lower and, surprisingly, decreases with increasing ImageNet-1k performance. While adversarial training enhances robustness, it does not improve model--monkey similarity in task-relevant dimensions; however, it markedly increases model--model similarity. Similarly, pre-training on larger datasets does not improve model--monkey similarity. These results suggest a fundamental divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.

Title: Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos

Authors: Aditi Tiwari, Farzaneh Masoud, Dac Trong Nguyen, Jill Kraft, Heng Ji, Klara Nahrstedt
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02167
Pdf URL: https://arxiv.org/pdf/2506.02167
Copy Paste: [[2506.02167]] Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos(https://arxiv.org/abs/2506.02167)
Keywords: robust
Abstract: Modern AI systems struggle most in environments where reliability is critical - scenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360-degree videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating transformation-invariant recognition. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty. The dataset is available at: this https URL.

Title: An Approximation Theory Perspective on Machine Learning

Authors: Hrushikesh N. Mhaskar, Efstratios Tsoukanis, Ameya D. Jagtap
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02168
Pdf URL: https://arxiv.org/pdf/2506.02168
Copy Paste: [[2506.02168]] An Approximation Theory Perspective on Machine Learning(https://arxiv.org/abs/2506.02168)
Keywords: transformer
Abstract: A central problem in machine learning is often formulated as follows: Given a dataset $\{(x_j, y_j)\}_{j=1}^M$, which is a sample drawn from an unknown probability distribution, the goal is to construct a functional model $f$ such that $f(x) \approx y$ for any $(x, y)$ drawn from the same distribution. Neural networks and kernel-based methods are commonly employed for this task due to their capacity for fast and parallel computation. The approximation capabilities, or expressive power, of these methods have been extensively studied over the past 35 years. In this paper, we will present examples of key ideas in this area found in the literature. We will discuss emerging trends in machine learning including the role of shallow/deep networks, approximation on manifolds, physics-informed neural surrogates, neural operators, and transformer architectures. Despite function approximation being a fundamental problem in machine learning, approximation theory does not play a central role in the theoretical foundations of the field. One unfortunate consequence of this disconnect is that it is often unclear how well trained models will generalize to unseen or unlabeled data. In this review, we examine some of the shortcomings of the current machine learning framework and explore the reasons for the gap between approximation theory and machine learning practice. We will then introduce our novel research to achieve function approximation on unknown manifolds without the need to learn specific manifold features, such as the eigen-decomposition of the Laplace-Beltrami operator or atlas construction. In many machine learning problems, particularly classification tasks, the labels $y_j$ are drawn from a finite set of values.

Title: Different Speech Translation Models Encode and Translate Speaker Gender Differently

Authors: Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli, Andre Martins, Giuseppe Attanasio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02172
Pdf URL: https://arxiv.org/pdf/2506.02172
Copy Paste: [[2506.02172]] Different Speech Translation Models Encode and Translate Speaker Gender Differently(https://arxiv.org/abs/2506.02172)
Keywords: interpretability
Abstract: Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker's gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures -- integrating a speech encoder with a machine translation system via adapters -- do not. We also demonstrate that low gender encoding capabilities result in systems' tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.

Title: Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution

Authors: Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02181
Pdf URL: https://arxiv.org/pdf/2506.02181
Copy Paste: [[2506.02181]] Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution(https://arxiv.org/abs/2506.02181)
Keywords: robust, interpretability
Abstract: Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels' full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.

Title: KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

Authors: Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, Fei Mi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.02208
Pdf URL: https://arxiv.org/pdf/2506.02208
Copy Paste: [[2506.02208]] KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning(https://arxiv.org/abs/2506.02208)
Keywords: large language model
Abstract: Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present \textbf{KDRL}, a \textit{unified post-training framework} that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected rule-based rewards. We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance. Empirical results on multiple reasoning benchmarks demonstrate that KDRL outperforms GRPO and various KD baselines while achieving a favorable balance between performance and reasoning token efficiency. These findings indicate that integrating KD and RL serves as an effective and efficient strategy to train reasoning LLMs.

Title: Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics

Authors: Ella Rannon, David Burstein
Subjects: cs.CL, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2506.02212
Pdf URL: https://arxiv.org/pdf/2506.02212
Copy Paste: [[2506.02212]] Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics(https://arxiv.org/abs/2506.02212)
Keywords: transformer
Abstract: Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.

Title: Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

Authors: Johannes Schusterbauer, Ming Gui, Frank Fundel, Björn Ommer
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02221
Pdf URL: https://arxiv.org/pdf/2506.02221
Copy Paste: [[2506.02221]] Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment(https://arxiv.org/abs/2506.02221)
Keywords: diffusion, generative
Abstract: Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at this https URL.

Title: VLCD: Vision-Language Contrastive Distillation for Accurate and Efficient Automatic Placenta Analysis

Authors: Manas Mehta, Yimu Pan, Kelly Gallagher, Alison D. Gernand, Jeffery A. Goldstein, Delia Mwinyelle, Leena Mithal, James Z. Wang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02229
Pdf URL: https://arxiv.org/pdf/2506.02229
Copy Paste: [[2506.02229]] VLCD: Vision-Language Contrastive Distillation for Accurate and Efficient Automatic Placenta Analysis(https://arxiv.org/abs/2506.02229)
Keywords: robust
Abstract: Pathological examination of the placenta is an effective method for detecting and mitigating health risks associated with childbirth. Recent advancements in AI have enabled the use of photographs of the placenta and pathology reports for detecting and classifying signs of childbirth-related pathologies. However, existing automated methods are computationally extensive, which limits their deployability. We propose two modifications to vision-language contrastive learning (VLC) frameworks to enhance their accuracy and efficiency: (1) text-anchored vision-language contrastive knowledge distillation (VLCD)-a new knowledge distillation strategy for medical VLC pretraining, and (2) unsupervised predistillation using a large natural images dataset for improved initialization. Our approach distills efficient neural networks that match or surpass the teacher model in performance while achieving model compression and acceleration. Our results showcase the value of unsupervised predistillation in improving the performance and robustness of our approach, specifically for lower-quality images. VLCD serves as an effective way to improve the efficiency and deployability of medical VLC approaches, making AI-based healthcare solutions more accessible, especially in resource-constrained environments.

Title: From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models

Authors: Yihong Tang, Ao Qu, Xujing Yu, Weipeng Deng, Jun Ma, Jinhua Zhao, Lijun Sun
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2506.02242
Pdf URL: https://arxiv.org/pdf/2506.02242
Copy Paste: [[2506.02242]] From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models(https://arxiv.org/abs/2506.02242)
Keywords: interpretability, large language model
Abstract: Urban and transportation research has long sought to uncover statistically meaningful relationships between key variables and societal outcomes such as road safety, to generate actionable insights that guide the planning, development, and renewal of urban and transportation systems. However, traditional workflows face several key challenges: (1) reliance on human experts to propose hypotheses, which is time-consuming and prone to confirmation bias; (2) limited interpretability, particularly in deep learning approaches; and (3) underutilization of unstructured data that can encode critical urban context. Given these limitations, we propose a Multimodal Large Language Model (MLLM)-based approach for interpretable hypothesis inference, enabling the automated generation, evaluation, and refinement of hypotheses concerning urban context and road safety outcomes. Our method leverages MLLMs to craft safety-relevant questions for street view images (SVIs), extract interpretable embeddings from their responses, and apply them in regression-based statistical models. UrbanX supports iterative hypothesis testing and refinement, guided by statistical evidence such as coefficient significance, thereby enabling rigorous scientific discovery of previously overlooked correlations between urban design and safety. Experimental evaluations on Manhattan street segments demonstrate that our approach outperforms pretrained deep learning models while offering full interpretability. Beyond road safety, UrbanX can serve as a general-purpose framework for urban scientific discovery, extracting structured insights from unstructured urban data across diverse socioeconomic and environmental outcomes. This approach enhances model trustworthiness for policy applications and establishes a scalable, statistically grounded pathway for interpretable knowledge discovery in urban and transportation studies.

Title: Motion aware video generative model

Authors: Bowen Xue, Giuseppe Claudio Guarnera, Shuang Zhao, Zahra Montazeri
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02244
Pdf URL: https://arxiv.org/pdf/2506.02244
Copy Paste: [[2506.02244]] Motion aware video generative model(https://arxiv.org/abs/2506.02244)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion-based video generation have yielded unprecedented quality in visual content and semantic coherence. However, current approaches predominantly rely on statistical learning from vast datasets without explicitly modeling the underlying physics of motion, resulting in subtle yet perceptible non-physical artifacts that diminish the realism of generated videos. This paper introduces a physics-informed frequency domain approach to enhance the physical plausibility of generated videos. We first conduct a systematic analysis of the frequency-domain characteristics of diverse physical motions (translation, rotation, scaling), revealing that each motion type exhibits distinctive and identifiable spectral signatures. Building on this theoretical foundation, we propose two complementary components: (1) a physical motion loss function that quantifies and optimizes the conformity of generated videos to ideal frequency-domain motion patterns, and (2) a frequency domain enhancement module that progressively learns to adjust video features to conform to physical motion constraints while preserving original network functionality through a zero-initialization strategy. Experiments across multiple video diffusion architectures demonstrate that our approach significantly enhances motion quality and physical plausibility without compromising visual quality or semantic alignment. Our frequency-domain physical motion framework generalizes effectively across different video generation architectures, offering a principled approach to incorporating physical constraints into deep learning-based video synthesis pipelines. This work seeks to establish connections between data-driven models and physics-based motion models.

Title: PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss

Authors: Yu Wang, Juhyung Ha, David J. Crandall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02247
Pdf URL: https://arxiv.org/pdf/2506.02247
Copy Paste: [[2506.02247]] PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss(https://arxiv.org/abs/2506.02247)
Keywords: robust
Abstract: Active speaker detection (ASD) in egocentric videos presents unique challenges due to unstable viewpoints, motion blur, and off-screen speech sources - conditions under which traditional visual-centric methods degrade significantly. We introduce PAIR-Net (Pretrained Audio-Visual Integration with Regularization Network), an effective model that integrates a partially frozen Whisper audio encoder with a fine-tuned AV-HuBERT visual backbone to robustly fuse cross-modal cues. To counteract modality imbalance, we introduce an inter-modal alignment loss that synchronizes audio and visual representations, enabling more consistent convergence across modalities. Without relying on multi-speaker context or ideal frontal views, PAIR-Net achieves state-of-the-art performance on the Ego4D ASD benchmark with 76.6% mAP, surpassing LoCoNet and STHG by 8.2% and 12.9% mAP, respectively. Our results highlight the value of pretrained audio priors and alignment-based fusion for robust ASD under real-world egocentric conditions.

Title: Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction

Authors: Samuel Li, Pujith Kachana, Prajwal Chidananda, Saurabh Nair, Yasutaka Furukawa, Matthew Brown
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02265
Pdf URL: https://arxiv.org/pdf/2506.02265
Copy Paste: [[2506.02265]] Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction(https://arxiv.org/abs/2506.02265)
Keywords: robust
Abstract: Estimating agent pose and 3D scene structure from multi-camera rigs is a central task in embodied AI applications such as autonomous driving. Recent learned approaches such as DUSt3R have shown impressive performance in multiview settings. However, these models treat images as unstructured collections, limiting effectiveness in scenarios where frames are captured from synchronized rigs with known or inferable structure. To this end, we introduce Rig3R, a generalization of prior multiview reconstruction models that incorporates rig structure when available, and learns to infer it when not. Rig3R conditions on optional rig metadata including camera ID, time, and rig poses to develop a rig-aware latent space that remains robust to missing information. It jointly predicts pointmaps and two types of raymaps: a pose raymap relative to a global frame, and a rig raymap relative to a rig-centric frame consistent across time. Rig raymaps allow the model to infer rig structure directly from input images when metadata is missing. Rig3R achieves state-of-the-art performance in 3D reconstruction, camera pose estimation, and rig discovery, outperforming both traditional and learned methods by 17-45% mAA across diverse real-world rig datasets, all in a single forward pass without post-processing or iterative refinement.

Title: Latent Stochastic Interpolants

Authors: Saurabh Singh, Dmitry Lagun
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02276
Pdf URL: https://arxiv.org/pdf/2506.02276
Copy Paste: [[2506.02276]] Latent Stochastic Interpolants(https://arxiv.org/abs/2506.02276)
Keywords: diffusion, generative
Abstract: Stochastic Interpolants (SI) are a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, their use in jointly optimized latent variable models remains unexplored as they require direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI) enabling joint learning in a latent space with end-to-end optimized encoder, decoder and latent SI models. We achieve this by developing a principled Evidence Lower Bound (ELBO) objective derived directly in continuous time. The joint optimization allows LSI to learn effective latent representations along with a generative process that transforms an arbitrary prior distribution into the encoder-defined aggregated posterior. LSI sidesteps the simple priors of the normal diffusion models and mitigates the computational demands of applying SI directly in high-dimensional observation spaces, while preserving the generative flexibility of the SI framework. We demonstrate the efficacy of LSI through comprehensive experiments on the standard large scale ImageNet generation benchmark.

Title: Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals

Authors: Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02281
Pdf URL: https://arxiv.org/pdf/2506.02281
Copy Paste: [[2506.02281]] Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals(https://arxiv.org/abs/2506.02281)
Keywords: large language model
Abstract: Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies exhibit limitations by neglecting the intrinsic learning signals generated by the model itself, thus leading to suboptimal training regimes. In this paper, we identify a model-inherent signal termed angle concentration that effectively reflects an LLM's capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)'s efficient sampling yields data-efficient training, achieving better performance with half the original data compared to vanilla GRPO with full training data. Code is realsed at this https URL.

Title: Why Gradients Rapidly Increase Near the End of Training

Authors: Aaron Defazio
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02285
Pdf URL: https://arxiv.org/pdf/2506.02285
Copy Paste: [[2506.02285]] Why Gradients Rapidly Increase Near the End of Training(https://arxiv.org/abs/2506.02285)
Keywords: large language model
Abstract: During long-duration Large Language Model (LLM) training runs the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.

Title: Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation

Authors: Niclas Popp, Kevin Alexander Laube, Matthias Hein, Lukas Schott
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02294
Pdf URL: https://arxiv.org/pdf/2506.02294
Copy Paste: [[2506.02294]] Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation(https://arxiv.org/abs/2506.02294)
Keywords: robust, diffusion
Abstract: Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst group and mean group accuracy on CelebA and SpuCo Birds as well as the spurious mAUC on spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines

Title: QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Authors: Ahmed Wasfy, Omer Nacar, Abdelakreem Elkhateb, Mahmoud Reda, Omar Elshehy, Adel Ammar, Wadii Boulila
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02295
Pdf URL: https://arxiv.org/pdf/2506.02295
Copy Paste: [[2506.02295]] QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation(https://arxiv.org/abs/2506.02295)
Keywords: large language model
Abstract: The inherent complexities of Arabic script; its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.

Title: LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

Authors: Thai Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S Ryoo, Chien-Sheng Wu, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02298
Pdf URL: https://arxiv.org/pdf/2506.02298
Copy Paste: [[2506.02298]] LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback(https://arxiv.org/abs/2506.02298)
Keywords: large language model
Abstract: Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-steps tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3\% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting LAM SIMULATOR's efficiency and effectiveness in speeding up development of AI agents.

Title: Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation

Authors: Farzaneh Mahdisoltani, Saeed Mahdisoltani, Roger B. Grosse, David J. Fleet
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02300
Pdf URL: https://arxiv.org/pdf/2506.02300
Copy Paste: [[2506.02300]] Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation(https://arxiv.org/abs/2506.02300)
Keywords: interpretability
Abstract: Understanding the internal representations and decision mechanisms of deep neural networks remains a critical open challenge. While existing interpretability methods often identify influential input regions, they may not elucidate how a model distinguishes between classes or what specific changes would transition an input from one category to another. To address these limitations, we propose a novel framework that visualizes the implicit path between classes by treating the network gradient as a form of infinitesimal motion. Drawing inspiration from phase-based motion magnification, we first decompose images using invertible transforms-specifically the Complex Steerable Pyramid-then compute class-conditional gradients in the transformed space. Rather than iteratively integrating the gradient to trace a full path, we amplify the one-step gradient to the input and perform a linear extrapolation to expose how the model moves from source to target class. By operating in the steerable pyramid domain, these amplified gradients produce semantically meaningful, spatially coherent morphs that highlight the classifier's most sensitive directions, giving insight into the geometry of its decision boundaries. Experiments on both synthetic and real-world datasets demonstrate that our phase-focused extrapolation yields perceptually aligned, semantically meaningful transformations, offering a novel, interpretable lens into neural classifiers' internal representations.

Title: Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments

Authors: Russell Scheinberg, Ameeta Agrawal, Amber Shore, So Young Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02302
Pdf URL: https://arxiv.org/pdf/2506.02302
Copy Paste: [[2506.02302]] Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments(https://arxiv.org/abs/2506.02302)
Keywords: large language model
Abstract: Large language models (LLMs) can explain grammatical rules, yet they often fail to apply those rules when judging sentence acceptability. We present "grammar prompting", an explain-then-process paradigm: a large LLM first produces a concise explanation of the relevant syntactic phenomenon, then that explanation is fed back as additional context to the target model -- either an LLM or a smaller language model (SLM) -- before deciding which sentence of a minimal pair is grammatical. On the English BLiMP, Chinese SLING, and Russian RuBLiMP benchmarks, this simple prompt design yields substantial improvements over strong baselines across many syntactic phenomena. Feeding an LLM's metalinguistic explanation back to the target model bridges the gap between knowing a rule and using it. On SLMs, grammar prompting alone trims the average LLM-SLM accuracy gap by about 20%, and when paired with chain-of-thought, by 56% (13.0 pp -> 5.8 pp), all at negligible cost. The lightweight, language-agnostic cue lets low-cost SLMs approach frontier-LLM performance in multilingual settings.

Title: Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models

Authors: Yuchen Liang, Renxiang Huang, Lifeng Lai, Ness Shroff, Yingbin Liang
Subjects: cs.LG, eess.SP, math.ST
Abstract URL: https://arxiv.org/abs/2506.02318
Pdf URL: https://arxiv.org/pdf/2506.02318
Copy Paste: [[2506.02318]] Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models(https://arxiv.org/abs/2506.02318)
Keywords: diffusion
Abstract: Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the $\tau$-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need of early-stopping.

Title: Quantifying Misattribution Unfairness in Authorship Attribution

Authors: Pegah Alipoormolabashi, Ajay Patel, Niranjan Balasubramanian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02321
Pdf URL: https://arxiv.org/pdf/2506.02321
Copy Paste: [[2506.02321]] Quantifying Misattribution Unfairness in Authorship Attribution(https://arxiv.org/abs/2506.02321)
Keywords: fair
Abstract: Authorship misattribution can have profound consequences in real life. In forensic settings simply being considered as one of the potential authors of an evidential piece of text or communication can result in undesirable scrutiny. This raises a fairness question: Is every author in the candidate pool at equal risk of misattribution? Standard evaluation measures for authorship attribution systems do not explicitly account for this notion of fairness. We introduce a simple measure, Misattribution Unfairness Index (MAUIk), which is based on how often authors are ranked in the top k for texts they did not write. Using this measure we quantify the unfairness of five models on two different datasets. All models exhibit high levels of unfairness with increased risks for some authors. Furthermore, we find that this unfairness relates to how the models embed the authors as vectors in the latent search space. In particular, we observe that the risk of misattribution is higher for authors closer to the centroid (or center) of the embedded authors in the haystack. These results indicate the potential for harm and the need for communicating with and calibrating end users on misattribution risk when building and providing such models for downstream use.

Title: Are Crypto Ecosystems (De)centralizing? A Framework for Longitudinal Analysis

Authors: Harang Ju, Ehsan Valavi, Madhav Kumar, Sinan Aral
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02324
Pdf URL: https://arxiv.org/pdf/2506.02324
Copy Paste: [[2506.02324]] Are Crypto Ecosystems (De)centralizing? A Framework for Longitudinal Analysis(https://arxiv.org/abs/2506.02324)
Keywords: attack
Abstract: Blockchain technology relies on decentralization to resist faults and attacks while operating without trusted intermediaries. Although industry experts have touted decentralization as central to their promise and disruptive potential, it is still unclear whether the crypto ecosystems built around blockchains are becoming more or less decentralized over time. As crypto plays an increasing role in facilitating economic transactions and peer-to-peer interactions, measuring their decentralization becomes even more essential. We thus propose a systematic framework for measuring the decentralization of crypto ecosystems over time and compare commonly used decentralization metrics. We applied this framework to seven prominent crypto ecosystems, across five distinct subsystems and across their lifetime for over 15 years. Our analysis revealed that while crypto has largely become more decentralized over time, recent trends show a shift toward centralization in the consensus layer, NFT marketplaces, and developers. Our framework and results inform researchers, policymakers, and practitioners about the design, regulation, and implementation of crypto ecosystems and provide a systematic, replicable foundation for future studies.

Title: Something Just Like TRuST : Toxicity Recognition of Span and Target

Authors: Berk Atil, Namrata Sureddy, Rebecca J. Passonneau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02326
Pdf URL: https://arxiv.org/pdf/2506.02326
Copy Paste: [[2506.02326]] Something Just Like TRuST : Toxicity Recognition of Span and Target(https://arxiv.org/abs/2506.02326)
Keywords: extraction, large language model
Abstract: Toxicity in online content, including content generated by language models, has become a critical concern due to its potential for negative psychological and social impact. This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection that merges existing datasets, and has labels for toxicity, target social group, and toxic spans. It includes a diverse range of target groups such as ethnicity, gender, religion, disability, and politics, with both human/machine-annotated and human machine-generated data. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups. Further, reasoning capabilities do not significantly improve performance, indicating that LLMs have weak social reasoning skills.

Title: Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning

Authors: Yijun Yang, Zhao-Yang Wang, Qiuping Liu, Shuwen Sun, Kang Wang, Rama Chellappa, Zongwei Zhou, Alan Yuille, Lei Zhu, Yu-Dong Zhang, Jieneng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02327
Pdf URL: https://arxiv.org/pdf/2506.02327
Copy Paste: [[2506.02327]] Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning(https://arxiv.org/abs/2506.02327)
Keywords: generative
Abstract: Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that visually predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting F1-score in selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as the second readers.

Title: Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection

Authors: Herun Wan, Jiaying Wu, Minnan Luo, Zhi Zeng, Zhixiong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02350
Pdf URL: https://arxiv.org/pdf/2506.02350
Copy Paste: [[2506.02350]] Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection(https://arxiv.org/abs/2506.02350)
Keywords: robust, large language model
Abstract: Misinformation detection models often rely on superficial cues (i.e., \emph{shortcuts}) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation through simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose SMF, an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, encouraging models to rely on deeper semantic understanding rather than shortcut cues. To promote the development of misinformation detectors, we have published the resources publicly at this https URL.

Title: RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models

Authors: Junjie Li, Nan Zhang, Xiaoyang Qu, Kai Lu, Guokuan Li, Jiguang Wan, Jianzong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02354
Pdf URL: https://arxiv.org/pdf/2506.02354
Copy Paste: [[2506.02354]] RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models(https://arxiv.org/abs/2506.02354)
Keywords: segmentation
Abstract: Object Navigation (ObjectNav) is a fundamental task in embodied artificial intelligence. Although significant progress has been made in semantic map construction and target direction prediction in current research, redundant exploration and exploration failures remain inevitable. A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We observe a diminishing marginal effect between exploration steps and exploration rates and analyze the cost-benefit relationship of exploration. Inspired by this, we propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and region-Based exploration estimation algorithm for exploration rate calculation. By leveraging the visual question answering capabilities of visual language models (VLMs) and exploration rates enables efficient this http URL-Nav achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset. And on the more challenging MP3D dataset, RATE-Nav shows approximately 10% improvement over previous zero-shot methods.

Title: Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Authors: Andre He, Daniel Fried, Sean Welleck
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02355
Pdf URL: https://arxiv.org/pdf/2506.02355
Copy Paste: [[2506.02355]] Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening(https://arxiv.org/abs/2506.02355)
Keywords: large language model
Abstract: Reinforcement learning has emerged as an effective framework for training large language models on structured language-conditioned tasks. We identify a critical flaw of Group Relative Policy Optimization (GRPO), a widely used RL algorithm in this setting. For tasks that require multi-sample performance, such as formal theorem proving, GRPO biasedly reinforces already probable solutions and neglects rare but correct proofs. This implicit bias impairs performance on pass@$N$ metrics at large sample sizes, limiting its practicality for training theorem provers. To address this, we introduce the unlikeliness reward, a straightforward method that explicitly encourages reinforcing rare correct solutions. Additionally, we find that increasing the number of PPO epochs further mitigates this bias. Our experiments confirm that incorporating the unlikeliness reward significantly improves pass@$N$ across a large range of N, outperforming standard GRPO and substantially increasing sample diversity. Applying our revised recipe to Lean, we achieve competitive performance with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation, providing a simple yet effective recipe for training formal theorem provers with RL.

Title: InterRVOS: Interaction-aware Referring Video Object Segmentation

Authors: Woojeong Jin, Seongchan Kim, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02356
Pdf URL: https://arxiv.org/pdf/2506.02356
Copy Paste: [[2506.02356]] InterRVOS: Interaction-aware Referring Video Object Segmentation(https://arxiv.org/abs/2506.02356)
Keywords: segmentation
Abstract: Referring video object segmentation aims to segment the object in a video corresponding to a given natural language expression. While prior works have explored various referring scenarios, including motion-centric or multi-instance expressions, most approaches still focus on localizing a single target object in isolation. However, in comprehensive video understanding, an object's role is often defined by its interactions with other entities, which are largely overlooked in existing datasets and models. In this work, we introduce Interaction-aware referring video object sgementation (InterRVOS), a new task that requires segmenting both actor and target entities involved in an interaction. Each interactoin is described through a pair of complementary expressions from different semantic perspectives, enabling fine-grained modeling of inter-object relationships. To tackle this task, we propose InterRVOS-8K, the large-scale and automatically constructed dataset containing diverse interaction-aware expressions with corresponding masks, including challenging cases such as motion-only multi-instance expressions. We also present a baseline architecture, ReVIOSa, designed to handle actor-target segmentation from a single expression, achieving strong performance in both standard and interaction-focused settings. Furthermore, we introduce an actor-target-aware evalaution setting that enables a more targeted assessment of interaction understanding. Experimental results demonstrate that our approach outperforms prior methods in modeling complex object interactions for referring video object segmentation task, establishing a strong foundation for future research in interaction-centric video understanding. Our project page is available at \href{this https URL}{this https URL}.

Title: RoadFormer : Local-Global Feature Fusion for Road Surface Classification in Autonomous Driving

Authors: Tianze Wang, Zhang Zhang, Chao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02358
Pdf URL: https://arxiv.org/pdf/2506.02358
Copy Paste: [[2506.02358]] RoadFormer : Local-Global Feature Fusion for Road Surface Classification in Autonomous Driving(https://arxiv.org/abs/2506.02358)
Keywords: extraction, transformer
Abstract: The classification of the type of road surface (RSC) aims to utilize pavement features to identify the roughness, wet and dry conditions, and material information of the road surface. Due to its ability to effectively enhance road safety and traffic management, it has received widespread attention in recent years. In autonomous driving, accurate RSC allows vehicles to better understand the road environment, adjust driving strategies, and ensure a safer and more efficient driving experience. For a long time, vision-based RSC has been favored. However, existing visual classification methods have overlooked the exploration of fine-grained classification of pavement types (such as similar pavement textures). In this work, we propose a pure vision-based fine-grained RSC method for autonomous driving scenarios, which fuses local and global feature information through the stacking of convolutional and transformer modules. We further explore the stacking strategies of local and global feature extraction modules to find the optimal feature extraction strategy. In addition, since fine-grained tasks also face the challenge of relatively large intra-class differences and relatively small inter-class differences, we propose a Foreground-Background Module (FBM) that effectively extracts fine-grained context features of the pavement, enhancing the classification ability for complex pavements. Experiments conducted on a large-scale pavement dataset containing one million samples and a simplified dataset reorganized from this dataset achieved Top-1 classification accuracies of 92.52% and 96.50%, respectively, improving by 5.69% to 12.84% compared to SOTA methods. These results demonstrate that RoadFormer outperforms existing methods in RSC tasks, providing significant progress in improving the reliability of pavement perception in autonomous driving systems.

Title: MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models

Authors: Xueqi Cheng, Minxing Zheng, Shixiang Zhu, Yushun Dong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02362
Pdf URL: https://arxiv.org/pdf/2506.02362
Copy Paste: [[2506.02362]] MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models(https://arxiv.org/abs/2506.02362)
Keywords: protect, defense, attack, robust, extraction
Abstract: Model extraction attacks aim to replicate the functionality of a black-box model through query access, threatening the intellectual property (IP) of machine-learning-as-a-service (MLaaS) providers. Defending against such attacks is challenging, as it must balance efficiency, robustness, and utility preservation in the real-world scenario. Despite the recent advances, most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs. However, this assumption is increasingly unreliable, as modern models are trained on diverse datasets and attackers often operate under limited query budgets. As a result, the effectiveness of these defenses is significantly compromised in realistic deployment scenarios. To address this gap, we propose MISLEADER (enseMbles of dIStiLled modEls Against moDel ExtRaction), a novel defense strategy that does not rely on OOD assumptions. MISLEADER formulates model protection as a bilevel optimization problem that simultaneously preserves predictive fidelity on benign inputs and reduces extractability by potential clone models. Our framework combines data augmentation to simulate attacker queries with an ensemble of heterogeneous distilled models to improve robustness and diversity. We further provide a tractable approximation algorithm and derive theoretical error bounds to characterize defense effectiveness. Extensive experiments across various settings validate the utility-preserving and extraction-resistant properties of our proposed defense strategy. Our code is available at this https URL.

Title: A TRPCA-Inspired Deep Unfolding Network for Hyperspectral Image Denoising via Thresholded t-SVD and Top-K Sparse Transformer

Authors: Liang Li, Jianli Zhao, Sheng Fang, Siyu Chen, Hui Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02364
Pdf URL: https://arxiv.org/pdf/2506.02364
Copy Paste: [[2506.02364]] A TRPCA-Inspired Deep Unfolding Network for Hyperspectral Image Denoising via Thresholded t-SVD and Top-K Sparse Transformer(https://arxiv.org/abs/2506.02364)
Keywords: robust, interpretability, transformer
Abstract: Hyperspectral images (HSIs) are often degraded by complex mixed noise during acquisition and transmission, making effective denoising essential for subsequent analysis. Recent hybrid approaches that bridge model-driven and data-driven paradigms have shown great promise. However, most of these approaches lack effective alternation between different priors or modules, resulting in loosely coupled regularization and insufficient exploitation of their complementary strengths. Inspired by tensor robust principal component analysis (TRPCA), we propose a novel deep unfolding network (DU-TRPCA) that enforces stage-wise alternation between two tightly integrated modules: low-rank and sparse. The low-rank module employs thresholded tensor singular value decomposition (t-SVD), providing a widely adopted convex surrogate for tensor low-rankness and has been demonstrated to effectively capture the global spatial-spectral structure of HSIs. The Top-K sparse transformer module adaptively imposes sparse constraints, directly matching the sparse regularization in TRPCA and enabling effective removal of localized outliers and complex noise. This tightly coupled architecture preserves the stage-wise alternation between low-rank approximation and sparse refinement inherent in TRPCA, while enhancing representational capacity through attention mechanisms. Extensive experiments on synthetic and real-world HSIs demonstrate that DU-TRPCA surpasses state-of-the-art methods under severe mixed noise, while offering interpretability benefits and stable denoising dynamics inspired by iterative optimization. Code is available at this https URL.

Title: Approximate Borderline Sampling using Granular-Ball for Classification Tasks

Authors: Qin Xie, Qinghua Zhang, Shuyin Xia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02366
Pdf URL: https://arxiv.org/pdf/2506.02366
Copy Paste: [[2506.02366]] Approximate Borderline Sampling using Granular-Ball for Classification Tasks(https://arxiv.org/abs/2506.02366)
Keywords: robust, diffusion
Abstract: Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, the sampling method based on granular-ball (GB) has shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents GB overlaps by constrained expansion, preserving precise geometric representation of GBs via redefined ones. Second, based on the concept of heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs outstandingly on class noise datasets without the need for an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at this https URL.

Title: ViTNF: Leveraging Neural Fields to Boost Vision Transformers in Generalized Category Discovery

Authors: Jiayi Su, Dequan Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02367
Pdf URL: https://arxiv.org/pdf/2506.02367
Copy Paste: [[2506.02367]] ViTNF: Leveraging Neural Fields to Boost Vision Transformers in Generalized Category Discovery(https://arxiv.org/abs/2506.02367)
Keywords: transformer
Abstract: Generalized category discovery (GCD) is a highly popular task in open-world recognition, aiming to identify unknown class samples using known class data. By leveraging pre-training, meta-training, and fine-tuning, ViT achieves excellent few-shot learning capabilities. Its MLP head is a feedforward network, trained synchronously with the entire network in the same process, increasing the training cost and difficulty without fully leveraging the power of the feature extractor. This paper proposes a new architecture by replacing the MLP head with a neural field-based one. We first present a new static neural field function to describe the activity distribution of the neural field and then use two static neural field functions to build an efficient few-shot classifier. This neural field-based (NF) classifier consists of two coupled static neural fields. It stores the feature information of support samples by its elementary field, the known categories by its high-level field, and the category information of support samples by its cross-field connections. We replace the MLP head with the proposed NF classifier, resulting in a novel architecture ViTNF, and simplify the three-stage training mode by pre-training the feature extractor on source tasks and training the NF classifier with support samples in meta-testing separately, significantly reducing ViT's demand for training samples and the difficulty of model training. To enhance the model's capability in identifying new categories, we provide an effective algorithm to determine the lateral interaction scale of the elementary field. Experimental results demonstrate that our model surpasses existing state-of-the-art methods on CIFAR-100, ImageNet-100, CUB-200, and Standard Cars, achieving dramatic accuracy improvements of 19\% and 16\% in new and all classes, respectively, indicating a notable advantage in GCD.

Title: Reconciling Hessian-Informed Acceleration and Scalar-Only Communication for Efficient Federated Zeroth-Order Fine-Tuning

Authors: Zhe Li, Bicheng Ying, Zidong Liu, Chaosheng Dong, Haibo Yang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2506.02370
Pdf URL: https://arxiv.org/pdf/2506.02370
Copy Paste: [[2506.02370]] Reconciling Hessian-Informed Acceleration and Scalar-Only Communication for Efficient Federated Zeroth-Order Fine-Tuning(https://arxiv.org/abs/2506.02370)
Keywords: federate, large language model
Abstract: Recent dimension-free communication frameworks in Federated Learning (FL), such as DeComFL, significantly reduce per-round communication by transmitting only scalars via zeroth-order stochastic gradient descent (ZO-SGD). This method is particularly advantageous for federated fine-tuning of Large Language Models (LLMs). Yet, the high variance in ZO gradient estimation typically leads to slow convergence. Although leveraging Hessian information is known to enhance optimization speed, integrating this into FL presents significant challenges. These include clients' restrictions on local data and the critical need to maintain the dimension-free communication property. To overcome this limitation, we first introduce a generalized scalar-only communication FL framework that decouples dimension-free communication from standard ZO-SGD, enabling the integration of more advanced optimization strategies. Building on this framework, we propose HiSo, a fast federated fine-tuning method via Hessian-informed zeroth-order optimization and Scalar-only communication. Specifically, it leverages global curvature information to accelerate convergence while preserving the same minimal communication cost per round. Theoretically, we establish convergence guarantees that are independent of the global Lipschitz constant, and further show that HiSo achieves faster rates when the global Hessian exhibits a low effective rank -- a common phenomenon in LLMs. Extensive experiments on benchmark datasets and LLM fine-tuning tasks confirm that HiSo significantly outperforms existing ZO-based FL methods in both convergence speed and communication efficiency.

Title: SFBD Flow: A Continuous-Optimization Framework for Training Diffusion Models with Noisy Samples

Authors: Haoye Lu, Darren Lo, Yaoliang Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02371
Pdf URL: https://arxiv.org/pdf/2506.02371
Copy Paste: [[2506.02371]] SFBD Flow: A Continuous-Optimization Framework for Training Diffusion Models with Noisy Samples(https://arxiv.org/abs/2506.02371)
Keywords: privacy, diffusion, generative
Abstract: Diffusion models achieve strong generative performance but often rely on large datasets that may include sensitive content. This challenge is compounded by the models' tendency to memorize training data, raising privacy concerns. SFBD (Lu et al., 2025) addresses this by training on corrupted data and using limited clean samples to capture local structure and improve convergence. However, its iterative denoising and fine-tuning loop requires manual coordination, making it burdensome to implement. We reinterpret SFBD as an alternating projection algorithm and introduce a continuous variant, SFBD flow, that removes the need for alternating steps. We further show its connection to consistency constraint-based methods, and demonstrate that its practical instantiation, Online SFBD, consistently outperforms strong baselines across benchmarks.

Title: Exploring Explanations Improves the Robustness of In-Context Learning

Authors: Ukyo Honda, Tatsushi Oka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02378
Pdf URL: https://arxiv.org/pdf/2506.02378
Copy Paste: [[2506.02378]] Exploring Explanations Improves the Robustness of In-Context Learning(https://arxiv.org/abs/2506.02378)
Keywords: robust, large language model
Abstract: In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X$^2$-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X$^2$-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.

Title: Univariate to Multivariate: LLMs as Zero-Shot Predictors for Time-Series Forecasting

Authors: Chamara Madarasingha, Nasrin Sohrabi, Zahir Tari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02389
Pdf URL: https://arxiv.org/pdf/2506.02389
Copy Paste: [[2506.02389]] Univariate to Multivariate: LLMs as Zero-Shot Predictors for Time-Series Forecasting(https://arxiv.org/abs/2506.02389)
Keywords: large language model
Abstract: Time-series prediction or forecasting is critical across many real-world dynamic systems, and recent studies have proposed using Large Language Models (LLMs) for this task due to their strong generalization capabilities and ability to perform well without extensive pre-training. However, their effectiveness in handling complex, noisy, and multivariate time-series data remains underexplored. To address this, we propose LLMPred which enhances LLM-based time-series prediction by converting time-series sequences into text and feeding them to LLMs for zero shot prediction along with two main data pre-processing techniques. First, we apply time-series sequence decomposition to facilitate accurate prediction on complex and noisy univariate sequences. Second, we extend this univariate prediction capability to multivariate data using a lightweight prompt-processing strategy. Extensive experiments with smaller LLMs such as Llama 2 7B, Llama 3.2 3B, GPT-4o-mini, and DeepSeek 7B demonstrate that LLMPred achieves competitive or superior performance compared to state-of-the-art baselines. Additionally, a thorough ablation study highlights the importance of the key components proposed in LLMPred.

Title: GAdaBoost: An Efficient and Robust AdaBoost Algorithm Based on Granular-Ball Structure

Authors: Qin Xie, Qinghua Zhang, Shuyin Xia, Xinran Zhou, Guoyin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02390
Pdf URL: https://arxiv.org/pdf/2506.02390
Copy Paste: [[2506.02390]] GAdaBoost: An Efficient and Robust AdaBoost Algorithm Based on Granular-Ball Structure(https://arxiv.org/abs/2506.02390)
Keywords: robust
Abstract: Adaptive Boosting (AdaBoost) faces significant challenges posed by label noise, especially in multiclass classification tasks. Existing methods either lack mechanisms to handle label noise effectively or suffer from high computational costs due to redundant data usage. Inspired by granular computing, this paper proposes granular adaptive boosting (GAdaBoost), a novel two-stage framework comprising a data granulation stage and an adaptive boosting stage, to enhance efficiency and robustness under noisy conditions. To validate its feasibility, an extension of SAMME, termed this http URL, is proposed. Specifically, first, a granular-ball generation method is designed to compress data while preserving diversity and mitigating label noise. Second, the granular ball-based SAMME algorithm focuses on granular balls rather than individual samples, improving efficiency and reducing sensitivity to noise. Experimental results on some noisy datasets show that the proposed approach achieves superior robustness and efficiency compared with existing methods, demonstrating that this work effectively extends AdaBoost and SAMME.

Title: Consultant Decoding: Yet Another Synergistic Mechanism

Authors: Chuanghao Ding, Jiaping Wang, Ziqing Yang, Xiaoliang Wang, Dahua Lin, Cam-Tu Nguyen, Fei Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02391
Pdf URL: https://arxiv.org/pdf/2506.02391
Copy Paste: [[2506.02391]] Consultant Decoding: Yet Another Synergistic Mechanism(https://arxiv.org/abs/2506.02391)
Keywords: large language model
Abstract: The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates require repeated LLMs calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergetic mechanism Consultant Decoding (CD). Unlike SD, which relies on a metric derived from importance sampling for verification, CD verifies candidate drafts using token-level likelihoods computed solely by the LLM. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (around 100% of the target model's performance). Interestingly, this is achieved by combining models whose parameter sizes differ by two orders of magnitude. In addition, CD reduces the call frequency of the large target model to below 10%, particularly in more demanding tasks. CD's performance was even found to surpass that of the large target model, which theoretically represents the upper bound for speculative decoding.

Title: Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning

Authors: Yuanyao Chen, Rongsheng Chen, Fu Luo, Zhenkun Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02392
Pdf URL: https://arxiv.org/pdf/2506.02392
Copy Paste: [[2506.02392]] Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning(https://arxiv.org/abs/2506.02392)
Keywords: large language model
Abstract: Neural Combinatorial Optimization (NCO) has emerged as a promising learning-based paradigm for addressing Vehicle Routing Problems (VRPs) by minimizing the need for extensive manual engineering. While existing NCO methods, trained on small-scale instances (e.g., 100 nodes), have demonstrated considerable success on problems of similar scale, their performance significantly degrades when applied to large-scale scenarios. This degradation arises from the distributional shift between training and testing data, rendering policies learned on small instances ineffective for larger problems. To overcome this limitation, we introduce a novel learning framework driven by Large Language Models (LLMs). This framework learns a projection between the training and testing distributions, which is then deployed to enhance the scalability of the NCO model. Notably, unlike prevailing techniques that necessitate joint training with the neural network, our approach operates exclusively during the inference phase, obviating the need for model retraining. Extensive experiments demonstrate that our method enables a backbone model (trained on 100-node instances) to achieve superior performance on large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) of up to 100K nodes from diverse distributions.

Title: RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection

Authors: Yongxian Liu, Boyang Li, Ting Liu, Zaiping Lin, Wei An
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02393
Pdf URL: https://arxiv.org/pdf/2506.02393
Copy Paste: [[2506.02393]] RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection(https://arxiv.org/abs/2506.02393)
Keywords: extraction
Abstract: Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (e.g. NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and act as a plug and play module to introduce consistent performance improvement for several popular IRSTD methods. Our code will be available at this https URL soon.

Title: The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception

Authors: Xiaofeng Cong, Yu-Xin Zhang, Haoran Wei, Yeying Jin, Junming Hou, Jie Gui, Jing Zhang, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02395
Pdf URL: https://arxiv.org/pdf/2506.02395
Copy Paste: [[2506.02395]] The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception(https://arxiv.org/abs/2506.02395)
Keywords: diffusion, generative
Abstract: While nighttime image dehazing has been extensively studied, converting nighttime hazy images to daytime-equivalent brightness remains largely unaddressed. Existing methods face two critical limitations: (1) datasets overlook the brightness relationship between day and night, resulting in the brightness mapping being inconsistent with the real world during image synthesis; and (2) models do not explicitly incorporate daytime brightness knowledge, limiting their ability to reconstruct realistic lighting. To address these challenges, we introduce the Diffusion-Based Nighttime Dehazing (DiffND) framework, which excels in both data synthesis and lighting reconstruction. Our approach starts with a data synthesis pipeline that simulates severe distortions while enforcing brightness consistency between synthetic and real-world scenes, providing a strong foundation for learning night-to-day brightness mapping. Next, we propose a restoration model that integrates a pre-trained diffusion model guided by a brightness perception network. This design harnesses the diffusion model's generative ability while adapting it to nighttime dehazing through brightness-aware optimization. Experiments validate our dataset's utility and the model's superior performance in joint haze removal and brightness mapping.

Title: Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather

Authors: Longyu Yang, Ping Hu, Shangbo Yuan, Lu Zhang, Jun Liu, Hengtao Shen, Xiaofeng Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02396
Pdf URL: https://arxiv.org/pdf/2506.02396
Copy Paste: [[2506.02396]] Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather(https://arxiv.org/abs/2506.02396)
Keywords: robust, extraction, segmentation
Abstract: Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture designed to independently process geometric and reflectance features initially, thereby capitalizing on their distinct characteristic. Then, GRC adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.

Title: GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation

Authors: Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qianwen Zhang, Di Yin, Xing Sun, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02404
Pdf URL: https://arxiv.org/pdf/2506.02404
Copy Paste: [[2506.02404]] GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation(https://arxiv.org/abs/2506.02404)
Keywords: large language model
Abstract: Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key superiorities: $(i)$ Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. $(ii)$ Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks, multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in twenty core textbooks. $(iii)$ Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.

Title: Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models

Authors: Zhiya Tan, Xin Zhang, Joey Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02405
Pdf URL: https://arxiv.org/pdf/2506.02405
Copy Paste: [[2506.02405]] Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models(https://arxiv.org/abs/2506.02405)
Keywords: transformer, generative
Abstract: As generative techniques become increasingly accessible, authentic visuals are frequently subjected to iterative alterations by various individuals employing a variety of tools. Currently, to avoid misinformation and ensure accountability, a lot of research on detection and attribution is emerging. Although these methods demonstrate promise in single-stage manipulation scenarios, they fall short when addressing complex real-world iterative manipulation. In this paper, we are the first, to the best of our knowledge, to systematically model this real-world challenge and introduce a novel method to solve it. We define a task called "Modelship Attribution", which aims to trace the evolution of manipulated images by identifying the generative models involved and reconstructing the sequence of edits they performed. To realistically simulate this scenario, we utilize three generative models, StyleMapGAN, DiffSwap, and FacePartsSwap, that sequentially modify distinct regions of the same image. This process leads to the creation of the first modelship dataset, comprising 83,700 images (16,740 images*5). Given that later edits often overwrite the fingerprints of earlier models, the focus shifts from extracting blended fingerprints to characterizing each model's distinctive editing patterns. To tackle this challenge, we introduce the modelship attribution transformer (MAT), a purpose-built framework designed to effectively recognize and attribute the contributions of various models within complex, multi-stage manipulation workflows. Through extensive experiments and comparative analysis with other related methods, our results, including comprehensive ablation studies, demonstrate that the proposed approach is a highly effective solution for modelship attribution.

Title: Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

Authors: Wenhao Tang, Rong Qin, Heng Fang, Fengtao Zhou, Hao Chen, Xiang Li, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02408
Pdf URL: https://arxiv.org/pdf/2506.02408
Copy Paste: [[2506.02408]] Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology(https://arxiv.org/abs/2506.02408)
Keywords: extraction
Abstract: Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and calls for greater research focus in this area. The code is this https URL.

Title: SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning

Authors: Zhengyuan Liu, Geyu Lin, Hui Li Tan, Huayun Zhang, Yanfeng Lu, Xiaoxue Gao, Stella Xin Yin, He Sun, Hock Huan Goh, Lung Hsiang Wong, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02412
Pdf URL: https://arxiv.org/pdf/2506.02412
Copy Paste: [[2506.02412]] SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning(https://arxiv.org/abs/2506.02412)
Keywords: robust, generative
Abstract: The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kids-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.

Title: AERO: A Redirection-Based Optimization Framework Inspired by Judo for Robust Probabilistic Forecasting

Authors: Karthikeyan Vaiapury
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02415
Pdf URL: https://arxiv.org/pdf/2506.02415
Copy Paste: [[2506.02415]] AERO: A Redirection-Based Optimization Framework Inspired by Judo for Robust Probabilistic Forecasting(https://arxiv.org/abs/2506.02415)
Keywords: robust
Abstract: Optimization remains a fundamental pillar of machine learning, yet existing methods often struggle to maintain stability and adaptability in dynamic, non linear systems, especially under uncertainty. We introduce AERO (Adversarial Energy-based Redirection Optimization), a novel framework inspired by the redirection principle in Judo, where external disturbances are leveraged rather than resisted. AERO reimagines optimization as a redirection process guided by 15 interrelated axioms encompassing adversarial correction, energy conservation, and disturbance-aware learning. By projecting gradients, integrating uncertainty driven dynamics, and managing learning energy, AERO offers a principled approach to stable and robust model updates. Applied to probabilistic solar energy forecasting, AERO demonstrates substantial gains in predictive accuracy, reliability, and adaptability, especially in noisy and uncertain environments. Our findings highlight AERO as a compelling new direction in the theoretical and practical landscape of optimization.

Title: Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models

Authors: Nurislam Tursynbek, Hastings Greer, Basar Demir, Marc Niethammer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02419
Pdf URL: https://arxiv.org/pdf/2506.02419
Copy Paste: [[2506.02419]] Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models(https://arxiv.org/abs/2506.02419)
Keywords: diffusion
Abstract: Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: this https URL

Title: Gender Inequality in English Textbooks Around the World: an NLP Approach

Authors: Tairan Liu
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2506.02425
Pdf URL: https://arxiv.org/pdf/2506.02425
Copy Paste: [[2506.02425]] Gender Inequality in English Textbooks Around the World: an NLP Approach(https://arxiv.org/abs/2506.02425)
Keywords: large language model
Abstract: Textbooks play a critical role in shaping children's understanding of the world. While previous studies have identified gender inequality in individual countries' textbooks, few have examined the issue cross-culturally. This study applies natural language processing methods to quantify gender inequality in English textbooks from 22 countries across 7 cultural spheres. Metrics include character count, firstness (which gender is mentioned first), and TF-IDF word associations by gender. The analysis also identifies gender patterns in proper names appearing in TF-IDF word lists, tests whether large language models can distinguish between gendered word lists, and uses GloVe embeddings to examine how closely keywords associate with each gender. Results show consistent overrepresentation of male characters in terms of count, firstness, and named entities. All regions exhibit gender inequality, with the Latin cultural sphere showing the least disparity.

Title: Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

Authors: Maryam Berijanian, Kuldeep Singh, Amin Sehati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02426
Pdf URL: https://arxiv.org/pdf/2506.02426
Copy Paste: [[2506.02426]] Comparative Analysis of AI Agent Architectures for Entity Relationship Classification(https://arxiv.org/abs/2506.02426)
Keywords: extraction, large language model
Abstract: Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source codes and dataset are available at \href{this https URL}{this https URL}.

Title: From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models

Authors: Mahammed Kamruzzaman, Abdullah Al Monsur, Gene Louis Kim, Anshuman Chhabra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02431
Pdf URL: https://arxiv.org/pdf/2506.02431
Copy Paste: [[2506.02431]] From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models(https://arxiv.org/abs/2506.02431)
Keywords: large language model
Abstract: Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.

Title: Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals

Authors: Weiheng Yao, Xuhang Chen, Shuqiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02433
Pdf URL: https://arxiv.org/pdf/2506.02433
Copy Paste: [[2506.02433]] Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals(https://arxiv.org/abs/2506.02433)
Keywords: fair, generative
Abstract: Multimodal functional neuroimaging enables systematic analysis of brain mechanisms and provides discriminative representations for brain-computer interface (BCI) decoding. However, its acquisition is constrained by high costs and feasibility limitations. Moreover, underrepresentation of specific groups undermines fairness of BCI decoding model. To address these challenges, we propose a unified representation framework for multimodal functional neuroimaging via generative artificial intelligence (AI). By mapping multimodal functional neuroimaging into a unified representation space, the proposed framework is capable of generating data for acquisition-constrained modalities and underrepresented groups. Experiments show that the framework can generate data consistent with real brain activity patterns, provide insights into brain mechanisms, and improve performance on downstream tasks. More importantly, it can enhance model fairness by augmenting data for underrepresented groups. Overall, the framework offers a new paradigm for decreasing the cost of acquiring multimodal functional neuroimages and enhancing the fairness of BCI decoding models.

Title: A Review of Various Datasets for Machine Learning Algorithm-Based Intrusion Detection System: Advances and Challenges

Authors: Sudhanshu Sekhar Tripathy, Bichitrananda Behera
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02438
Pdf URL: https://arxiv.org/pdf/2506.02438
Copy Paste: [[2506.02438]] A Review of Various Datasets for Machine Learning Algorithm-Based Intrusion Detection System: Advances and Challenges(https://arxiv.org/abs/2506.02438)
Keywords: secure, security, protect, attack
Abstract: IDS aims to protect computer networks from security threats by detecting, notifying, and taking appropriate action to prevent illegal access and protect confidential information. As the globe becomes increasingly dependent on technology and automated processes, ensuring secured systems, applications, and networks has become one of the most significant problems of this era. The global web and digital technology have significantly accelerated the evolution of the modern world, necessitating the use of telecommunications and data transfer platforms. Researchers are enhancing the effectiveness of IDS by incorporating popular datasets into machine learning algorithms. IDS, equipped with machine learning classifiers, enhances security attack detection accuracy by identifying normal or abnormal network traffic. This paper explores the methods of capturing and reviewing intrusion detection systems (IDS) and evaluates the challenges existing datasets face. A deluge of research on machine learning (ML) and deep learning (DL) architecture-based intrusion detection techniques has been conducted in the past ten years on various cybersecurity datasets, including KDDCUP'99, NSL-KDD, UNSW-NB15, CICIDS-2017, and CSE-CIC-IDS2018. We conducted a literature review and presented an in-depth analysis of various intrusion detection methods that use SVM, KNN, DT, LR, NB, RF, XGBOOST, Adaboost, and ANN. We provide an overview of each technique, explaining the role of the classifiers and algorithms used. A detailed tabular analysis highlights the datasets used, classifiers employed, attacks detected, evaluation metrics, and conclusions drawn. This article offers a thorough review for future IDS research.

Title: Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

Authors: Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02439
Pdf URL: https://arxiv.org/pdf/2506.02439
Copy Paste: [[2506.02439]] Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2506.02439)
Keywords: transformer
Abstract: Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at this https URL.

Title: Should LLM Safety Be More Than Refusing Harmful Instructions?

Authors: Utsav Maskey, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02442
Pdf URL: https://arxiv.org/pdf/2506.02442
Copy Paste: [[2506.02442]] Should LLM Safety Be More Than Refusing Harmful Instructions?(https://arxiv.org/abs/2506.02442)
Keywords: attack, robust, large language model
Abstract: This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal-the ability to reject harmful obfuscated instructions, and (2) generation safety-the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLM in long-tail text scenarios and provides directions for developing robust safety mechanisms.

Title: SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Authors: Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02444
Pdf URL: https://arxiv.org/pdf/2506.02444
Copy Paste: [[2506.02444]] SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios(https://arxiv.org/abs/2506.02444)
Keywords: diffusion
Abstract: Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at \href{this https URL}{this https URL}.

Title: IP-Dialog: Evaluating Implicit Personalization in Dialogue Systems with Synthetic Data

Authors: Bo Peng, Zhiheng Wang, Heyang Gong, Chaochao Lu
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.02449
Pdf URL: https://arxiv.org/pdf/2506.02449
Copy Paste: [[2506.02449]] IP-Dialog: Evaluating Implicit Personalization in Dialogue Systems with Synthetic Data(https://arxiv.org/abs/2506.02449)
Keywords: privacy
Abstract: In modern dialogue systems, the ability to implicitly infer user backgrounds from conversations and leverage this information for personalized assistance is crucial. However, the scarcity of high-quality data remains a fundamental challenge to evaluating and improving this capability. Traditional dataset construction methods are labor-intensive, resource-demanding, and raise privacy concerns. To address these issues, we propose a novel approach for automatic synthetic data generation and introduce the Implicit Personalized Dialogue (IP-Dialog) benchmark along with a training dataset, covering 10 tasks and 12 user attribute types. Additionally, we develop a systematic evaluation framework with four metrics to assess both attribute awareness and reasoning capabilities. We further propose five causal graphs to elucidate models' reasoning pathways during implicit personalization. Extensive experiments yield insightful observations and prove the reliability of our dataset.

Title: Weak Supervision for Real World Graphs

Authors: Pratheeksha Nair, Reihaneh Rabbany
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02451
Pdf URL: https://arxiv.org/pdf/2506.02451
Copy Paste: [[2506.02451]] Weak Supervision for Real World Graphs(https://arxiv.org/abs/2506.02451)
Keywords: robust
Abstract: Node classification in real world graphs often suffers from label scarcity and noise, especially in high stakes domains like human trafficking detection and misinformation monitoring. While direct supervision is limited, such graphs frequently contain weak signals, noisy or indirect cues, that can still inform learning. We propose WSNET, a novel weakly supervised graph contrastive learning framework that leverages these weak signals to guide robust representation learning. WSNET integrates graph structure, node features, and multiple noisy supervision sources through a contrastive objective tailored for weakly labeled data. Across three real world datasets and synthetic benchmarks with controlled noise, WSNET consistently outperforms state of the art contrastive and noisy label learning methods by up to 15% in F1 score. Our results highlight the effectiveness of contrastive learning under weak supervision and the promise of exploiting imperfect labels in graph based settings.

Title: ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

Authors: Wenshuo Chen, Kuimou Yu, Haozhe Jia, Kaishen Yuan, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02452
Pdf URL: https://arxiv.org/pdf/2506.02452
Copy Paste: [[2506.02452]] ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model(https://arxiv.org/abs/2506.02452)
Keywords: diffusion
Abstract: While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose **(ANT)**, an **A**daptive **N**eural **T**emporal-Aware architecture. ANT orchestrates semantic granularity through: **(i) Semantic Temporally Adaptive (STA) Module:** Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. **(ii) Dynamic Classifier-Free Guidance scheduling (DCFG):** Adaptively adjusts conditional to unconditional ratio enhancing efficiency while maintaining fidelity. **(iii) Temporal-semantic reweighting:** Quantitatively aligns text influence with phase requirements. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.

Title: Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Authors: Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Minfeng Zhu, Bo Zhang, Wei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02454
Pdf URL: https://arxiv.org/pdf/2506.02454
Copy Paste: [[2506.02454]] Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework(https://arxiv.org/abs/2506.02454)
Keywords: large language model
Abstract: Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite its progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics served as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82\% overall win rate over the baseline method.

Title: ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment

Authors: Martin JJ. Bucher, Iro Armeni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02459
Pdf URL: https://arxiv.org/pdf/2506.02459
Copy Paste: [[2506.02459]] ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment(https://arxiv.org/abs/2506.02459)
Keywords: diffusion, generative
Abstract: Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. In contrast, LLM-based methods enable richer semantics via natural language (e.g., 'modern studio with light wood furniture') but do not support editing, remain limited to rectangular layouts or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a novel voxelization-based evaluation that captures fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on object addition while maintaining competitive results on full scene synthesis.

Title: MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework

Authors: Yupeng Qi, Ziyu Lyu, Min Yang, Yanlin Wang, Lu Bai, Lixin Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02460
Pdf URL: https://arxiv.org/pdf/2506.02460
Copy Paste: [[2506.02460]] MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework(https://arxiv.org/abs/2506.02460)
Keywords: large language model
Abstract: As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies solve this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which might reduce helpfulness, while the safety-constrained offline methods perform poorly in adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a \textbf{\underline{Mi}}xture of Experts (MoE) framework for safety-helpfulness \textbf{\underline{d}}ual \textbf{\underline{P}}reference \textbf{\underline{O}}ptimization. Firstly, MidPO devises single-preference enhanced direct preference optimization approach to transform the base model into two independent experts, termed safety and helpfulness experts, and fine-tunes the two independent experts for optimal safety or helpfulness performance. Secondly, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism to allocate contributions from each expert adaptively. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.

Title: XToM: Exploring the Multilingual Theory of Mind for Large Language Models

Authors: Chunkit Chan, Yauwai Yim, Hongchuan Zeng, Zhiying Zou, Xinyuan Cheng, Zhifan Sun, Zheye Deng, Kawai Chung, Yuzhuo Ao, Yixiang Fan, Cheng Jiayang, Ercong Nie, Ginny Y. Wong, Helmut Schmid, Hinrich Schütze, Simon See, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02461
Pdf URL: https://arxiv.org/pdf/2506.02461
Copy Paste: [[2506.02461]] XToM: Exploring the Multilingual Theory of Mind for Large Language Models(https://arxiv.org/abs/2506.02461)
Keywords: large language model
Abstract: Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.

Title: HRTR: A Single-stage Transformer for Fine-grained Sub-second Action Segmentation in Stroke Rehabilitation

Authors: Halil Ismail Helvaci, Justin Philip Huber, Jihye Bae, Sen-ching Samson Cheung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02472
Pdf URL: https://arxiv.org/pdf/2506.02472
Copy Paste: [[2506.02472]] HRTR: A Single-stage Transformer for Fine-grained Sub-second Action Segmentation in Stroke Rehabilitation(https://arxiv.org/abs/2506.02472)
Keywords: transformer, segmentation
Abstract: Stroke rehabilitation often demands precise tracking of patient movements to monitor progress, with complexities of rehabilitation exercises presenting two critical challenges: fine-grained and sub-second (under one-second) action detection. In this work, we propose the High Resolution Temporal Transformer (HRTR), to time-localize and classify high-resolution (fine-grained), sub-second actions in a single-stage transformer, eliminating the need for multi-stage methods and post-processing. Without any refinements, HRTR outperforms state-of-the-art systems on both stroke related and general datasets, achieving Edit Score (ES) of 70.1 on StrokeRehab Video, 69.4 on StrokeRehab IMU, and 88.4 on 50Salads.

Title: Generative Perception of Shape and Material from Differential Motion

Authors: Xinran Nicole Han, Ko Nishino, Todd Zickler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02473
Pdf URL: https://arxiv.org/pdf/2506.02473
Copy Paste: [[2506.02473]] Generative Perception of Shape and Material from Differential Motion(https://arxiv.org/abs/2506.02473)
Keywords: diffusion, generative
Abstract: Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically-embodied systems.

Title: Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay

Authors: Kunyu Wang, Xueyang Fu, Chengzhi Cao, Chengjie Ge, Wei Zhai, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02477
Pdf URL: https://arxiv.org/pdf/2506.02477
Copy Paste: [[2506.02477]] Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay(https://arxiv.org/abs/2506.02477)
Keywords: generative
Abstract: Current image de-raining methods primarily learn from a limited dataset, leading to inadequate performance in varied real-world rainy conditions. To tackle this, we introduce a new framework that enables networks to progressively expand their de-raining knowledge base by tapping into a growing pool of datasets, significantly boosting their adaptability. Drawing inspiration from the human brain's ability to continuously absorb and generalize from ongoing experiences, our approach borrow the mechanism of the complementary learning system. Specifically, we first deploy Generative Adversarial Networks (GANs) to capture and retain the unique features of new data, mirroring the hippocampus's role in learning and memory. Then, the de-raining network is trained with both existing and GAN-synthesized data, mimicking the process of hippocampal replay and interleaved learning. Furthermore, we employ knowledge distillation with the replayed data to replicate the synergy between the neocortex's activity patterns triggered by hippocampal replays and the pre-existing neocortical knowledge. This comprehensive framework empowers the de-raining network to amass knowledge from various datasets, continually enhancing its performance on previously unseen rainy scenes. Our testing on three benchmark de-raining networks confirms the framework's effectiveness. It not only facilitates continuous knowledge accumulation across six datasets but also surpasses state-of-the-art methods in generalizing to new real-world scenarios.

Title: FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging

Authors: Zijian Li, Xiaocheng Feng, Huixin Liu, Yichong Huang, Ting Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02478
Pdf URL: https://arxiv.org/pdf/2506.02478
Copy Paste: [[2506.02478]] FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging(https://arxiv.org/abs/2506.02478)
Keywords: data-free, large language model
Abstract: With the development of large language models, fine-tuning has emerged as an effective method to enhance performance in specific scenarios by injecting domain-specific knowledge. In this context, model merging techniques provide a solution for fusing knowledge from multiple fine-tuning models by combining their parameters. However, traditional methods often encounter task interference when merging full fine-tuning models, and this problem becomes even more evident in parameter-efficient fine-tuning scenarios. In this paper, we introduce an improvement to the RegMean method, which indirectly leverages the training data to approximate the outputs of the linear layers before and after merging. We propose an adaptive merging method called FroM, which directly measures the model parameters using the Frobenius norm, without any training data. By introducing an additional hyperparameter for control, FroM outperforms baseline methods across various fine-tuning scenarios, alleviating the task interference problem.

Title: BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Authors: Kalyan Nakka, Nitesh Saxena
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2506.02479
Pdf URL: https://arxiv.org/pdf/2506.02479
Copy Paste: [[2506.02479]] BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage(https://arxiv.org/abs/2506.02479)
Keywords: attack, robust, steal, large language model
Abstract: The inherent risk of generating harmful and unsafe content by Large Language Models (LLMs), has highlighted the need for their safety alignment. Various techniques like supervised fine-tuning, reinforcement learning from human feedback, and red-teaming were developed for ensuring the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking by exploiting fundamental information representation of data as continuous bits, rather than leveraging prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, in adversarial perspective, revealed the capabilities of BitBypass in bypassing their safety alignment and tricking them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlights the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.

Title: ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities

Authors: Yifan Duan, Yihong Tang, Kehai Chen, Liqiang Nie, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02480
Pdf URL: https://arxiv.org/pdf/2506.02480
Copy Paste: [[2506.02480]] ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities(https://arxiv.org/abs/2506.02480)
Keywords: large language model
Abstract: High-quality prompts are crucial for eliciting outstanding performance from large language models (LLMs) on complex tasks. Existing research has explored model-driven strategies for prompt optimization. However, these methods often suffer from high computational overhead or require strong optimization capabilities from the model itself, which limits their broad this http URL address these challenges, we propose ORPP (Optimized Role-Playing Prompt),a framework that enhances model performance by optimizing and generating role-playing prompts. The core idea of ORPP is to confine the prompt search space to role-playing scenarios, thereby fully activating the model's intrinsic capabilities through carefully crafted, high-quality role-playing prompts. Specifically, ORPP first performs iterative optimization on a small subset of training samples to generate high-quality role-playing prompts. Then, leveraging the model's few-shot learning capability, it transfers the optimization experience to efficiently generate suitable prompts for the remaining this http URL experimental results show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance. Notably, ORPP demonstrates superior "plug-and-play" capability. In most cases, it can be integrated with various other prompt methods and further enhance their effectiveness.

Title: Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths

Authors: Inderjeet Nair, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02481
Pdf URL: https://arxiv.org/pdf/2506.02481
Copy Paste: [[2506.02481]] Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths(https://arxiv.org/abs/2506.02481)
Keywords: robust
Abstract: Evaluations of LLMs' ethical risks and value inclinations often rely on short-form surveys and psychometric tests, yet real-world use involves long-form, open-ended responses -- leaving value-related risks and preferences in practical settings largely underexplored. In this work, we ask: Do value preferences inferred from short-form tests align with those expressed in long-form outputs? To address this question, we compare value preferences elicited from short-form reactions and long-form responses, varying the number of arguments in the latter to capture users' differing verbosity preferences. Analyzing five LLMs (llama3-8b, gemma2-9b, mistral-7b, qwen2-7b, and olmo-7b), we find (1) a weak correlation between value preferences inferred from short-form and long-form responses across varying argument counts, and (2) similarly weak correlation between preferences derived from any two distinct long-form generation settings. (3) Alignment yields only modest gains in the consistency of value expression. Further, we examine how long-form generation attributes relate to value preferences, finding that argument specificity negatively correlates with preference strength, while representation across scenarios shows a positive correlation. Our findings underscore the need for more robust methods to ensure consistent value expression across diverse applications.

Title: Enhancing Large Language Models with Neurosymbolic Reasoning for Multilingual Tasks

Authors: Sina Bagheri Nezhad, Ameeta Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02483
Pdf URL: https://arxiv.org/pdf/2506.02483
Copy Paste: [[2506.02483]] Enhancing Large Language Models with Neurosymbolic Reasoning for Multilingual Tasks(https://arxiv.org/abs/2506.02483)
Keywords: robust, large language model
Abstract: Large language models (LLMs) often struggle to perform multi-target reasoning in long-context scenarios where relevant information is scattered across extensive documents. To address this challenge, we introduce NeuroSymbolic Augmented Reasoning (NSAR), which combines the benefits of neural and symbolic reasoning during inference. NSAR explicitly extracts symbolic facts from text and generates executable Python code to handle complex reasoning steps. Through extensive experiments across seven languages and diverse context lengths, we demonstrate that NSAR significantly outperforms both a vanilla RAG baseline and advanced prompting strategies in accurately identifying and synthesizing multiple pieces of information. Our results highlight the effectiveness of combining explicit symbolic operations with neural inference for robust, interpretable, and scalable reasoning in multilingual settings.

Title: Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models

Authors: Hongtao Huang, Xiaojun Chang, Lina Yao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02488
Pdf URL: https://arxiv.org/pdf/2506.02488
Copy Paste: [[2506.02488]] Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models(https://arxiv.org/abs/2506.02488)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but are constrained by high computational costs due to iterative multi-step inference. While Neural Architecture Search (NAS) can optimize DMs, existing methods are hindered by retraining requirements, exponential search complexity from step-wise optimization, and slow evaluation relying on massive image generation. To address these challenges, we propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our key insight is to decompose the generation process into flexible segments of equal length, where each segment dynamically combines three step types: full (complete computation), partial (cache-reused computation), and null (skipped computation). This segment-wise search space reduces the candidate pool exponentially compared to step-wise NAS while preserving architectural diversity. Further, we introduce relative FID (rFID), a lightweight evaluation metric for NAS that measures divergence from a teacher model's outputs instead of ground truth, slashing evaluation time by over $90\%$. In practice, Flexiffusion achieves at least $2\times$ acceleration across LDMs, Stable Diffusion, and DDPMs on ImageNet and MS-COCO, with FID degradation under $5\%$, outperforming prior NAS and caching methods. Notably, it attains $5.1\times$ speedup on Stable Diffusion with near-identical CLIP scores. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.

Title: Co-Evidential Fusion with Information Volume for Medical Image Segmentation

Authors: Yuanpeng He, Lijian Li, Tianxiang Zhan, Chi-Man Pun, Wenpin Jiao, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02492
Pdf URL: https://arxiv.org/pdf/2506.02492
Copy Paste: [[2506.02492]] Co-Evidential Fusion with Information Volume for Medical Image Segmentation(https://arxiv.org/abs/2506.02492)
Keywords: segmentation
Abstract: Although existing semi-supervised image segmentation methods have achieved good performance, they cannot effectively utilize multiple sources of voxel-level uncertainty for targeted learning. Therefore, we propose two main improvements. First, we introduce a novel pignistic co-evidential fusion strategy using generalized evidential deep learning, extended by traditional D-S evidence theory, to obtain a more precise uncertainty measure for each voxel in medical samples. This assists the model in learning mixed labeled information and establishing semantic associations between labeled and unlabeled data. Second, we introduce the concept of information volume of mass function (IVUM) to evaluate the constructed evidence, implementing two evidential learning schemes. One optimizes evidential deep learning by combining the information volume of the mass function with original uncertainty measures. The other integrates the learning pattern based on the co-evidential fusion strategy, using IVUM to design a new optimization objective. Experiments on four datasets demonstrate the competitive performance of our method.

Title: Towards In-the-wild 3D Plane Reconstruction from a Single Image

Authors: Jiachen Liu, Rui Yu, Sili Chen, Sharon X. Huang, Hengkai Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02493
Pdf URL: https://arxiv.org/pdf/2506.02493
Copy Paste: [[2506.02493]] Towards In-the-wild 3D Plane Reconstruction from a Single Image(https://arxiv.org/abs/2506.02493)
Keywords: transformer
Abstract: 3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their system on a single dataset from either indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale planar benchmark, comprising over 14 datasets and 560,000 high-resolution, dense planar annotations for diverse indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn plane and offset respectively. Additionally, we employ advanced backbones as image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. Our code and data are available at: this https URL.

Title: LumosFlow: Motion-Guided Long Video Generation

Authors: Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02497
Pdf URL: https://arxiv.org/pdf/2506.02497
Copy Paste: [[2506.02497]] LumosFlow: Motion-Guided Long Video Generation(https://arxiv.org/abs/2506.02497)
Keywords: diffusion
Abstract: Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or generating key frames and then interpolate the intermediate frames in a hierarchical manner. However, both of them still remain significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework introduce motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: this https URL

Title: KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG

Authors: Yongjian Li, HaoCheng Chu, Yukun Yan, Zhenghao Liu, Shi Yu, Zheni Zeng, Ruobing Wang, Sen Song, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02503
Pdf URL: https://arxiv.org/pdf/2506.02503
Copy Paste: [[2506.02503]] KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG(https://arxiv.org/abs/2506.02503)
Keywords: robust, generative, large language model
Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents-even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO)-a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on Github.

Title: M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Authors: Jie Zhu, Junhui Li, Yalong Wen, Xiandong Li, Lifan Guo, Feng Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02510
Pdf URL: https://arxiv.org/pdf/2506.02510
Copy Paste: [[2506.02510]] M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset(https://arxiv.org/abs/2506.02510)
Keywords: extraction, large language model
Abstract: Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called $\texttt{M$^3$FinMeeting}$, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, $\texttt{M$^3$FinMeeting}$ supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, $\texttt{M$^3$FinMeeting}$ includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of $\texttt{M$^3$FinMeeting}$ as a benchmark for assessing LLMs' financial meeting comprehension skills.

Title: RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Authors: Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, Yin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02528
Pdf URL: https://arxiv.org/pdf/2506.02528
Copy Paste: [[2506.02528]] RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers(https://arxiv.org/abs/2506.02528)
Keywords: diffusion, transformer, large language model
Abstract: Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.

Title: Answer Convergence as a Signal for Early Stopping in Reasoning

Authors: Xin Liu, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02536
Pdf URL: https://arxiv.org/pdf/2506.02536
Copy Paste: [[2506.02536]] Answer Convergence as a Signal for Early Stopping in Reasoning(https://arxiv.org/abs/2506.02536)
Keywords: large language model
Abstract: Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, thus increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we start with a systematic study to examine what is the minimum reasoning required for a model to reach a stable decision. We find that on math reasoning tasks like math, models typically converge to their final answers after 60\% of the reasoning steps, suggesting substantial redundancy in the remaining content. Based on these insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weights LLMs show that our methods significantly reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40\% while further improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.

Title: VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

Authors: Hao Yan, Handong Zheng, Hao Wang, Liang Yin, Xingchen Liu, Zhenbiao Cao, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02537
Pdf URL: https://arxiv.org/pdf/2506.02537
Copy Paste: [[2506.02537]] VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning(https://arxiv.org/abs/2506.02537)
Keywords: interpretability, large language model
Abstract: Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at this https URL

Title: VerificAgent: Integrating Expert Knowledge and Fact-Checked Memory for Robust Domain-Specific Task Planning

Authors: Thong Q. Nguyen, Shubhang Desai, Yash Jain, Tanvir Aumi, Vishal Chowdhary
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02539
Pdf URL: https://arxiv.org/pdf/2506.02539
Copy Paste: [[2506.02539]] VerificAgent: Integrating Expert Knowledge and Fact-Checked Memory for Robust Domain-Specific Task Planning(https://arxiv.org/abs/2506.02539)
Keywords: robust
Abstract: Continual memory augmentation allows computer-use agents (CUAs) to learn from past interactions and refine their task-solving strategies over time. However, unchecked memory accumulation can introduce spurious or hallucinated "learnings" that degrade agent performance, particularly in domain-specific workflows such as productivity software. We present a novel framework, VerificAgent, that effectively manages memory for CUAs through (1) an expert-curated seed of domain knowledge, (2) iterative, trajectory-based memory refinement during training, and (3) a post-hoc fact-checking pass by human experts to sanitize accumulated memory before deployment. On OSWorld productivity tasks, VerificAgent achieves a 111.1% relative improvement in success rate over baseline CUA without any additional fine-tuning.

Title: Rethinking Post-Unlearning Behavior of Large Vision-Language Models

Authors: Minsung Kim, Nakyeong Yang, Kyomin Jung
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02541
Pdf URL: https://arxiv.org/pdf/2506.02541
Copy Paste: [[2506.02541]] Rethinking Post-Unlearning Behavior of Large Vision-Language Models(https://arxiv.org/abs/2506.02541)
Keywords: privacy, generative
Abstract: Machine unlearning is used to mitigate the privacy risks of Large Vision-Language Models (LVLMs) arising from training on large-scale web data. However, existing unlearning methods often fail to carefully select substitute outputs for forget targets, resulting in Unlearning Aftermaths-undesirable behaviors such as degenerate, hallucinated, or excessively refused responses. We highlight that, especially for generative LVLMs, it is crucial to consider the quality and informativeness of post-unlearning responses rather than relying solely on naive suppression. To address this, we introduce a new unlearning task for LVLMs that requires models to provide privacy-preserving yet informative and visually grounded responses. We also propose PUBG, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution. Experiments show that, while existing methods suffer from Unlearning Aftermaths despite successfully preventing privacy violations, PUBG effectively mitigates these issues, generating visually grounded and informative responses without privacy leakage for forgotten targets.

Title: CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

Authors: Yang Tian, Fan Liu, Jingyuan Zhang, Victoria W., Yupeng Hu, Liqiang Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02544
Pdf URL: https://arxiv.org/pdf/2506.02544
Copy Paste: [[2506.02544]] CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG(https://arxiv.org/abs/2506.02544)
Keywords: large language model
Abstract: Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose \textbf{C}r\textbf{o}ss-source knowledge \textbf{Re}conciliation for \textbf{M}ulti\textbf{M}odal \textbf{RAG} (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6\% and 9.3\% performance gains on InfoSeek and Encyclopedic-VQA, respectively. We release code and data at \href{this https URL}{this https URL}.

Title: Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems

Authors: Pengfei He, Zhenwei Dai, Xianfeng Tang, Yue Xing, Hui Liu, Jingying Zeng, Qiankun Peng, Shrivats Agrawal, Samarth Varshney, Suhang Wang, Jiliang Tang, Qi He
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02546
Pdf URL: https://arxiv.org/pdf/2506.02546
Copy Paste: [[2506.02546]] Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems(https://arxiv.org/abs/2506.02546)
Keywords: robust, large language model
Abstract: Large Language Model-based Multi-Agent Systems (LLM-MAS) have demonstrated strong capabilities in solving complex tasks but remain vulnerable when agents receive unreliable messages. This vulnerability stems from a fundamental gap: LLM agents treat all incoming messages equally without evaluating their trustworthiness. While some existing studies approach the trustworthiness, they focus on a single type of harmfulness rather than analyze it in a holistic approach from multiple trustworthiness perspectives. In this work, we propose Attention Trust Score (A-Trust), a lightweight, attention-based method for evaluating message trustworthiness. Inspired by human communication literature[1], through systematically analyzing attention behaviors across six orthogonal trust dimensions, we find that certain attention heads in the LLM specialize in detecting specific types of violations. Leveraging these insights, A-Trust directly infers trustworthiness from internal attention patterns without requiring external prompts or verifiers. Building upon A-Trust, we develop a principled and efficient trust management system (TMS) for LLM-MAS, enabling both message-level and agent-level trust assessment. Experiments across diverse multi-agent settings and tasks demonstrate that applying our TMS significantly enhances robustness against malicious inputs.

Title: CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

Authors: Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02548
Pdf URL: https://arxiv.org/pdf/2506.02548
Copy Paste: [[2506.02548]] CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale(https://arxiv.org/abs/2506.02548)
Keywords: security, large language model
Abstract: Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Thoroughly assessing their cybersecurity capabilities is critical and urgent, given the high stakes in this domain. However, existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. To address this gap, we introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities found and patched across 188 large software projects. While it includes tasks of various settings, CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests for vulnerability reproduction, based on text descriptions and corresponding source repositories. Solving this task is particularly challenging, as it requires comprehensive reasoning across entire codebases to locate relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability starting from the program's entry point. Our evaluation across 4 state-of-the-art agent frameworks and 9 LLMs reveals that even the best combination (OpenHands and Claude-3.7-Sonnet) achieves only a 11.9% reproduction success rate, mainly on simpler cases. Beyond reproducing historical vulnerabilities, we find that PoCs generated by LLM agents can reveal new vulnerabilities, identifying 15 zero-days affecting the latest versions of the software projects.

Title: Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

Authors: Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02550
Pdf URL: https://arxiv.org/pdf/2506.02550
Copy Paste: [[2506.02550]] Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025(https://arxiv.org/abs/2506.02550)
Keywords: extraction, transformer, large language model
Abstract: In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at this https URL.

Title: Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective

Authors: Shenghua He, Tian Xia, Xuan Zhou, Hui Wei
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.02553
Pdf URL: https://arxiv.org/pdf/2506.02553
Copy Paste: [[2506.02553]] Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective(https://arxiv.org/abs/2506.02553)
Keywords: large language model
Abstract: We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.

Title: Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

Authors: Shizhan Gong, Yankai Jiang, Qi Dou, Farzan Farnia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02557
Pdf URL: https://arxiv.org/pdf/2506.02557
Copy Paste: [[2506.02557]] Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models(https://arxiv.org/abs/2506.02557)
Keywords: large language model
Abstract: Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance.

Title: DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

Authors: Zixiang Li, Haoyu Wang, Wei Wang, Chuangchuang Tan, Yunchao Wei, Yao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02560
Pdf URL: https://arxiv.org/pdf/2506.02560
Copy Paste: [[2506.02560]] DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing(https://arxiv.org/abs/2506.02560)
Keywords: robust, diffusion
Abstract: Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process.

Title: Pruning General Large Language Models into Customized Expert Models

Authors: Yirao Zhao, Guizhen Chen, Kenji Kawaguchi, Lidong Bing, Wenxuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02561
Pdf URL: https://arxiv.org/pdf/2506.02561
Copy Paste: [[2506.02561]] Pruning General Large Language Models into Customized Expert Models(https://arxiv.org/abs/2506.02561)
Keywords: large language model
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their substantial model sizes often require substantial computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need compact expert models tailored to specific downstream scenarios. However, most existing pruning methods focus on preserving the model's general capabilities, often requiring extensive post-training or suffering from degraded performance due to coarse-grained pruning. In this work, we design a $\underline{Cus}$tom $\underline{Prun}$ing method ($\texttt{Cus-Prun}$) to prune a large general model into a smaller lightweight expert model, which is positioned along the "language", "domain" and "task" dimensions. By identifying and pruning irrelevant neurons of each dimension, $\texttt{Cus-Prun}$ creates expert models without any post-training. Our experiments demonstrate that $\texttt{Cus-Prun}$ consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.

Title: Privacy-Preserving Federated Convex Optimization: Balancing Partial-Participation and Efficiency via Noise Cancellation

Authors: Roie Reshef, Kfir Yehuda Levy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02563
Pdf URL: https://arxiv.org/pdf/2506.02563
Copy Paste: [[2506.02563]] Privacy-Preserving Federated Convex Optimization: Balancing Partial-Participation and Efficiency via Noise Cancellation(https://arxiv.org/abs/2506.02563)
Keywords: privacy, federate
Abstract: This paper tackles the challenge of achieving Differential Privacy (DP) in Federated Learning (FL) under partial-participation, where only a subset of the machines participate in each time-step. While previous work achieved optimal performance in full-participation settings, these methods struggled to extend to partial-participation scenarios. Our approach fills this gap by introducing a novel noise-cancellation mechanism that preserves privacy without sacrificing convergence rates or computational efficiency. We analyze our method within the Stochastic Convex Optimization (SCO) framework and show that it delivers optimal performance for both homogeneous and heterogeneous data distributions. This work expands the applicability of DP in FL, offering an efficient and practical solution for privacy-preserving learning in distributed systems with partial participation.

Title: Contrast & Compress: Learning Lightweight Embeddings for Short Trajectories

Authors: Abhishek Vivekanandan, Christian Hubschneider, J. Marius Zöllner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02571
Pdf URL: https://arxiv.org/pdf/2506.02571
Copy Paste: [[2506.02571]] Contrast & Compress: Learning Lightweight Embeddings for Short Trajectories(https://arxiv.org/abs/2506.02571)
Keywords: robust, interpretability, transformer
Abstract: The ability to retrieve semantically and directionally similar short-range trajectories with both accuracy and efficiency is foundational for downstream applications such as motion forecasting and autonomous navigation. However, prevailing approaches often depend on computationally intensive heuristics or latent anchor representations that lack interpretability and controllability. In this work, we propose a novel framework for learning fixed-dimensional embeddings for short trajectories by leveraging a Transformer encoder trained with a contrastive triplet loss that emphasize the importance of discriminative feature spaces for trajectory data. We analyze the influence of Cosine and FFT-based similarity metrics within the contrastive learning paradigm, with a focus on capturing the nuanced directional intent that characterizes short-term maneuvers. Our empirical evaluation on the Argoverse 2 dataset demonstrates that embeddings shaped by Cosine similarity objectives yield superior clustering of trajectories by both semantic and directional attributes, outperforming FFT-based baselines in retrieval tasks. Notably, we show that compact Transformer architectures, even with low-dimensional embeddings (e.g., 16 dimensions, but qualitatively down to 4), achieve a compelling balance between retrieval performance (minADE, minFDE) and computational overhead, aligning with the growing demand for scalable and interpretable motion priors in real-time systems. The resulting embeddings provide a compact, semantically meaningful, and efficient representation of trajectory data, offering a robust alternative to heuristic similarity measures and paving the way for more transparent and controllable motion forecasting pipelines.

Title: HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference

Authors: Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02572
Pdf URL: https://arxiv.org/pdf/2506.02572
Copy Paste: [[2506.02572]] HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference(https://arxiv.org/abs/2506.02572)
Keywords: large language model
Abstract: Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggled to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-$k$ Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the Top-$k$ attention process. Different from the existing top-k attention methods which are devoted to seeking an absolute estimation of qk score, typically with a great cost, HATA maps queries and keys into binary hash codes, and acquires the relative qk score order with a quite low cost, which is sufficient for realizing top-k attention. Extensive experiments demonstrate that HATA achieves up to 7.2$\times$ speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-$k$ attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at this https URL.

Title: IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages

Authors: Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02573
Pdf URL: https://arxiv.org/pdf/2506.02573
Copy Paste: [[2506.02573]] IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages(https://arxiv.org/abs/2506.02573)
Keywords: large language model
Abstract: Although region-specific large language models (LLMs) are increasingly developed, their safety remains underexplored, particularly in culturally diverse settings like Indonesia, where sensitivity to local norms is essential and highly valued by the community. In this work, we present IndoSafety, the first high-quality, human-verified safety evaluation dataset tailored for the Indonesian context, covering five language varieties: formal and colloquial Indonesian, along with three major local languages: Javanese, Sundanese, and Minangkabau. IndoSafety is constructed by extending prior safety frameworks to develop a taxonomy that captures Indonesia's sociocultural context. We find that existing Indonesian-centric LLMs often generate unsafe outputs, particularly in colloquial and local language settings, while fine-tuning on IndoSafety significantly improves safety while preserving task performance. Our work highlights the critical need for culturally grounded safety evaluation and provides a concrete step toward responsible LLM deployment in multilingual settings. Warning: This paper contains example data that may be offensive, harmful, or biased.

Title: Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM

Authors: Maria Levchenko
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.02589
Pdf URL: https://arxiv.org/pdf/2506.02589
Copy Paste: [[2506.02589]] Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM(https://arxiv.org/abs/2506.02589)
Keywords: transformer, large language model
Abstract: This paper addresses the challenge of Named Entity Recognition (NER) for person names within the specialized domain of Russian news texts concerning cultural events. The study utilizes the unique SPbLitGuide dataset, a collection of event announcements from Saint Petersburg spanning 1999 to 2019. A comparative evaluation of diverse NER models is presented, encompassing established transformer-based architectures such as DeepPavlov, RoBERTa, and SpaCy, alongside recent Large Language Models (LLMs) including GPT-3.5, GPT-4, and GPT-4o. Key findings highlight the superior performance of GPT-4o when provided with specific prompting for JSON output, achieving an F1 score of 0.93. Furthermore, GPT-4 demonstrated the highest precision at 0.99. The research contributes to a deeper understanding of current NER model capabilities and limitations when applied to morphologically rich languages like Russian within the cultural heritage domain, offering insights for researchers and practitioners. Follow-up evaluation with GPT-4.1 (April 2025) achieves F1=0.94 for both simple and structured prompts, demonstrating rapid progress across model families and simplified deployment requirements.

Title: On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures

Authors: Minh Duc Bui, Kyung Eun Park, Goran Glavaš, Fabian David Schmidt, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02591
Pdf URL: https://arxiv.org/pdf/2506.02591
Copy Paste: [[2506.02591]] On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures(https://arxiv.org/abs/2506.02591)
Keywords: large language model
Abstract: Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs' answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.

Title: Beyond the Surface: Measuring Self-Preference in LLM Judgments

Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02592
Pdf URL: https://arxiv.org/pdf/2506.02592
Copy Paste: [[2506.02592]] Beyond the Surface: Measuring Self-Preference in LLM Judgments(https://arxiv.org/abs/2506.02592)
Keywords: large language model
Abstract: Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at this https URL.

Title: EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

Authors: Fan Gao, Dongyuan Li, Ding Xia, Fei Mi, Yasheng Wang, Lifeng Shang, Baojun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02596
Pdf URL: https://arxiv.org/pdf/2506.02596
Copy Paste: [[2506.02596]] EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing(https://arxiv.org/abs/2506.02596)
Keywords: large language model
Abstract: Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose \benchName, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the \textit{Open-Ended} and \textit{Constrained} sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With \benchName, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.

Title: Hyperspectral Image Generation with Unmixing Guided Diffusion Model

Authors: Shiyu Shen, Bin Pan, Ziye Zhang, Zhenwei Shi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.02601
Pdf URL: https://arxiv.org/pdf/2506.02601
Copy Paste: [[2506.02601]] Hyperspectral Image Generation with Unmixing Guided Diffusion Model(https://arxiv.org/abs/2506.02601)
Keywords: diffusion, generative
Abstract: Recently, hyperspectral image generation has received increasing attention, but existing generative models rely on conditional generation schemes, which limits the diversity of generated images. Diffusion models are popular for their ability to generate high-quality samples, but adapting these models from RGB to hyperspectral data presents the challenge of high dimensionality and physical constraints. To address these challenges, we propose a novel diffusion model guided by hyperspectral unmixing. Our model comprises two key modules: an unmixing autoencoder module and an abundance diffusion module. The unmixing autoencoder module leverages unmixing guidance to shift the generative task from the image space to the low-dimensional abundance space, significantly reducing computational complexity while preserving high fidelity. The abundance diffusion module generates samples that satisfy the constraints of non-negativity and unity, ensuring the physical consistency of the reconstructed HSIs. Additionally, we introduce two evaluation metrics tailored to hyperspectral data. Empirical results, evaluated using both traditional metrics and our proposed metrics, indicate that our model is capable of generating high-quality and diverse hyperspectral images, offering an advancement in hyperspectral data generation.

Title: One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

Authors: Xue Wu, Jingwei Xin, Zhijun Tu, Jie Hu, Jie Li, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02605
Pdf URL: https://arxiv.org/pdf/2506.02605
Copy Paste: [[2506.02605]] One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation(https://arxiv.org/abs/2506.02605)
Keywords: diffusion
Abstract: Diffusion-based models have been widely used in various visual generation tasks, showing promising results in image super-resolution (SR), while typically being limited by dozens or even hundreds of sampling steps. Although existing methods aim to accelerate the inference speed of multi-step diffusion-based SR methods through knowledge distillation, their generated images exhibit insufficient semantic alignment with real images, resulting in suboptimal perceptual quality reconstruction, specifically reflected in the CLIPIQA score. These methods still have many challenges in perceptual quality and semantic fidelity. Based on the challenges, we propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for SR, aiming to construct an effective and efficient one-step SR model. Specifically, VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and High-Frequency Perception (HFP) loss. Firstly, the ESS leverages the powerful visual perceptual understanding capabilities of the CLIP model to extract explicit semantic supervision, thereby enhancing semantic consistency. Then, Considering that high-frequency information contributes to the visual perception quality of images, in addition to the vanilla distillation loss, the HFP loss guides the student model to restore the missing high-frequency details in degraded images that are critical for enhancing perceptual quality. Lastly, we expand VPD-SR in adversarial training manner to further enhance the authenticity of the generated content. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.

Title: Simple, Good, Fast: Self-Supervised World Models Free of Baggage

Authors: Jan Robine, Marc Höftmann, Stefan Harmeling
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02612
Pdf URL: https://arxiv.org/pdf/2506.02612
Copy Paste: [[2506.02612]] Simple, Good, Fast: Self-Supervised World Models Free of Baggage(https://arxiv.org/abs/2506.02612)
Keywords: robust, transformer
Abstract: What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF's connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark.

Title: Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies

Authors: Ada Sawilska, Mateusz Trokielewicz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02626
Pdf URL: https://arxiv.org/pdf/2506.02626
Copy Paste: [[2506.02626]] Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies(https://arxiv.org/abs/2506.02626)
Keywords: biometric, diffusion, generative
Abstract: This paper presents a comprehensive overview of iris image synthesis methods, which can alleviate the issues associated with gathering large, diverse datasets of biometric data from living individuals, which are considered pivotal for biometric methods development. These methods for synthesizing iris data range from traditional, hand crafted image processing-based techniques, through various iterations of GAN-based image generators, variational autoencoders (VAEs), as well as diffusion models. The potential and fidelity in iris image generation of each method is discussed and examples of inferred predictions are provided. Furthermore, the risks of individual biometric features leakage from the training sets are considered, together with possible strategies for preventing them, which have to be implemented should these generative methods be considered a valid replacement of real-world biometric datasets.

Title: HAM: A Hyperbolic Step to Regulate Implicit Bias

Authors: Tom Jacobs, Advait Gadhikar, Celia Rubio-Madrigal, Rebekka Burkholz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02630
Pdf URL: https://arxiv.org/pdf/2506.02630
Copy Paste: [[2506.02630]] HAM: A Hyperbolic Step to Regulate Implicit Bias(https://arxiv.org/abs/2506.02630)
Keywords: large language model
Abstract: Understanding the implicit bias of optimization algorithms has become central to explaining the generalization behavior of deep learning models. For instance, the hyperbolic implicit bias induced by the overparameterization $m \odot w$--though effective in promoting sparsity--can result in a small effective learning rate, which slows down convergence. To overcome this obstacle, we propose HAM (Hyperbolic Aware Minimization), which alternates between an optimizer step and a new hyperbolic mirror step. We derive the Riemannian gradient flow for its combination with gradient descent, leading to improved convergence and a similar beneficial hyperbolic geometry as $m \odot w$ for feature learning. We provide an interpretation of the the algorithm by relating it to natural gradient descent, and an exact characterization of its implicit bias for underdetermined linear regression. HAM's implicit bias consistently boosts performance--even of dense training, as we demonstrate in experiments across diverse tasks, including vision, graph and node classification, and large language model fine-tuning. HAM is especially effective in combination with different sparsification methods, improving upon the state of the art. The hyperbolic step requires minimal computational and memory overhead, it succeeds even with small batch sizes, and its implementation integrates smoothly with existing optimizers.

Title: ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration

Authors: Cheng Yang, Lijing Liang, Zhixun Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02633
Pdf URL: https://arxiv.org/pdf/2506.02633
Copy Paste: [[2506.02633]] ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration(https://arxiv.org/abs/2506.02633)
Keywords: robust, diffusion
Abstract: This paper proposes ControlMambaIR, a novel image restoration method designed to address perceptual challenges in image deraining, deblurring, and denoising tasks. By integrating the Mamba network architecture with the diffusion model, the condition network achieves refined conditional control, thereby enhancing the control and optimization of the image generation process. To evaluate the robustness and generalization capability of our method across various image degradation conditions, extensive experiments were conducted on several benchmark datasets, including Rain100H, Rain100L, GoPro, and SSID. The results demonstrate that our proposed approach consistently surpasses existing methods in perceptual quality metrics, such as LPIPS and FID, while maintaining comparable performance in image distortion metrics, including PSNR and SSIM, highlighting its effectiveness and adaptability. Notably, ablation experiments reveal that directly noise prediction in the diffusion process achieves better performance, effectively balancing noise suppression and detail preservation. Furthermore, the findings indicate that the Mamba architecture is particularly well-suited as a conditional control network for diffusion models, outperforming both CNN- and Attention-based approaches in this context. Overall, these results highlight the flexibility and effectiveness of ControlMambaIR in addressing a range of image restoration perceptual challenges.

Title: A Pretrained Probabilistic Transformer for City-Scale Traffic Volume Prediction

Authors: Shiyu Shen, Bin Pan, Guirong Xue
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02654
Pdf URL: https://arxiv.org/pdf/2506.02654
Copy Paste: [[2506.02654]] A Pretrained Probabilistic Transformer for City-Scale Traffic Volume Prediction(https://arxiv.org/abs/2506.02654)
Keywords: robust, transformer
Abstract: City-scale traffic volume prediction plays a pivotal role in intelligent transportation systems, yet remains a challenge due to the inherent incompleteness and bias in observational data. Although deep learning-based methods have shown considerable promise, most existing approaches produce deterministic point estimates, thereby neglecting the uncertainty arising from unobserved traffic flows. Furthermore, current models are typically trained in a city-specific manner, which hinders their generalizability and limits scalability across diverse urban contexts. To overcome these limitations, we introduce TrafficPPT, a Pretrained Probabilistic Transformer designed to model traffic volume as a distributional aggregation of trajectories. Our framework fuses heterogeneous data sources-including real-time observations, historical trajectory data, and road network topology-enabling robust and uncertainty-aware traffic inference. TrafficPPT is initially pretrained on large-scale simulated data spanning multiple urban scenarios, and later fine-tuned on target cities to ensure effective domain adaptation. Experiments on real-world datasets show that TrafficPPT consistently surpasses state-of-the-art baselines, particularly under conditions of extreme data sparsity. Code will be open.

Title: Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs

Authors: Manon Reusens, Bart Baesens, David Jurgens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02659
Pdf URL: https://arxiv.org/pdf/2506.02659
Copy Paste: [[2506.02659]] Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs(https://arxiv.org/abs/2506.02659)
Keywords: large language model
Abstract: Personalized Large Language Models (LLMs) are increasingly used in diverse applications, where they are assigned a specific persona - such as a happy high school teacher - to guide their responses. While prior research has examined how well LLMs adhere to predefined personas in writing style, a comprehensive analysis of consistency across different personas and task types is lacking. In this paper, we introduce a new standardized framework to analyze consistency in persona-assigned LLMs. We define consistency as the extent to which a model maintains coherent responses when assigned the same persona across different tasks and runs. Our framework evaluates personas across four different categories (happiness, occupation, personality, and political stance) spanning multiple task dimensions (survey writing, essay generation, social media post generation, single turn, and multi-turn conversations). Our findings reveal that consistency is influenced by multiple factors, including the assigned persona, stereotypes, and model design choices. Consistency also varies across tasks, increasing with more structured tasks and additional context. All code is available on GitHub.

Title: Tarallo: Evading Behavioral Malware Detectors in the Problem Space

Authors: Gabriele Digregorio, Salvatore Maccarrone, Mario D'Onghia, Luigi Gallo, Michele Carminati, Mario Polino, Stefano Zanero
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02660
Pdf URL: https://arxiv.org/pdf/2506.02660
Copy Paste: [[2506.02660]] Tarallo: Evading Behavioral Malware Detectors in the Problem Space(https://arxiv.org/abs/2506.02660)
Keywords: attack
Abstract: Machine learning algorithms can effectively classify malware through dynamic behavior but are susceptible to adversarial attacks. Existing attacks, however, often fail to find an effective solution in both the feature and problem spaces. This issue arises from not addressing the intrinsic nondeterministic nature of malware, namely executing the same sample multiple times may yield significantly different behaviors. Hence, the perturbations computed for a specific behavior may be ineffective for others observed in subsequent executions. In this paper, we show how an attacker can augment their chance of success by leveraging a new and more efficient feature space algorithm for sequential data, which we have named PS-FGSM, and by adopting two problem space strategies specially tailored to address nondeterminism in the problem space. We implement our novel algorithm and attack strategies in Tarallo, an end-to-end adversarial framework that significantly outperforms previous works in both white and black-box scenarios. Our preliminary analysis in a sandboxed environment and against two RNN-based malware detectors, shows that Tarallo achieves a success rate up to 99% on both feature and problem space attacks while significantly minimizing the number of modifications required for misclassification.

Title: Beyond Invisibility: Learning Robust Visible Watermarks for Stronger Copyright Protection

Authors: Tianci Liu, Tong Yang, Quan Zhang, Qi Lei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02665
Pdf URL: https://arxiv.org/pdf/2506.02665
Copy Paste: [[2506.02665]] Beyond Invisibility: Learning Robust Visible Watermarks for Stronger Copyright Protection(https://arxiv.org/abs/2506.02665)
Keywords: security, protect, robust, watermark
Abstract: As AI advances, copyrighted content faces growing risk of unauthorized use, whether through model training or direct misuse. Building upon invisible adversarial perturbation, recent works developed copyright protections against specific AI techniques such as unauthorized personalization through DreamBooth that are misused. However, these methods offer only short-term security, as they require retraining whenever the underlying model architectures change. To establish long-term protection aiming at better robustness, we go beyond invisible perturbation, and propose a universal approach that embeds \textit{visible} watermarks that are \textit{hard-to-remove} into images. Grounded in a new probabilistic and inverse problem-based formulation, our framework maximizes the discrepancy between the \textit{optimal} reconstruction and the original content. We develop an effective and efficient approximation algorithm to circumvent a intractable bi-level optimization. Experimental results demonstrate superiority of our approach across diverse scenarios.

Title: Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

Authors: Xiao Chen, Jiazhen Huang, Qinting Jiang, Fanding Huang, Xianghua Fu, Jingyan Jiang, Zhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02671
Pdf URL: https://arxiv.org/pdf/2506.02671
Copy Paste: [[2506.02671]] Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet(https://arxiv.org/abs/2506.02671)
Keywords: robust
Abstract: Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference. However, existing approaches often incur substantial computational costs and exhibit poor scalability, primarily due to sample-wise adaptation granularity and reliance on costly auxiliary designs such as data augmentation. To address these limitations, we introduce SAIL (Small Aid, Big Leap), a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation. As SAIL's core, a frozen pre-trained VLM collaborates with AdaptNet through a confidence-based interpolation weight, generating robust predictions during inference. These predictions serve as self-supervised targets to align AdaptNet's outputs through efficient batch-wise processing, dramatically reducing computational costs without modifying the VLM or requiring memory caches. To mitigate catastrophic forgetting during continual adaptation, we propose a gradient-aware reset strategy driven by a gradient drift indicator (GDI), which dynamically detects domain transitions and strategically resets AdaptNet for stable adaptation. Extensive experiments across diverse benchmarks on two scenarios demonstrate that SAIL achieves state-of-the-art performance while maintaining low computational costs. These results highlight SAIL's effectiveness, efficiency and scalability for real-world deployment. The code will be released upon acceptance.

Title: EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Authors: Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Wei Chengzhi, Lin Yan, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02672
Pdf URL: https://arxiv.org/pdf/2506.02672
Copy Paste: [[2506.02672]] EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving(https://arxiv.org/abs/2506.02672)
Keywords: large language model
Abstract: We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.

Title: Decentralized COVID-19 Health System Leveraging Blockchain

Authors: Lingsheng Chen, Shipeng Ye, Xiaoqi Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02674
Pdf URL: https://arxiv.org/pdf/2506.02674
Copy Paste: [[2506.02674]] Decentralized COVID-19 Health System Leveraging Blockchain(https://arxiv.org/abs/2506.02674)
Keywords: privacy
Abstract: With the development of the Internet, the amount of data generated by the medical industry each year has grown exponentially. The Electronic Health Record (EHR) manages the electronic data generated during the user's treatment process. Typically, an EHR data manager belongs to a medical institution. This traditional centralized data management model has many unreasonable or inconvenient aspects, such as difficulties in data sharing, and it is hard to verify the authenticity and integrity of the data. The decentralized, non-forgeable, data unalterable and traceable features of blockchain are in line with the application requirements of EHR. This paper takes the most common COVID-19 as the application scenario and designs a COVID-19 health system based on blockchain, which has extensive research and application value. Considering that the public and transparent nature of blockchain violates the privacy requirements of some health data, in the system design stage, from the perspective of practical application, the data is divided into public data and private data according to its characteristics. For private data, data encryption methods are adopted to ensure data privacy. The searchable encryption technology is combined with blockchain technology to achieve the retrieval function of encrypted data. Then, the proxy re-encryption technology is used to realize authorized access to data. In the system implementation part, based on the Hyperledger Fabric architecture, some functions of the system design are realized, including data upload, retrieval of the latest data and historical data. According to the environment provided by the development architecture, Go language chaincode (smart contract) is written to implement the relevant system functions.

Title: Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation

Authors: Jintao Tong, Yixiong Zou, Guangyao Chen, Yuhua Li, Ruixuan Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02677
Pdf URL: https://arxiv.org/pdf/2506.02677
Copy Paste: [[2506.02677]] Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation(https://arxiv.org/abs/2506.02677)
Keywords: segmentation
Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a source-domain dataset to unseen target-domain datasets with limited annotations. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find an entanglement problem exists in this widely adopted method, which tends to bind sourcedomain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find the decomposed ViT components are crossly compared between images in distance calculation, where the rational comparisons are entangled with those meaningless ones by their equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weigh for all comparisons of ViT components, which learn disentangled features and re-compose them for the CD-FSS task, benefiting both the generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.

Title: TL;DR: Too Long, Do Re-weighting for Effcient LLM Reasoning Compression

Authors: Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Ying Nian Wu, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu
Subjects: cs.CL, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2506.02678
Pdf URL: https://arxiv.org/pdf/2506.02678
Copy Paste: [[2506.02678]] TL;DR: Too Long, Do Re-weighting for Effcient LLM Reasoning Compression(https://arxiv.org/abs/2506.02678)
Keywords: large language model
Abstract: Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially during inference with extremely long outputs--has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.

Title: Poster: FedBlockParadox -- A Framework for Simulating and Securing Decentralized Federated Learning

Authors: Gabriele Digregorio, Francesco Bleggi, Federico Caroli, Michele Carminati, Stefano Zanero, Stefano Longari
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02679
Pdf URL: https://arxiv.org/pdf/2506.02679
Copy Paste: [[2506.02679]] Poster: FedBlockParadox -- A Framework for Simulating and Securing Decentralized Federated Learning(https://arxiv.org/abs/2506.02679)
Keywords: secure, privacy, attack, robust, federate
Abstract: A significant body of research in decentralized federated learning focuses on combining the privacy-preserving properties of federated learning with the resilience and transparency offered by blockchain-based systems. While these approaches are promising, they often lack flexible tools to evaluate system robustness under adversarial conditions. To fill this gap, we present FedBlockParadox, a modular framework for modeling and evaluating decentralized federated learning systems built on blockchain technologies, with a focus on resilience against a broad spectrum of adversarial attack scenarios. It supports multiple consensus protocols, validation methods, aggregation strategies, and configurable attack models. By enabling controlled experiments, FedBlockParadox provides a valuable resource for researchers developing secure, decentralized learning solutions. The framework is open-source and built to be extensible by the community.

Title: Solving Inverse Problems with FLAIR

Authors: Julius Erbach, Dominik Narnhofer, Andreas Dombos, Bernt Schiele, Jan Eric Lenssen, Konrad Schindler
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.02680
Pdf URL: https://arxiv.org/pdf/2506.02680
Copy Paste: [[2506.02680]] Solving Inverse Problems with FLAIR(https://arxiv.org/abs/2506.02680)
Keywords: diffusion, generative
Abstract: Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the encoding into a lower-dimensional latent space makes the underlying (forward) mapping non-linear; (ii) the data likelihood term is usually intractable; and (iii) learned generative models struggle to recover rare, atypical data modes during inference. We present FLAIR, a novel training free variational framework that leverages flow-based generative models as a prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to recover atypical modes. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity.

Title: Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints

Authors: Zhengdong Lu, Weikai Lu, Yiling Tao, Yun Dai, ZiXuan Chen, Huiping Zhuang, Cen Chen, Hao Peng, Ziqian Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02683
Pdf URL: https://arxiv.org/pdf/2506.02683
Copy Paste: [[2506.02683]] Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints(https://arxiv.org/abs/2506.02683)
Keywords: large language model
Abstract: Despite significant advances in Large Language Models (LLMs), planning tasks still present challenges for LLM-based agents. Existing planning methods face two key limitations: heavy constraints and cascading errors. To address these limitations, we propose a novel parallel planning paradigm, which Decomposes, Plans for subtasks in Parallel, and Merges subplans into a final plan (DPPM). Specifically, DPPM decomposes the complex task based on constraints into subtasks, generates the subplan for each subtask in parallel, and merges them into a global plan. In addition, our approach incorporates a verification and refinement module, enabling error correction and conflict resolution. Experimental results demonstrate that DPPM significantly outperforms existing methods in travel planning tasks.

Title: MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

Authors: Liang Yue, Yihong Tang, Kehai Chen, Jie Liu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02689
Pdf URL: https://arxiv.org/pdf/2506.02689
Copy Paste: [[2506.02689]] MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching(https://arxiv.org/abs/2506.02689)
Keywords: large language model
Abstract: Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models' instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.

Title: XicorAttention: Time Series Transformer Using Attention with Nonlinear Correlation

Authors: Daichi Kimura, Tomonori Izumitani, Hisashi Kashima
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02694
Pdf URL: https://arxiv.org/pdf/2506.02694
Copy Paste: [[2506.02694]] XicorAttention: Time Series Transformer Using Attention with Nonlinear Correlation(https://arxiv.org/abs/2506.02694)
Keywords: transformer
Abstract: Various Transformer-based models have been proposed for time series forecasting. These models leverage the self-attention mechanism to capture long-term temporal or variate dependencies in sequences. Existing methods can be divided into two approaches: (1) reducing computational cost of attention by making the calculations sparse, and (2) reshaping the input data to aggregate temporal features. However, existing attention mechanisms may not adequately capture inherent nonlinear dependencies present in time series data, leaving room for improvement. In this study, we propose a novel attention mechanism based on Chatterjee's rank correlation coefficient, which measures nonlinear dependencies between variables. Specifically, we replace the matrix multiplication in standard attention mechanisms with this rank coefficient to measure the query-key relationship. Since computing Chatterjee's correlation coefficient involves sorting and ranking operations, we introduce a differentiable approximation employing SoftSort and SoftRank. Our proposed mechanism, ``XicorAttention,'' integrates it into several state-of-the-art Transformer models. Experimental results on real-world datasets demonstrate that incorporating nonlinear correlation into the attention improves forecasting accuracy by up to approximately 9.1\% compared to existing models.

Title: LayoutRAG: Retrieval-Augmented Model for Content-agnostic Conditional Layout Generation

Authors: Yuxuan Wu, Le Wang, Sanping Zhou, Mengnan Liu, Gang Hua, Haoxiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02697
Pdf URL: https://arxiv.org/pdf/2506.02697
Copy Paste: [[2506.02697]] LayoutRAG: Retrieval-Augmented Model for Content-agnostic Conditional Layout Generation(https://arxiv.org/abs/2506.02697)
Keywords: diffusion
Abstract: Controllable layout generation aims to create plausible visual arrangements of element bounding boxes within a graphic design according to certain optional constraints, such as the type or position of a specific component. While recent diffusion or flow-matching models have achieved considerable advances in multifarious conditional generation tasks, there remains considerable room for generating optimal arrangements under given conditions. In this work, we propose to carry out layout generation through retrieving by conditions and reference-guided generation. Specifically, we retrieve appropriate layout templates according to given conditions as references. The references are then utilized to guide the denoising or flow-based transport process. By retrieving layouts compatible with the given conditions, we can uncover the potential information not explicitly provided in the given condition. Such an approach offers more effective guidance to the model during the generation process, in contrast to previous models that feed the condition to the model and let the model infer the unprovided layout attributes directly. Meanwhile, we design a condition-modulated attention that selectively absorbs retrieval knowledge, adapting to the difference between retrieved templates and given conditions. Extensive experiment results show that our method successfully produces high-quality layouts that meet the given conditions and outperforms existing state-of-the-art models. Code will be released upon acceptance.

Title: Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences

Authors: Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02698
Pdf URL: https://arxiv.org/pdf/2506.02698
Copy Paste: [[2506.02698]] Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences(https://arxiv.org/abs/2506.02698)
Keywords: diffusion
Abstract: Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: \textit{preferences vary across individuals and should be represented with more granularity.} To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Our SmPO-Diffusion achieves state-of-the-art performance in preference evaluation, outperforming baselines across metrics with lower training costs. The project page is this https URL.

Title: On Entity Identification in Language Models

Authors: Masaki Sakata, Sho Yokoi, Benjamin Heinzerling, Takumi Ito, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02701
Pdf URL: https://arxiv.org/pdf/2506.02701
Copy Paste: [[2506.02701]] On Entity Identification in Language Models(https://arxiv.org/abs/2506.02701)
Keywords: transformer
Abstract: We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions -- ambiguity and variability -- and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.

Title: Privacy Leaks by Adversaries: Adversarial Iterations for Membership Inference Attack

Authors: Jing Xue, Zhishen Sun, Haishan Ye, Luo Luo, Xiangyu Chang, Ivor Tsang, Guang Dai
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02711
Pdf URL: https://arxiv.org/pdf/2506.02711
Copy Paste: [[2506.02711]] Privacy Leaks by Adversaries: Adversarial Iterations for Membership Inference Attack(https://arxiv.org/abs/2506.02711)
Keywords: privacy, attack, membership infer
Abstract: Membership inference attack (MIA) has become one of the most widely used and effective methods for evaluating the privacy risks of machine learning models. These attacks aim to determine whether a specific sample is part of the model's training set by analyzing the model's output. While traditional membership inference attacks focus on leveraging the model's posterior output, such as confidence on the target sample, we propose IMIA, a novel attack strategy that utilizes the process of generating adversarial samples to infer membership. We propose to infer the member properties of the target sample using the number of iterations required to generate its adversarial sample. We conduct experiments across multiple models and datasets, and our results demonstrate that the number of iterations for generating an adversarial sample is a reliable feature for membership inference, achieving strong performance both in black-box and white-box attack scenarios. This work provides a new perspective for evaluating model privacy and highlights the potential of adversarial example-based features for privacy leakage assessment.

Title: Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems

Authors: Guanzhong Chen, Shaoxiong Yang, Chao Li, Wei Liu, Jian Luan, Zenglin Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02718
Pdf URL: https://arxiv.org/pdf/2506.02718
Copy Paste: [[2506.02718]] Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems(https://arxiv.org/abs/2506.02718)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks, yet their deployment in real-world applications is hindered by fixed knowledge cutoffs and difficulties in generating controllable, accurate outputs in a single inference. Multi-agent systems (MAS) built from specialized LLM agents offer a promising solution, enabling dynamic collaboration and iterative reasoning. However, optimizing these systems remains a challenge, as conventional methods such as prompt engineering and supervised fine-tuning entail high engineering overhead and limited adaptability. Reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), provides a scalable framework by refining agent policies based on system-level feedback. Nevertheless, existing MARL algorithms, such as Multi-Agent Proximal Policy Optimization (MAPPO), rely on Critic networks, which can cause training instability and increase computational burden. To address these limitations and target the prototypical Multi-Agent Search System (MASS), we propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), a novel Critic-free algorithm that guides policy updates by estimating relative reward advantages across heterogeneous groups of rollouts. MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead. Additionally, we introduce three group rollout sampling strategies that trade off between efficiency and effectiveness. Experiments on a multi-agent LLM-based search system demonstrate that MHGPO consistently outperforms MAPPO in both task performance and computational efficiency, without requiring warm-up, underscoring its potential for stable and scalable optimization of complex LLM-based MAS.

Title: RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

Authors: Qihang Yan, Xinyu Zhang, Luming Guo, Qi Zhang, Feifan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02726
Pdf URL: https://arxiv.org/pdf/2506.02726
Copy Paste: [[2506.02726]] RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models(https://arxiv.org/abs/2506.02726)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs' knowledge application, reasoning reliability, and process transparency in complex vertical domains.

Title: GeneA-SLAM2: Dynamic SLAM with AutoEncoder-Preprocessed Genetic Keypoints Resampling and Depth Variance-Guided Dynamic Region Removal

Authors: Shufan Qing, Anzhen Li, Qiandi Wang, Yuefeng Niu, Mingchen Feng, Guoliang Hu, Jinqiao Wu, Fengtao Nan, Yingchun Fan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.02736
Pdf URL: https://arxiv.org/pdf/2506.02736
Copy Paste: [[2506.02736]] GeneA-SLAM2: Dynamic SLAM with AutoEncoder-Preprocessed Genetic Keypoints Resampling and Depth Variance-Guided Dynamic Region Removal(https://arxiv.org/abs/2506.02736)
Keywords: robust, segmentation
Abstract: Existing semantic SLAM in dynamic environments mainly identify dynamic regions through object detection or semantic segmentation methods. However, in certain highly dynamic scenarios, the detection boxes or segmentation masks cannot fully cover dynamic regions. Therefore, this paper proposes a robust and efficient GeneA-SLAM2 system that leverages depth variance constraints to handle dynamic scenes. Our method extracts dynamic pixels via depth variance and creates precise depth masks to guide the removal of dynamic objects. Simultaneously, an autoencoder is used to reconstruct keypoints, improving the genetic resampling keypoint algorithm to obtain more uniformly distributed keypoints and enhance the accuracy of pose estimation. Our system was evaluated on multiple highly dynamic sequences. The results demonstrate that GeneA-SLAM2 maintains high accuracy in dynamic scenes compared to current methods. Code is available at: this https URL.

Title: Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Authors: Negin Baghbanzadeh, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02738
Pdf URL: https://arxiv.org/pdf/2506.02738
Copy Paste: [[2506.02738]] Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning(https://arxiv.org/abs/2506.02738)
Keywords: robust, extraction, transformer
Abstract: Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image-text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline based on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, and achieving state-of-the-art performance on both ImageCLEF 2016 and synthetic benchmarks. Using this pipeline, we release OPEN-PMC-18M, a large-scale high quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure-caption pairs spanning radiology, microscopy, and visible light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.

Title: RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

Authors: Chuanyu Fu, Yuqi Zhang, Kunbin Yao, Guanying Chen, Yuan Xiong, Chuan Huang, Shuguang Cui, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02751
Pdf URL: https://arxiv.org/pdf/2506.02751
Copy Paste: [[2506.02751]] RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS(https://arxiv.org/abs/2506.02751)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method. Our project page is this https URL.

Title: Multi-task Learning with Active Learning for Arabic Offensive Speech Detection

Authors: Aisha Alansari, Hamzah Luqman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02753
Pdf URL: https://arxiv.org/pdf/2506.02753
Copy Paste: [[2506.02753]] Multi-task Learning with Active Learning for Arabic Offensive Speech Detection(https://arxiv.org/abs/2506.02753)
Keywords: security
Abstract: The rapid growth of social media has amplified the spread of offensive, violent, and vulgar speech, which poses serious societal and cybersecurity concerns. Detecting such content in Arabic text is particularly complex due to limited labeled data, dialectal variations, and the language's inherent complexity. This paper proposes a novel framework that integrates multi-task learning (MTL) with active learning to enhance offensive speech detection in Arabic social media text. By jointly training on two auxiliary tasks, violent and vulgar speech, the model leverages shared representations to improve the detection accuracy of the offensive speech. Our approach dynamically adjusts task weights during training to balance the contribution of each task and optimize performance. To address the scarcity of labeled data, we employ an active learning strategy through several uncertainty sampling techniques to iteratively select the most informative samples for model training. We also introduce weighted emoji handling to better capture semantic cues. Experimental results on the OSACT2022 dataset show that the proposed framework achieves a state-of-the-art macro F1-score of 85.42%, outperforming existing methods while using significantly fewer fine-tuning samples. The findings of this study highlight the potential of integrating MTL with active learning for efficient and accurate offensive language detection in resource-constrained settings.

Title: Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection

Authors: Ruiying Lu, Jinhan Liu, Chuan Du, Dandan Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02757
Pdf URL: https://arxiv.org/pdf/2506.02757
Copy Paste: [[2506.02757]] Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection(https://arxiv.org/abs/2506.02757)
Keywords: interpretability
Abstract: Tabular anomaly detection, which aims at identifying deviant samples, has been crucial in a variety of real-world applications, such as medical disease identification, financial fraud detection, intrusion monitoring, etc. Although recent deep learning-based methods have achieved competitive performances, these methods suffer from representation entanglement and the lack of global correlation modeling, which hinders anomaly detection performance. To tackle the problem, we incorporate mask modeling and prototype learning into tabular anomaly detection. The core idea is to design learnable masks by disentangled representation learning within a projection space and extracting normal dependencies as explicit global prototypes. Specifically, the overall model involves two parts: (i) During encoding, we perform mask modeling in both the data space and projection space with orthogonal basis vectors for learning shared disentangled normal patterns; (ii) During decoding, we decode multiple masked representations in parallel for reconstruction and learn association prototypes to extract normal characteristic correlations. Our proposal derives from a distribution-matching perspective, where both projection space learning and association prototype learning are formulated as optimal transport problems, and the calibration distances are utilized to refine the anomaly scores. Quantitative and qualitative experiments on 20 tabular benchmarks demonstrate the effectiveness and interpretability of our model.

Title: Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs

Authors: Stefano Bannò, Kate Knill, Mark Gales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02758
Pdf URL: https://arxiv.org/pdf/2506.02758
Copy Paste: [[2506.02758]] Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs(https://arxiv.org/abs/2506.02758)
Keywords: large language model
Abstract: Vocabulary use is a fundamental aspect of second language (L2) proficiency. To date, its assessment by automated systems has typically examined the context-independent, or part-of-speech (PoS) related use of words. This paper introduces a novel approach to enable fine-grained vocabulary evaluation exploiting the precise use of words within a sentence. The scheme combines large language models (LLMs) with the English Vocabulary Profile (EVP). The EVP is a standard lexical resource that enables in-context vocabulary use to be linked with proficiency level. We evaluate the ability of LLMs to assign proficiency levels to individual words as they appear in L2 learner writing, addressing key challenges such as polysemy, contextual variation, and multi-word expressions. We compare LLMs to a PoS-based baseline. LLMs appear to exploit additional semantic information that yields improved performance. We also explore correlations between word-level proficiency and essay-level proficiency. Finally, the approach is applied to examine the consistency of the EVP proficiency levels. Results show that LLMs are well-suited for the task of vocabulary assessment.

Title: Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations

Authors: Fatma Youssef Mohammed, Kostas Alexis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02764
Pdf URL: https://arxiv.org/pdf/2506.02764
Copy Paste: [[2506.02764]] Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations(https://arxiv.org/abs/2506.02764)
Keywords: transformer
Abstract: Computational human attention modeling in free-viewing and task-specific settings is often studied separately, with limited exploration of whether a common representation exists between them. This work investigates this question and proposes a neural network architecture that builds upon the Human Attention transformer (HAT) to test the hypothesis. Our results demonstrate that free-viewing and visual search can efficiently share a common representation, allowing a model trained in free-viewing attention to transfer its knowledge to task-driven visual search with a performance drop of only 3.86% in the predicted fixation scanpaths, measured by the semantic sequence score (SemSS) metric which reflects the similarity between predicted and human scanpaths. This transfer reduces computational costs by 92.29% in terms of GFLOPs and 31.23% in terms of trainable parameters.

Title: A Dynamic Transformer Network for Vehicle Detection

Authors: Chunwei Tian, Kai Liu, Bob Zhang, Zhixiang Huang, Chia-Wen Lin, David Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02765
Pdf URL: https://arxiv.org/pdf/2506.02765
Copy Paste: [[2506.02765]] A Dynamic Transformer Network for Vehicle Detection(https://arxiv.org/abs/2506.02765)
Keywords: transformer
Abstract: Stable consumer electronic systems can assist traffic better. Good traffic consumer electronic systems require collaborative work between traffic algorithms and hardware. However, performance of popular traffic algorithms containing vehicle detection methods based on deep networks via learning data relation rather than learning differences in different lighting and occlusions is limited. In this paper, we present a dynamic Transformer network for vehicle detection (DTNet). DTNet utilizes a dynamic convolution to guide a deep network to dynamically generate weights to enhance adaptability of an obtained detector. Taking into relations of different information account, a mixed attention mechanism based channel attention and Transformer is exploited to strengthen relations of channels and pixels to extract more salient information for vehicle detection. To overcome the drawback of difference in an image account, a translation-variant convolution relies on spatial location information to refine obtained structural information for vehicle detection. Experimental results illustrate that our DTNet is competitive for vehicle detection. Code of the proposed DTNet can be obtained at this https URL.

Title: FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts

Authors: Tongyuan Bai, Wangyuanfan Bai, Dong Chen, Tieru Wu, Manyi Li, Rui Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02781
Pdf URL: https://arxiv.org/pdf/2506.02781
Copy Paste: [[2506.02781]] FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts(https://arxiv.org/abs/2506.02781)
Keywords: diffusion, transformer
Abstract: Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, that is convenient but lacks fine-grained scene customization, or employ graph based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene this http URL, FreeScene supports free-form user inputs including text description and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability in a range of applications.

Title: Automated Measurement of Optic Nerve Sheath Diameter Using Ocular Ultrasound Video

Authors: Renxing Li, Weiyi Tang, Peiqi Li, Qiming Huang, Jiayuan She, Shengkai Li, Haoran Xu, Yeyun Wan, Jing Liu, Hailong Fu, Xiang Li, Jiangang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02789
Pdf URL: https://arxiv.org/pdf/2506.02789
Copy Paste: [[2506.02789]] Automated Measurement of Optic Nerve Sheath Diameter Using Ocular Ultrasound Video(https://arxiv.org/abs/2506.02789)
Keywords: segmentation
Abstract: Objective. Elevated intracranial pressure (ICP) is recognized as a biomarker of secondary brain injury, with a significant linear correlation observed between optic nerve sheath diameter (ONSD) and ICP. Frequent monitoring of ONSD could effectively support dynamic evaluation of ICP. However, ONSD measurement is heavily reliant on the operator's experience and skill, particularly in manually selecting the optimal frame from ultrasound sequences and measuring ONSD. Approach. This paper presents a novel method to automatically identify the optimal frame from video sequences for ONSD measurement by employing the Kernel Correlation Filter (KCF) tracking algorithm and Simple Linear Iterative Clustering (SLIC) segmentation algorithm. The optic nerve sheath is mapped and measured using a Gaussian Mixture Model (GMM) combined with a KL-divergence-based method. Results. When compared with the average measurements of two expert clinicians, the proposed method achieved a mean error, mean squared deviation, and intraclass correlation coefficient (ICC) of 0.04, 0.054, and 0.782, respectively. Significance. The findings suggest that this method provides highly accurate automated ONSD measurements, showing potential for clinical application.

Title: SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

Authors: Sifan Li, Yujun Cai, Yiwei Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02803
Pdf URL: https://arxiv.org/pdf/2506.02803
Copy Paste: [[2506.02803]] SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking(https://arxiv.org/abs/2506.02803)
Keywords: security, robust
Abstract: Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%)-even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking) by simply scaling images to low resolutions (32-128 pixels), which unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.

Title: CART-based Synthetic Tabular Data Generation for Imbalanced Regression

Authors: António Pedro Pinheiro, Rita P. Ribeiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02811
Pdf URL: https://arxiv.org/pdf/2506.02811
Copy Paste: [[2506.02811]] CART-based Synthetic Tabular Data Generation for Imbalanced Regression(https://arxiv.org/abs/2506.02811)
Keywords: interpretability, generative
Abstract: Handling imbalanced target distributions in regression tasks remains a significant challenge in tabular data settings where underrepresented regions can hinder model performance. Among data-level solutions, some proposals, such as random sampling and SMOTE-based approaches, propose adapting classification techniques to regression tasks. However, these methods typically rely on crisp, artificial thresholds over the target variable, a limitation inherited from classification settings that can introduce arbitrariness, often leading to non-intuitive and potentially misleading problem formulations. While recent generative models, such as GANs and VAEs, provide flexible sample synthesis, they come with high computational costs and limited interpretability. In this study, we propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression. The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space and employs a threshold-free, feature-driven generation process. Our experimental study focuses on the prediction of extreme target values across benchmark datasets. The results indicate that the proposed method is competitive with other resampling and generative strategies in terms of performance, while offering faster execution and greater transparency. These results highlight the method's potential as a transparent, scalable data-level strategy for improving regression models in imbalanced domains.

Title: ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations

Authors: Ekaterina Grishina, Mikhail Gorbunov, Maxim Rakhuba
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02818
Pdf URL: https://arxiv.org/pdf/2506.02818
Copy Paste: [[2506.02818]] ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations(https://arxiv.org/abs/2506.02818)
Keywords: large language model
Abstract: Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at this https URL

Title: TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference

Authors: Yulin Dou, Jiangming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02827
Pdf URL: https://arxiv.org/pdf/2506.02827
Copy Paste: [[2506.02827]] TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference(https://arxiv.org/abs/2506.02827)
Keywords: large language model
Abstract: Large language models (LLMs) can effectively elicit human preferences through multi-turn dialogue. Complex tasks can be accomplished through iterative clarifying questions and final responses generated by an LLM acting as a questioner (STaR-GATE; Andukuri et al., 2024}). However, existing approaches based on self-taught reasoning struggle to identify optimal dialogue trajectories and avoid irrelevant questions to the tasks. To address this limitation, we propose TO-GATE, a novel framework that enhances question generation through trajectory optimization, which consists of two key components: a clarification resolver that generates optimal questioning trajectories, and a summarizer that ensures task-aligned final responses. The trajectory optimization enables the model to produce effective elicitation questions and summary responses tailored to specific tasks. Experimental results demonstrate that TO-GATE significantly outperforms baseline methods, achieving a 9.32% improvement on standard preference elicitation tasks.

Title: Random Registers for Cross-Domain Few-Shot Learning

Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02843
Pdf URL: https://arxiv.org/pdf/2506.02843
Copy Paste: [[2506.02843]] Random Registers for Cross-Domain Few-Shot Learning(https://arxiv.org/abs/2506.02843)
Keywords: transformer
Abstract: Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during the source-domain training, prompt tuning, as a common way to train ViT, could be harmful for the generalization of ViT in target domains, but setting them to random noises (i.e., random registers) could consistently improve target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during the training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscapes. In contrast, random registers are essentially a novel way of perturbing attention for the sharpness-aware minimization, which helps the model find a flattened minimum in loss landscapes, increasing the transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL to enhance the perturbation on attention maps by adding random registers on the semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Codes and models are available at this https URL.

Title: Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

Authors: Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02845
Pdf URL: https://arxiv.org/pdf/2506.02845
Copy Paste: [[2506.02845]] Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments(https://arxiv.org/abs/2506.02845)
Keywords: robust
Abstract: Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at this https URL.

Title: METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding

Authors: Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02850
Pdf URL: https://arxiv.org/pdf/2506.02850
Copy Paste: [[2506.02850]] METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding(https://arxiv.org/abs/2506.02850)
Keywords: large language model
Abstract: Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs' inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.

Title: Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation

Authors: Mingjie Wei, Xuemei Xie, Yutong Zhong, Guangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02853
Pdf URL: https://arxiv.org/pdf/2506.02853
Copy Paste: [[2506.02853]] Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation(https://arxiv.org/abs/2506.02853)
Keywords: transformer
Abstract: Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets. The code is available at this https URL.

Title: Hierarchical Self-Prompting SAM: A Prompt-Free Medical Image Segmentation Framework

Authors: Mengmeng Zhang, Xingyuan Dai, Yicheng Sun, Jing Wang, Yueyang Yao, Xiaoyan Gong, Fuze Cong, Feiyue Wang, Yisheng Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02854
Pdf URL: https://arxiv.org/pdf/2506.02854
Copy Paste: [[2506.02854]] Hierarchical Self-Prompting SAM: A Prompt-Free Medical Image Segmentation Framework(https://arxiv.org/abs/2506.02854)
Keywords: robust, segmentation
Abstract: Although the Segment Anything Model (SAM) is highly effective in natural image segmentation, it requires dependencies on prompts, which limits its applicability to medical imaging where manual prompts are often unavailable. Existing efforts to fine-tune SAM for medical segmentation typically struggle to remove this dependency. We propose Hierarchical Self-Prompting SAM (HSP-SAM), a novel self-prompting framework that enables SAM to achieve strong performance in prompt-free medical image segmentation. Unlike previous self-prompting methods that remain limited to positional prompts similar to vanilla SAM, we are the first to introduce learning abstract prompts during the self-prompting process. This simple and intuitive self-prompting framework achieves superior performance on classic segmentation tasks such as polyp and skin lesion segmentation, while maintaining robustness across diverse medical imaging modalities. Furthermore, it exhibits strong generalization to unseen datasets, achieving improvements of up to 14.04% over previous state-of-the-art methods on some challenging benchmarks. These results suggest that abstract prompts encapsulate richer and higher-dimensional semantic information compared to positional prompts, thereby enhancing the model's robustness and generalization performance. All models and codes will be released upon acceptance.

Title: Enhancing Abnormality Identification: Robust Out-of-Distribution Strategies for Deepfake Detection

Authors: Luca Maiano, Fabrizio Casadei, Irene Amerini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02857
Pdf URL: https://arxiv.org/pdf/2506.02857
Copy Paste: [[2506.02857]] Enhancing Abnormality Identification: Robust Out-of-Distribution Strategies for Deepfake Detection(https://arxiv.org/abs/2506.02857)
Keywords: robust, generative
Abstract: Detecting deepfakes has become a critical challenge in Computer Vision and Artificial Intelligence. Despite significant progress in detection techniques, generalizing them to open-set scenarios continues to be a persistent difficulty. Neural networks are often trained on the closed-world assumption, but with new generative models constantly evolving, it is inevitable to encounter data generated by models that are not part of the training distribution. To address these challenges, in this paper, we propose two novel Out-Of-Distribution (OOD) detection approaches. The first approach is trained to reconstruct the input image, while the second incorporates an attention mechanism for detecting OODs. Our experiments validate the effectiveness of the proposed approaches compared to existing state-of-the-art techniques. Our method achieves promising results in deepfake detection and ranks among the top-performing configurations on the benchmark, demonstrating their potential for robust, adaptable solutions in dynamic, real-world applications.

Title: ATAG: AI-Agent Application Threat Assessment with Attack Graphs

Authors: Parth Atulbhai Gandhi, Akansha Shukla, David Tayouri, Beni Ifland, Yuval Elovici, Rami Puzis, Asaf Shabtai
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02859
Pdf URL: https://arxiv.org/pdf/2506.02859
Copy Paste: [[2506.02859]] ATAG: AI-Agent Application Threat Assessment with Attack Graphs(https://arxiv.org/abs/2506.02859)
Keywords: secure, security, attack, robust, large language model
Abstract: Evaluating the security of multi-agent systems (MASs) powered by large language models (LLMs) is challenging, primarily because of the systems' complex internal dynamics and the evolving nature of LLM vulnerabilities. Traditional attack graph (AG) methods often lack the specific capabilities to model attacks on LLMs. This paper introduces AI-agent application Threat assessment with Attack Graphs (ATAG), a novel framework designed to systematically analyze the security risks associated with AI-agent applications. ATAG extends the MulVAL logic-based AG generation tool with custom facts and interaction rules to accurately represent AI-agent topologies, vulnerabilities, and attack scenarios. As part of this research, we also created the LLM vulnerability database (LVD) to initiate the process of standardizing LLM vulnerabilities documentation. To demonstrate ATAG's efficacy, we applied it to two multi-agent applications. Our case studies demonstrated the framework's ability to model and generate AGs for sophisticated, multi-step attack scenarios exploiting vulnerabilities such as prompt injection, excessive agency, sensitive information disclosure, and insecure output handling across interconnected agents. ATAG is an important step toward a robust methodology and toolset to help understand, visualize, and prioritize complex attack paths in multi-agent AI systems (MAASs). It facilitates proactive identification and mitigation of AI-agent threats in multi-agent applications.

Title: BNPO: Beta Normalization Policy Optimization

Authors: Changyi Xiao, Mengdi Zhang, Yixin Cao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02864
Pdf URL: https://arxiv.org/pdf/2506.02864
Copy Paste: [[2506.02864]] BNPO: Beta Normalization Policy Optimization(https://arxiv.org/abs/2506.02864)
Keywords: large language model
Abstract: Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. BNPO aligns the normalization with the changing policy distribution, enabling more precise and lower-variance gradient estimation, which in turn promotes stable training dynamics. We provide theoretical analysis demonstrating BNPO's variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Furthermore, we introduce an advantage decomposition mechanism to extend BNPO's applicability to more complex reward systems. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks. The code is available at this https URL.

Title: Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings

Authors: Amal S. Perera, David Fernandez, Chandi Witharana, Elias Manos, Michael Pimenta, Anna K. Liljedahl, Ingmar Nitze, Yili Yang, Todd Nicholson, Chia-Yu Hsu, Wenwen Li, Guido Grosse
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02868
Pdf URL: https://arxiv.org/pdf/2506.02868
Copy Paste: [[2506.02868]] Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings(https://arxiv.org/abs/2506.02868)
Keywords: robust, transformer, large language model
Abstract: Accurate mapping of permafrost landforms, thaw disturbances, and human-built infrastructure at pan-Arctic scale using sub-meter satellite imagery is increasingly critical. Handling petabyte-scale image data requires high-performance computing and robust feature detection models. While convolutional neural network (CNN)-based deep learning approaches are widely used for remote sensing (RS),similar to the success in transformer based large language models, Vision Transformers (ViTs) offer advantages in capturing long-range dependencies and global context via attention mechanisms. ViTs support pretraining via self-supervised learning-addressing the common limitation of labeled data in Arctic feature detection and outperform CNNs on benchmark datasets. Arctic also poses challenges for model generalization, especially when features with the same semantic class exhibit diverse spectral characteristics. To address these issues for Arctic feature detection, we integrate geospatial location embeddings into ViTs to improve adaptation across regions. This work investigates: (1) the suitability of pre-trained ViTs as feature extractors for high-resolution Arctic remote sensing tasks, and (2) the benefit of combining image and location embeddings. Using previously published datasets for Arctic feature detection, we evaluate our models on three tasks-detecting ice-wedge polygons (IWP), retrogressive thaw slumps (RTS), and human-built infrastructure. We empirically explore multiple configurations to fuse image embeddings and location embeddings. Results show that ViTs with location embeddings outperform prior CNN-based models on two of the three tasks including F1 score increase from 0.84 to 0.92 for RTS detection, demonstrating the potential of transformer-based models with spatial awareness for Arctic RS applications.

Title: Token and Span Classification for Entity Recognition in French Historical Encyclopedias

Authors: Ludovic Moncla, Hédi Zeghidi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.02872
Pdf URL: https://arxiv.org/pdf/2506.02872
Copy Paste: [[2506.02872]] Token and Span Classification for Entity Recognition in French Historical Encyclopedias(https://arxiv.org/abs/2506.02872)
Keywords: transformer, generative
Abstract: Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art performance, especially on nested entities, generative models offer promising alternatives when labeled data are scarce. The study highlights ongoing challenges in historical NER and suggests avenues for hybrid approaches combining symbolic and neural methods to better capture the intricacies of early modern French text.

Title: CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective

Authors: Jintian Shao, Yiming Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02878
Pdf URL: https://arxiv.org/pdf/2506.02878
Copy Paste: [[2506.02878]] CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective(https://arxiv.org/abs/2506.02878)
Keywords: large language model
Abstract: Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes. Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes.

Title: GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation

Authors: Sohyun Lee, Yeho Kwon, Lukas Hoyer, Suha Kwak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02882
Pdf URL: https://arxiv.org/pdf/2506.02882
Copy Paste: [[2506.02882]] GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation(https://arxiv.org/abs/2506.02882)
Keywords: robust, segmentation
Abstract: Improving robustness of the Segment Anything Model (SAM) to input degradations is critical for its deployment in high-stakes applications such as autonomous driving and robotics. Our approach to this challenge prioritizes three key aspects: first, parameter efficiency to maintain the inherent generalization capability of SAM; second, fine-grained and input-aware robustification to precisely address the input corruption; and third, adherence to standard training protocols for ease of training. To this end, we propose gated-rank adaptation (GaRA). GaRA introduces lightweight adapters into intermediate layers of the frozen SAM, where each adapter dynamically adjusts the effective rank of its weight matrix based on the input by selectively activating (rank-1) components of the matrix using a learned gating module. This adjustment enables fine-grained and input-aware robustification without compromising the generalization capability of SAM. Our model, GaRA-SAM, significantly outperforms prior work on all robust segmentation benchmarks. In particular, it surpasses the previous best IoU score by up to 21.3\%p on ACDC, a challenging real corrupted image dataset.

Title: Overcoming Challenges of Partial Client Participation in Federated Learning : A Comprehensive Review

Authors: Mrinmay Sen, Shruti Aparna, Rohit Agarwal, Chalavadi Krishna Mohan
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2506.02887
Pdf URL: https://arxiv.org/pdf/2506.02887
Copy Paste: [[2506.02887]] Overcoming Challenges of Partial Client Participation in Federated Learning : A Comprehensive Review(https://arxiv.org/abs/2506.02887)
Keywords: robust, federate, fair
Abstract: Federated Learning (FL) is a learning mechanism that falls under the distributed training umbrella, which collaboratively trains a shared global model without disclosing the raw data from different clients. This paper presents an extensive survey on the impact of partial client participation in federated learning. While much of the existing research focuses on addressing issues such as generalization, robustness, and fairness caused by data heterogeneity under the assumption of full client participation, limited attention has been given to the practical and theoretical challenges arising from partial client participation, which is common in real-world scenarios. This survey provides an in-depth review of existing FL methods designed to cope with partial client participation. We offer a comprehensive analysis supported by theoretical insights and empirical findings, along with a structured categorization of these methods, highlighting their respective advantages and disadvantages.

Title: Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

Authors: Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.02890
Pdf URL: https://arxiv.org/pdf/2506.02890
Copy Paste: [[2506.02890]] Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights(https://arxiv.org/abs/2506.02890)
Keywords: large language model
Abstract: Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches - utilizing more numerous, smaller experts - have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its scaling properties against standard MoE configurations for models with up to 56B total (17B active) parameters. We investigate convergence speed, model performance on downstream benchmarks, and practical training considerations across various setups. Overall, at the largest scale we show that fine-grained MoE achieves better validation loss and higher accuracy across a set of downstream benchmarks. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in the development of future large-scale models.

Title: Dense Match Summarization for Faster Two-view Estimation

Authors: Jonathan Astermark, Anders Heyden, Viktor Larsson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02893
Pdf URL: https://arxiv.org/pdf/2506.02893
Copy Paste: [[2506.02893]] Dense Match Summarization for Faster Two-view Estimation(https://arxiv.org/abs/2506.02893)
Keywords: robust
Abstract: In this paper, we speed up robust two-view relative pose from dense correspondences. Previous work has shown that dense matchers can significantly improve both accuracy and robustness in the resulting pose. However, the large number of matches comes with a significantly increased runtime during robust estimation in RANSAC. To avoid this, we propose an efficient match summarization scheme which provides comparable accuracy to using the full set of dense matches, while having 10-100x faster runtime. We validate our approach on standard benchmark datasets together with multiple state-of-the-art dense matchers.

Title: A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation

Authors: Verena Blaschke, Miriam Winkler, Constantin Förster, Gabriele Wenger-Glemser, Barbara Plank
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02894
Pdf URL: https://arxiv.org/pdf/2506.02894
Copy Paste: [[2506.02894]] A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation(https://arxiv.org/abs/2506.02894)
Keywords: robust
Abstract: Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.

Title: Sociodynamics-inspired Adaptive Coalition and Client Selection in Federated Learning

Authors: Alessandro Licciardi, Roberta Raineri, Anton Proskurnikov, Lamberto Rondoni, Lorenzo Zino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02897
Pdf URL: https://arxiv.org/pdf/2506.02897
Copy Paste: [[2506.02897]] Sociodynamics-inspired Adaptive Coalition and Client Selection in Federated Learning(https://arxiv.org/abs/2506.02897)
Keywords: privacy, federate
Abstract: Federated Learning (FL) enables privacy-preserving collaborative model training, yet its practical strength is often undermined by client data heterogeneity, which severely degrades model performance. This paper proposes that data heterogeneity across clients' distributions can be effectively addressed by adopting an approach inspired by opinion dynamics over temporal social networks. We introduce \shortname (Federated Coalition Variance Reduction with Boltzmann Exploration), a variance-reducing selection algorithm in which (1) clients dynamically organize into non-overlapping clusters based on asymptotic agreements, and (2) from each cluster, one client is selected to minimize the expected variance of its model update. Our experiments show that in heterogeneous scenarios our algorithm outperforms existing FL algorithms, yielding more accurate results and faster convergence, validating the efficacy of our approach.

Title: Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning

Authors: Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu
Subjects: cs.CL, cs.AI, cs.CE, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02911
Pdf URL: https://arxiv.org/pdf/2506.02911
Copy Paste: [[2506.02911]] Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning(https://arxiv.org/abs/2506.02911)
Keywords: large language model
Abstract: Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at this https URL.

Title: Towards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection

Authors: Yechi Ma, Wei Hua, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02914
Pdf URL: https://arxiv.org/pdf/2506.02914
Copy Paste: [[2506.02914]] Towards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection(https://arxiv.org/abs/2506.02914)
Keywords: segmentation
Abstract: A crucial yet under-appreciated prerequisite in machine learning solutions for real-applications is data annotation: human annotators are hired to manually label data according to detailed, expert-crafted guidelines. This is often a laborious, tedious, and costly process. To study methods for facilitating data annotation, we introduce a new benchmark AnnoGuide: Auto-Annotation from Annotation Guidelines. It aims to evaluate automated methods for data annotation directly from expert-defined annotation guidelines, eliminating the need for manual labeling. As a case study, we repurpose the well-established nuScenes dataset, commonly used in autonomous driving research, which provides comprehensive annotation guidelines for labeling LiDAR point clouds with 3D cuboids across 18 object classes. These guidelines include a few visual examples and textual descriptions, but no labeled 3D cuboids in LiDAR data, making this a novel task of multi-modal few-shot 3D detection without 3D annotations. The advances of powerful foundation models (FMs) make AnnoGuide especially timely, as FMs offer promising tools to tackle its challenges. We employ a conceptually straightforward pipeline that (1) utilizes open-source FMs for object detection and segmentation in RGB images, (2) projects 2D detections into 3D using known camera poses, and (3) clusters LiDAR points within the frustum of each 2D detection to generate a 3D cuboid. Starting with a non-learned solution that leverages off-the-shelf FMs, we progressively refine key components and achieve significant performance improvements, boosting 3D detection mAP from 12.1 to 21.9! Nevertheless, our results highlight that AnnoGuide remains an open and challenging problem, underscoring the urgent need for developing LiDAR-based FMs. We release our code and models at GitHub: this https URL

Title: INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification

Authors: Diogo A.P. Nunes, Eugénio Ribeiro
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02924
Pdf URL: https://arxiv.org/pdf/2506.02924
Copy Paste: [[2506.02924]] INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification(https://arxiv.org/abs/2506.02924)
Keywords: large language model
Abstract: In this work, we describe our team's approach to eRisk's 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck's Depression Inventory - II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled as to whether a given sentence was relevant or not w.r.t. one of BDI's symptoms. Due to this labeling limitation, we framed our development as a binary classification task for each BDI symptom, and evaluated accordingly. To that end, we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from 16 other teams.

Title: From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation

Authors: Junyi Fang, Yuxun Chen, Yuxin Chen, Chen Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02933
Pdf URL: https://arxiv.org/pdf/2506.02933
Copy Paste: [[2506.02933]] From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation(https://arxiv.org/abs/2506.02933)
Keywords: robust
Abstract: The Multi-Armed Bandit (MAB) problem is challenging in non-stationary environments where reward distributions evolve dynamically. We introduce RAVEN-UCB, a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation. It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order $K \sigma_{\max}^2 \log T / \Delta$ and gap-independent regret of order $\sqrt{K T \log T}$. RAVEN-UCB incorporates three innovations: (1) variance-driven exploration using $\sqrt{\hat{\sigma}_k^2 / (N_k + 1)}$ in confidence bounds, (2) adaptive control via $\alpha_t = \alpha_0 / \log(t + \epsilon)$, and (3) constant-time recursive updates for efficiency. Experiments across non-stationary patterns - distributional changes, periodic shifts, and temporary fluctuations - in synthetic and logistics scenarios demonstrate its superiority over state-of-the-art baselines, confirming theoretical and practical robustness.

Title: MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver

Authors: Yuepeng Zheng, Fu Luo, Zhenkun Wang, Yaoxin Wu, Yu Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02935
Pdf URL: https://arxiv.org/pdf/2506.02935
Copy Paste: [[2506.02935]] MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver(https://arxiv.org/abs/2506.02935)
Keywords: robust
Abstract: Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.

Title: MIND: Material Interface Generation from UDFs for Non-Manifold Surface Reconstruction

Authors: Xuhui Chen, Fei Hou, Wencheng Wang, Hong Qin, Ying He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02938
Pdf URL: https://arxiv.org/pdf/2506.02938
Copy Paste: [[2506.02938]] MIND: Material Interface Generation from UDFs for Non-Manifold Surface Reconstruction(https://arxiv.org/abs/2506.02938)
Keywords: robust, extraction
Abstract: Unsigned distance fields (UDFs) are widely used in 3D deep learning due to their ability to represent shapes with arbitrary topology. While prior work has largely focused on learning UDFs from point clouds or multi-view images, extracting meshes from UDFs remains challenging, as the learned fields rarely attain exact zero distances. A common workaround is to reconstruct signed distance fields (SDFs) locally from UDFs to enable surface extraction via Marching Cubes. However, this often introduces topological artifacts such as holes or spurious components. Moreover, local SDFs are inherently incapable of representing non-manifold geometry, leading to complete failure in such cases. To address this gap, we propose MIND (Material Interface from Non-manifold Distance fields), a novel algorithm for generating material interfaces directly from UDFs, enabling non-manifold mesh extraction from a global perspective. The core of our method lies in deriving a meaningful spatial partitioning from the UDF, where the target surface emerges as the interface between distinct regions. We begin by computing a two-signed local field to distinguish the two sides of manifold patches, and then extend this to a multi-labeled global field capable of separating all sides of a non-manifold structure. By combining this multi-labeled field with the input UDF, we construct material interfaces that support non-manifold mesh extraction via a multi-labeled Marching Cubes algorithm. Extensive experiments on UDFs generated from diverse data sources, including point cloud reconstruction, multi-view reconstruction, and medial axis transforms, demonstrate that our approach robustly handles complex non-manifold surfaces and significantly outperforms existing methods.

Title: An Algorithmic Pipeline for GDPR-Compliant Healthcare Data Anonymisation: Moving Toward Standardisation

Authors: Hamza Khan, Lore Menten, Liesbet M. Peeters
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02942
Pdf URL: https://arxiv.org/pdf/2506.02942
Copy Paste: [[2506.02942]] An Algorithmic Pipeline for GDPR-Compliant Healthcare Data Anonymisation: Moving Toward Standardisation(https://arxiv.org/abs/2506.02942)
Keywords: privacy, protect
Abstract: High-quality real-world data (RWD) is essential for healthcare but must be transformed to comply with the General Data Protection Regulation (GDPR). GDPRs broad definitions of quasi-identifiers (QIDs) and sensitive attributes (SAs) complicate implementation. We aim to standardise RWD anonymisation for GDPR compliance while preserving data utility by introducing an algorithmic method to identify QIDs and SAs and evaluate utility in anonymised datasets. We conducted a systematic literature review via ProQuest and PubMed to inform a three-stage anonymisation pipeline: identification, de-identification, and quasi-identifier dimension evaluation. The pipeline was implemented, validated, and tested on two mock RWD datasets (500 and 1000 rows). Privacy was assessed using k-anonymity, l-diversity, and t-closeness; utility was measured by non-uniform entropy (NUE). The review yielded two studies on QID/SA identification and five on utility metrics. Applying the pipeline, attributes were classified by re-identification risk using alpha and beta thresholds (25 percent/1 percent for 500 rows; 10 percent/1 percent for 1000 rows). Privacy metrics improved k-anonymity from 1 to 4 (500 rows) and 1 to 110 (1000 rows). NUE scores were 69.26 percent and 69.05 percent, respectively, indicating consistent utility despite varying privacy gains. We present a GDPR-compliant anonymisation pipeline for healthcare RWD that provides a reproducible approach to QID/SA identification and utility evaluation; publicly available code promotes standardisation, data privacy, and open science.

Title: Quantitative LLM Judges

Authors: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02945
Pdf URL: https://arxiv.org/pdf/2506.02945
Copy Paste: [[2506.02945]] Quantitative LLM Judges(https://arxiv.org/abs/2506.02945)
Keywords: large language model
Abstract: LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.

Title: Adaptive Graph Pruning for Multi-Agent Communication

Authors: Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2506.02951
Pdf URL: https://arxiv.org/pdf/2506.02951
Copy Paste: [[2506.02951]] Adaptive Graph Pruning for Multi-Agent Communication(https://arxiv.org/abs/2506.02951)
Keywords: large language model
Abstract: Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: firstly, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and consistently generalizes across multiple mainstream LLM architectures, with a increase in performance of $2.58\%\sim 9.84\%$; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with an extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, having fewer training steps and token consumption at the same time, with a decrease in token consumption of $90\%+$; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods. The performance will surpass the existing baselines after about ten steps of training under six benchmarks.

Title: HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring

Authors: Zhixiong Su, Yichen Wang, Herun Wan, Zhaohan Zhang, Minnan Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02959
Pdf URL: https://arxiv.org/pdf/2506.02959
Copy Paste: [[2506.02959]] HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring(https://arxiv.org/abs/2506.02959)
Keywords: large language model
Abstract: The misuse of large language models (LLMs) poses potential risks, motivating the development of machine-generated text (MGT) detection. Existing literature primarily concentrates on binary, document-level detection, thereby neglecting texts that are composed jointly by human and LLM contributions. Hence, this paper explores the possibility of fine-grained MGT detection under human-AI coauthoring. We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio. Specifically, we propose a dataset, HACo-Det, which produces human-AI coauthored texts via an automatic pipeline with word-level attribution labels. We retrofit seven prevailing document-level detectors to generalize them to word-level detection. Then we evaluate these detectors on HACo-Det on both word- and sentence-level detection tasks. Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score, while finetuned models show superior performance and better generalization across domains. However, we argue that fine-grained co-authored text detection is far from solved. We further analyze factors influencing performance, e.g., context window, and highlight the limitations of current methods, pointing to potential avenues for improvement.

Title: FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Authors: Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02961
Pdf URL: https://arxiv.org/pdf/2506.02961
Copy Paste: [[2506.02961]] FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models(https://arxiv.org/abs/2506.02961)
Keywords: privacy, federate, large language model
Abstract: Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely under explored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.

Title: FORLA:Federated Object-centric Representation Learning with Slot Attention

Authors: Guiqiu Liao, Matjaz Jogan, Eric Eaton, Daniel A. Hashimoto
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02964
Pdf URL: https://arxiv.org/pdf/2506.02964
Copy Paste: [[2506.02964]] FORLA:Federated Object-centric Representation Learning with Slot Attention(https://arxiv.org/abs/2506.02964)
Keywords: federate
Abstract: Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.

Title: Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs

Authors: Ze Yu Zhang, Bolin Ding, Bryan Kian Hsiang Low
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02965
Pdf URL: https://arxiv.org/pdf/2506.02965
Copy Paste: [[2506.02965]] Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs(https://arxiv.org/abs/2506.02965)
Keywords: privacy, protect, attack, robust, large language model
Abstract: Mixture-of-Experts (MoE) has been gaining popularity due to its successful adaptation to large language models (LLMs). In this work, we introduce Privacy-preserving Collaborative Mixture-of-Experts (PC-MoE), which leverages the sparsity of the MoE architecture for memory-efficient decentralized collaborative LLM training, enabling multiple parties with limited GPU-memory and data resources to collectively train more capable LLMs than they could achieve individually. At the same time, this approach protects training data privacy of each participant by keeping training data, as well as parts of the forward pass signal and gradients locally within each party. By design, PC-MoE synergistically combines the strengths of distributed computation with strong confidentiality assurances. Unlike most privacy-preserving schemes, which pay for confidentiality with lower task accuracy, our framework breaks that trade-off: across seven popular LLM benchmarks, it almost matches (and sometimes exceeds) the performance and convergence rate of a fully centralized model, enjoys near 70% peak GPU RAM reduction, while being fully robust against reconstruction attacks.

Title: Computation- and Communication-Efficient Online FL for Resource-Constrained Aerial Vehicles

Authors: Md-Ferdous Pervej, Richeng Jin, Md Moin Uddin Chowdhury, Simran Singh, İsmail Güvenç, Huaiyu Dai
Subjects: cs.LG, cs.NI, eess.SY
Abstract URL: https://arxiv.org/abs/2506.02972
Pdf URL: https://arxiv.org/pdf/2506.02972
Copy Paste: [[2506.02972]] Computation- and Communication-Efficient Online FL for Resource-Constrained Aerial Vehicles(https://arxiv.org/abs/2506.02972)
Keywords: privacy, federate
Abstract: Privacy-preserving distributed machine learning (ML) and aerial connected vehicle (ACV)-assisted edge computing have drawn significant attention lately. Since the onboard sensors of ACVs can capture new data as they move along their trajectories, the continual arrival of such 'newly' sensed data leads to online learning and demands carefully crafting the trajectories. Besides, as typical ACVs are inherently resource-constrained, computation- and communication-efficient ML solutions are needed. Therefore, we propose a computation- and communication-efficient online aerial federated learning (2CEOAFL) algorithm to take the benefits of continual sensed data and limited onboard resources of the ACVs. In particular, considering independently owned ACVs act as selfish data collectors, we first model their trajectories according to their respective time-varying data distributions. We then propose a 2CEOAFL algorithm that allows the flying ACVs to (a) prune the received dense ML model to make it shallow, (b) train the pruned model, and (c) probabilistically quantize and offload their trained accumulated gradients to the central server (CS). Our extensive simulation results show that the proposed 2CEOAFL algorithm delivers comparable performances to its non-pruned and nonquantized, hence, computation- and communication-inefficient counterparts.

Title: Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

Authors: Dingwei Chen, Ziqiang Liu, Feiteng Fang, Chak Tou Leong, Shiwen Ni, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang, Chengming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02973
Pdf URL: https://arxiv.org/pdf/2506.02973
Copy Paste: [[2506.02973]] Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation(https://arxiv.org/abs/2506.02973)
Keywords: diffusion, large language model
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as ''hallucinations'', remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs' internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.

Title: HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02975
Pdf URL: https://arxiv.org/pdf/2506.02975
Copy Paste: [[2506.02975]] HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation(https://arxiv.org/abs/2506.02975)
Keywords: transformer
Abstract: With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present the HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks over advanced unified models. All codes will be made public at this https URL.

Title: On the Robustness of Tabular Foundation Models: Test-Time Attacks and In-Context Defenses

Authors: Mohamed Djilani, Thibault Simonetto, Karim Tit, Florian Tambon, Paul Récamier, Salah Ghamizi, Maxime Cordy, Mike Papadakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02978
Pdf URL: https://arxiv.org/pdf/2506.02978
Copy Paste: [[2506.02978]] On the Robustness of Tabular Foundation Models: Test-Time Attacks and In-Context Defenses(https://arxiv.org/abs/2506.02978)
Keywords: security, defense, attack, robust
Abstract: Recent tabular Foundational Models (FM) such as TabPFN and TabICL, leverage in-context learning to achieve strong performance without gradient updates or fine-tuning. However, their robustness to adversarial manipulation remains largely unexplored. In this work, we present a comprehensive study of the adversarial vulnerabilities of tabular FM, focusing on both their fragility to targeted test-time attacks and their potential misuse as adversarial tools. We show on three benchmarks in finance, cybersecurity and healthcare, that small, structured perturbations to test inputs can significantly degrade prediction accuracy, even when training context remain fixed. Additionally, we demonstrate that tabular FM can be repurposed to generate transferable evasion to conventional models such as random forests and XGBoost, and on a lesser extent to deep tabular models. To improve tabular FM, we formulate the robustification problem as an optimization of the weights (adversarial fine-tuning), or the context (adversarial in-context learning). We introduce an in-context adversarial training strategy that incrementally replaces the context with adversarial perturbed instances, without updating model weights. Our approach improves robustness across multiple tabular benchmarks. Together, these findings position tabular FM as both a target and a source of adversarial threats, highlighting the urgent need for robust training and evaluation practices in this emerging paradigm.

Title: Astrophotography turbulence mitigation via generative models

Authors: Joonyeoup Kim, Yu Yuan, Xingguang Zhang, Xijun Wang, Stanley Chan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.02981
Pdf URL: https://arxiv.org/pdf/2506.02981
Copy Paste: [[2506.02981]] Astrophotography turbulence mitigation via generative models(https://arxiv.org/abs/2506.02981)
Keywords: diffusion, generative
Abstract: Photography is the cornerstone of modern astronomical and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, resulting in degraded imaging quality. While multi-frame strategies like lucky imaging can mitigate some effects, they involve intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods in astronomical image turbulence mitigation, providing higher perceptual quality and better structural fidelity under severe turbulence conditions. Our code and additional results are available at this https URL

Title: Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

Authors: Richard Armitage
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.02987
Pdf URL: https://arxiv.org/pdf/2506.02987
Copy Paste: [[2506.02987]] Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis(https://arxiv.org/abs/2506.02987)
Keywords: large language model
Abstract: Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Member of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.

Title: It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems

Authors: Iuliia Zaitova, Badr M. Abdullah, Wei Xue, Dietrich Klakow, Bernd Möbius, Tania Avgustinova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02995
Pdf URL: https://arxiv.org/pdf/2506.02995
Copy Paste: [[2506.02995]] It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems(https://arxiv.org/abs/2506.02995)
Keywords: large language model
Abstract: Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

Title: A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Authors: Đorđe Klisura, Astrid R Bernaga Torres, Anna Karen Gárate-Escamilla, Rajesh Roshan Biswal, Ke Yang, Hilal Pataci, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02998
Pdf URL: https://arxiv.org/pdf/2506.02998
Copy Paste: [[2506.02998]] A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems(https://arxiv.org/abs/2506.02998)
Keywords: privacy
Abstract: Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini's zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.

Title: DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models

Authors: Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03007
Pdf URL: https://arxiv.org/pdf/2506.03007
Copy Paste: [[2506.03007]] DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models(https://arxiv.org/abs/2506.03007)
Keywords: generative
Abstract: With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of the AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present \textbf{DFBench}, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) latest model, the fake images are generated by 12 state-of-the-art generation models, and (iii) bidirectional benchmarking and evaluating for both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose \textbf{MoA-DF}, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at this https URL.

Title: Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech

Authors: Florian Ludwig, Torsten Zesch, Frederike Zufall
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03009
Pdf URL: https://arxiv.org/pdf/2506.03009
Copy Paste: [[2506.03009]] Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech(https://arxiv.org/abs/2506.03009)
Keywords: large language model
Abstract: The assessment of legal problems requires the consideration of a specific legal system and its levels of abstraction, from constitutional law to statutory law to case law. The extent to which Large Language Models (LLMs) internalize such legal systems is unknown. In this paper, we propose and investigate different approaches to condition LLMs at different levels of abstraction in legal systems. This paper examines different approaches to conditioning LLMs at multiple levels of abstraction in legal systems to detect potentially punishable hate speech. We focus on the task of classifying whether a specific social media posts falls under the criminal offense of incitement to hatred as prescribed by the German Criminal Code. The results show that there is still a significant performance gap between models and legal experts in the legal assessment of hate speech, regardless of the level of abstraction with which the models were conditioned. Our analysis revealed, that models conditioned on abstract legal knowledge lacked deep task understanding, often contradicting themselves and hallucinating answers, while models using concrete legal knowledge performed reasonably well in identifying relevant target groups, but struggled with classifying target conducts.

Title: Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning

Authors: Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.03035
Pdf URL: https://arxiv.org/pdf/2506.03035
Copy Paste: [[2506.03035]] Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning(https://arxiv.org/abs/2506.03035)
Keywords: large language model
Abstract: Understanding user queries is fundamental in many applications, such as home assistants, booking systems, or recommendations. Accordingly, it is crucial to develop accurate Spoken Language Understanding (SLU) approaches to ensure the reliability of the considered system. Current State-of-the-Art SLU techniques rely on large amounts of training data; however, only limited annotated examples are available for specific tasks or languages. In the meantime, instruction-tuned large language models (LLMs) have shown exceptional performance on unseen tasks in a few-shot setting when provided with adequate prompts. In this work, we propose to explore example selection by leveraging Information retrieval (IR) approaches to build an enhanced prompt that is applied to an SLU task. We evaluate the effectiveness of the proposed method on several SLU benchmarks. Experimental results show that lexical IR methods significantly enhance performance without increasing prompt length.

Title: Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

Authors: Jintian Shao, Yiming Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03038
Pdf URL: https://arxiv.org/pdf/2506.03038
Copy Paste: [[2506.03038]] Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective(https://arxiv.org/abs/2506.03038)
Keywords: robust, large language model
Abstract: Reinforcement learning (RL) enhances large language models (LLMs) in complex, long-chain-of-thought (long-CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms like Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long-term value for fine-grained, step-by-step policy guidance in extended reasoning chains. We argue these limitations stem from inherent difficulties in credit assignment, value function representational capacity with temporally abstracted goals, and translating global value signals into local policy improvements, especially with sparse rewards. Our theoretical analysis examines these aspects to illuminate VAPO's boundaries in long-term value modeling, aiming to deepen understanding of current RL for advanced reasoning and suggest future research for more robust LLM agents.

Title: Sample complexity of Schrödinger potential estimation

Authors: Nikita Puchkin, Iurii Pustovalov, Yuri Sapronov, Denis Suchkov, Alexey Naumov, Denis Belomestny
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03043
Pdf URL: https://arxiv.org/pdf/2506.03043
Copy Paste: [[2506.03043]] Sample complexity of Schrödinger potential estimation(https://arxiv.org/abs/2506.03043)
Keywords: diffusion, generative
Abstract: We address the problem of Schrödinger potential estimation, which plays a crucial role in modern generative modelling approaches based on Schrödinger bridges and stochastic optimal control for SDEs. Given a simple prior diffusion process, these methods search for a path between two given distributions $\rho_0$ and $\rho_T^*$ requiring minimal efforts. The optimal drift in this case can be expressed through a Schrödinger potential. In the present paper, we study generalization ability of an empirical Kullback-Leibler (KL) risk minimizer over a class of admissible log-potentials aimed at fitting the marginal distribution at time $T$. Under reasonable assumptions on the target distribution $\rho_T^*$ and the prior process, we derive a non-asymptotic high-probability upper bound on the KL-divergence between $\rho_T^*$ and the terminal density corresponding to the estimated log-potential. In particular, we show that the excess KL-risk may decrease as fast as $O(\log^2 n / n)$ when the sample size $n$ tends to infinity even if both $\rho_0$ and $\rho_T^*$ have unbounded supports.

Title: Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs

Authors: Yuval Kansal, Shmuel Berman, Lydia Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03051
Pdf URL: https://arxiv.org/pdf/2506.03051
Copy Paste: [[2506.03051]] Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs(https://arxiv.org/abs/2506.03051)
Keywords: large language model
Abstract: Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues of grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.

Title: Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03065
Pdf URL: https://arxiv.org/pdf/2506.03065
Copy Paste: [[2506.03065]] Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers(https://arxiv.org/abs/2506.03065)
Keywords: diffusion, transformer
Abstract: While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. And even 3-6\% attention heads can be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09$\times$, 2.38$\times$, and 1.67$\times$ theoretical FLOP reduction, and actual inference speedups of 1.76$\times$, 1.85$\times$, and 1.58$\times$, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.

Title: EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models

Authors: Mingzhe Li, Gehao Zhang, Zhenting Wang, Shiqing Ma, Siqi Pan, Richard Cartwright, Juan Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03067
Pdf URL: https://arxiv.org/pdf/2506.03067
Copy Paste: [[2506.03067]] EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.03067)
Keywords: interpretability, watermark, diffusion, large language model, segmentation
Abstract: Text-to-image generation models~(e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called \sys for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on the widely-used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.

Title: LEG-SLAM: Real-Time Language-Enhanced Gaussian Splatting for SLAM

Authors: Roman Titkov, Egor Zubkov, Dmitry Yudin, Jaafar Mahmoud, Malik Mohrat, Gennady Sidorov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03073
Pdf URL: https://arxiv.org/pdf/2506.03073
Copy Paste: [[2506.03073]] LEG-SLAM: Real-Time Language-Enhanced Gaussian Splatting for SLAM(https://arxiv.org/abs/2506.03073)
Keywords: extraction
Abstract: Modern Gaussian Splatting methods have proven highly effective for real-time photorealistic rendering of 3D scenes. However, integrating semantic information into this representation remains a significant challenge, especially in maintaining real-time performance for SLAM (Simultaneous Localization and Mapping) applications. In this work, we introduce LEG-SLAM -- a novel approach that fuses an optimized Gaussian Splatting implementation with visual-language feature extraction using DINOv2 followed by a learnable feature compressor based on Principal Component Analysis, while enabling an online dense SLAM. Our method simultaneously generates high-quality photorealistic images and semantically labeled scene maps, achieving real-time scene reconstruction with more than 10 fps on the Replica dataset and 18 fps on ScanNet. Experimental results show that our approach significantly outperforms state-of-the-art methods in reconstruction speed while achieving competitive rendering quality. The proposed system eliminates the need for prior data preparation such as camera's ego motion or pre-computed static semantic maps. With its potential applications in autonomous robotics, augmented reality, and other interactive domains, LEG-SLAM represents a significant step forward in real-time semantic 3D Gaussian-based SLAM. Project page: this https URL

Title: Agnostic Learning under Targeted Poisoning: Optimal Rates and the Role of Randomness

Authors: Bogdan Chornomaz, Yonatan Koren, Shay Moran, Tom Waknine
Subjects: cs.LG, math.PR
Abstract URL: https://arxiv.org/abs/2506.03075
Pdf URL: https://arxiv.org/pdf/2506.03075
Copy Paste: [[2506.03075]] Agnostic Learning under Targeted Poisoning: Optimal Rates and the Role of Randomness(https://arxiv.org/abs/2506.03075)
Keywords: attack
Abstract: We study the problem of learning in the presence of an adversary that can corrupt an $\eta$ fraction of the training examples with the goal of causing failure on a specific test point. In the realizable setting, prior work established that the optimal error under such instance-targeted poisoning attacks scales as $\Theta(d\eta)$, where $d$ is the VC dimension of the hypothesis class arXiv:2210.02713. In this work, we resolve the corresponding question in the agnostic setting. We show that the optimal excess error is $\tilde{\Theta}(\sqrt{d\eta})$, answering one of the main open problems left by Hanneke et al. To achieve this rate, it is necessary to use randomized learners: Hanneke et al. showed that deterministic learners can be forced to suffer error close to 1, even under small amounts of poisoning. Perhaps surprisingly, our upper bound remains valid even when the learner's random bits are fully visible to the adversary . In the other direction, our lower bound is stronger than standard PAC-style bounds: instead of tailoring a hard distribution separately for each sample size, we exhibit a single fixed distribution under which the adversary can enforce an excess error of $\Omega(\sqrt{d\eta})$ infinitely often.

Title: StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03077
Pdf URL: https://arxiv.org/pdf/2506.03077
Copy Paste: [[2506.03077]] StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs(https://arxiv.org/abs/2506.03077)
Keywords: transformer
Abstract: Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP achieves less computational FLOPs and faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times larger, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at this https URL.

Title: ORV: 4D Occupancy-centric Robot Video Generation

Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03079
Pdf URL: https://arxiv.org/pdf/2506.03079
Copy Paste: [[2506.03079]] ORV: 4D Occupancy-centric Robot Video Generation(https://arxiv.org/abs/2506.03079)
Keywords: generative
Abstract: Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations - an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: this https URL

Title: SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis

Authors: Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03082
Pdf URL: https://arxiv.org/pdf/2506.03082
Copy Paste: [[2506.03082]] SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis(https://arxiv.org/abs/2506.03082)
Keywords: diffusion, generative
Abstract: Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short in providing the necessary photorealism and the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting the fine-grained human control aspect. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery. While SG2VID outperforms previous methods both qualitatively and quantitatively, it also enables precise synthesis, providing accurate control over tool and anatomy's size and movement, entrance of new tools, as well as the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID's ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.

Title: InterMamba: Efficient Human-Human Interaction Generation with Adaptive Spatio-Temporal Mamba

Authors: Zizhao Wu, Yingying Sun, Yiming Chen, Xiaoling Gu, Ruyu Liu, Jiazhou Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03084
Pdf URL: https://arxiv.org/pdf/2506.03084
Copy Paste: [[2506.03084]] InterMamba: Efficient Human-Human Interaction Generation with Adaptive Spatio-Temporal Mamba(https://arxiv.org/abs/2506.03084)
Keywords: transformer
Abstract: Human-human interaction generation has garnered significant attention in motion synthesis due to its vital role in understanding humans as social beings. However, existing methods typically rely on transformer-based architectures, which often face challenges related to scalability and efficiency. To address these issues, we propose a novel, efficient human-human interaction generation method based on the Mamba framework, designed to meet the demands of effectively capturing long-sequence dependencies while providing real-time feedback. Specifically, we introduce an adaptive spatio-temporal Mamba framework that utilizes two parallel SSM branches with an adaptive mechanism to integrate the spatial and temporal features of motion sequences. To further enhance the model's ability to capture dependencies within individual motion sequences and the interactions between different individual sequences, we develop two key modules: the self-adaptive spatio-temporal Mamba module and the cross-adaptive spatio-temporal Mamba module, enabling efficient feature learning. Extensive experiments demonstrate that our method achieves state-of-the-art results on two interaction datasets with remarkable quality and efficiency. Compared to the baseline method InterGen, our approach not only improves accuracy but also requires a minimal parameter size of just 66M ,only 36% of InterGen's, while achieving an average inference speed of 0.57 seconds, which is 46% of InterGen's execution time.

Title: Non-Asymptotic Length Generalization

Authors: Thomas Chen, Tengyu Ma, Zhiyuan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03085
Pdf URL: https://arxiv.org/pdf/2506.03085
Copy Paste: [[2506.03085]] Non-Asymptotic Length Generalization(https://arxiv.org/abs/2506.03085)
Keywords: transformer
Abstract: Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of ground-truth function under some given complexity measure. We refer to this minimum input length to length generalize as length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is $2n - 2$ where $n$ is the number of states of the ground-truth automaton. Our main results are upper bounds of length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is $O(T^2)$ when the ground-truth function has precision $T$, and that the length complexity of 2-layer C-RASP functions is $O(T^{O(K)})$ when the ground-truth function has precision $T$ and $K$ heads.

Title: How Explanations Leak the Decision Logic: Stealing Graph Neural Networks via Explanation Alignment

Authors: Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03087
Pdf URL: https://arxiv.org/pdf/2506.03087
Copy Paste: [[2506.03087]] How Explanations Leak the Decision Logic: Stealing Graph Neural Networks via Explanation Alignment(https://arxiv.org/abs/2506.03087)
Keywords: security, protect, attack, steal
Abstract: Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to growing demands for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose {\method}, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at this https URL.

Title: Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness

Authors: Lucas Piper, Arlindo L. Oliveira, Tiago Marques
Subjects: cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.03089
Pdf URL: https://arxiv.org/pdf/2506.03089
Copy Paste: [[2506.03089]] Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness(https://arxiv.org/abs/2506.03089)
Keywords: robust
Abstract: Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end block (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit stronger emergent shape bias and overperform the base CNN architecture by 8.5% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 7.3% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.

Title: From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Authors: Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03093
Pdf URL: https://arxiv.org/pdf/2506.03093
Copy Paste: [[2506.03093]] From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit(https://arxiv.org/abs/2506.03093)
Keywords: interpretability
Abstract: Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.

Title: FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Authors: Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03096
Pdf URL: https://arxiv.org/pdf/2506.03096
Copy Paste: [[2506.03096]] FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens(https://arxiv.org/abs/2506.03096)
Keywords: transformer
Abstract: Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.

Title: EgoVLM: Policy Optimization for Egocentric Video Understanding

Authors: Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03097
Pdf URL: https://arxiv.org/pdf/2506.03097
Copy Paste: [[2506.03097]] EgoVLM: Policy Optimization for Egocentric Video Understanding(https://arxiv.org/abs/2506.03097)
Keywords: robust, interpretability
Abstract: Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.

Title: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chao Yang, Helen Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03106
Pdf URL: https://arxiv.org/pdf/2506.03106
Copy Paste: [[2506.03106]] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback(https://arxiv.org/abs/2506.03106)
Keywords: large language model
Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.

Title: ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Authors: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03107
Pdf URL: https://arxiv.org/pdf/2506.03107
Copy Paste: [[2506.03107]] ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions(https://arxiv.org/abs/2506.03107)
Keywords: diffusion, transformer
Abstract: Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.

Title: Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning

Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03110
Pdf URL: https://arxiv.org/pdf/2506.03110
Copy Paste: [[2506.03110]] Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning(https://arxiv.org/abs/2506.03110)
Keywords: transformer
Abstract: Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applying it to downstream distant domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at this https URL.

Title: Rectified Flows for Fast Multiscale Fluid Flow Modeling

Authors: Victor Armegioiu, Yannick Ramic, Siddhartha Mishra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03111
Pdf URL: https://arxiv.org/pdf/2506.03111
Copy Paste: [[2506.03111]] Rectified Flows for Fast Multiscale Fluid Flow Modeling(https://arxiv.org/abs/2506.03111)
Keywords: diffusion
Abstract: The statistical modeling of fluid flows is very challenging due to their multiscale dynamics and extreme sensitivity to initial conditions. While recently proposed conditional diffusion models achieve high fidelity, they typically require hundreds of stochastic sampling steps at inference. We introduce a rectified flow framework that learns a time-dependent velocity field, transporting input to output distributions along nearly straight trajectories. By casting sampling as solving an ordinary differential equation (ODE) along this straighter flow field, our method makes each integration step much more effective, using as few as eight steps versus (more than) 128 steps in standard score-based diffusion, without sacrificing predictive fidelity. Experiments on challenging multiscale flow benchmarks show that rectified flows recover the same posterior distributions as diffusion models, preserve fine-scale features that MSE-trained baselines miss, and deliver high-resolution samples in a fraction of inference time.

Title: Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery

Authors: Michelle Chen, David Russell, Amritha Pallavoor, Derek Young, Jane Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03114
Pdf URL: https://arxiv.org/pdf/2506.03114
Copy Paste: [[2506.03114]] Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery(https://arxiv.org/abs/2506.03114)
Keywords: segmentation
Abstract: Large-scale delineation of individual trees from remote sensing imagery is crucial to the advancement of ecological research, particularly as climate change and other environmental factors rapidly transform forest landscapes across the world. Current RGB tree segmentation methods rely on training specialized machine learning models with labeled tree datasets. While these learning-based approaches can outperform manual data collection when accurate, the existing models still depend on training data that's hard to scale. In this paper, we investigate the efficacy of using a state-of-the-art image segmentation model, Segment Anything Model 2 (SAM2), in a zero-shot manner for individual tree detection and segmentation. We evaluate a pretrained SAM2 model on two tasks in this domain: (1) zero-shot segmentation and (2) zero-shot transfer by using predictions from an existing tree detection model as prompts. Our results suggest that SAM2 not only has impressive generalization capabilities, but also can form a natural synergy with specialized methods trained on in-domain labeled data. We find that applying large pretrained models to problems in remote sensing is a promising avenue for future progress. We make our code available at: this https URL.

Title: Controllable Human-centric Keyframe Interpolation with Generative Prior

Authors: Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03119
Pdf URL: https://arxiv.org/pdf/2506.03119
Copy Paste: [[2506.03119]] Controllable Human-centric Keyframe Interpolation with Generative Prior(https://arxiv.org/abs/2506.03119)
Keywords: diffusion, generative
Abstract: Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

Title: AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation

Authors: Prashanth Vijayaraghavan, Luyao Shi, Ehsan Degan, Vandana Mukherjee, Xin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.03122
Pdf URL: https://arxiv.org/pdf/2506.03122
Copy Paste: [[2506.03122]] AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation(https://arxiv.org/abs/2506.03122)
Keywords: large language model
Abstract: Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL,a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.

Title: DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Authors: Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03123
Pdf URL: https://arxiv.org/pdf/2506.03123
Copy Paste: [[2506.03123]] DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation(https://arxiv.org/abs/2506.03123)
Keywords: diffusion
Abstract: Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient \textbf{Dual-Expert Consistency Model~(DCM)}, where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail this http URL approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at \href{this https URL}{this https URL}.

Title: AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

Authors: Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03126
Pdf URL: https://arxiv.org/pdf/2506.03126
Copy Paste: [[2506.03126]] AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation(https://arxiv.org/abs/2506.03126)
Keywords: diffusion, large language model
Abstract: Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlight the value of our dataset for coherent animated video generation.

Title: Native-Resolution Image Synthesis

Authors: Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03131
Pdf URL: https://arxiv.org/pdf/2506.03131
Copy Paste: [[2506.03131]] Native-Resolution Image Synthesis(https://arxiv.org/abs/2506.03131)
Keywords: robust, diffusion, transformer, generative, large language model
Abstract: We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves the state-of-the-art performance on both ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.

Title: SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

Authors: Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, Yueting Zhuang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03139
Pdf URL: https://arxiv.org/pdf/2506.03139
Copy Paste: [[2506.03139]] SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation(https://arxiv.org/abs/2506.03139)
Keywords: large language model
Abstract: Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at this https URL.

Title: Not All Tokens Are Meant to Be Forgotten

Authors: Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Douglas Zytko, Prashant Khanduri, Dongxiao Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03142
Pdf URL: https://arxiv.org/pdf/2506.03142
Copy Paste: [[2506.03142]] Not All Tokens Are Meant to Be Forgotten(https://arxiv.org/abs/2506.03142)
Keywords: privacy, large language model
Abstract: Large Language Models (LLMs), pre-trained on massive text corpora, exhibit remarkable human-level language understanding, reasoning, and decision-making abilities. However, they tend to memorize unwanted information, such as private or copyrighted content, raising significant privacy and legal concerns. Unlearning has emerged as a promising solution, but existing methods face a significant challenge of over-forgetting. This issue arises because they indiscriminately suppress the generation of all the tokens in forget samples, leading to a substantial loss of model utility. To overcome this challenge, we introduce the Targeted Information Forgetting (TIF) framework, which consists of (1) a flexible targeted information identifier designed to differentiate between unwanted words (UW) and general words (GW) in the forget samples, and (2) a novel Targeted Preference Optimization approach that leverages Logit Preference Loss to unlearn unwanted information associated with UW and Preservation Loss to retain general information in GW, effectively improving the unlearning process while mitigating utility degradation. Extensive experiments on the TOFU and MUSE benchmarks demonstrate that the proposed TIF framework enhances unlearning effectiveness while preserving model utility and achieving state-of-the-art results.

Title: GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03143
Pdf URL: https://arxiv.org/pdf/2506.03143
Copy Paste: [[2506.03143]] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents(https://arxiv.org/abs/2506.03143)
Keywords: transformer
Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.

Title: Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM

Authors: Pralaypati Ta, Sriram Venkatesaperumal, Keerthi Ram, Mohanasankar Sivaprakasam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03145
Pdf URL: https://arxiv.org/pdf/2506.03145
Copy Paste: [[2506.03145]] Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM(https://arxiv.org/abs/2506.03145)
Keywords: extraction, large language model
Abstract: Neuroscience research publications encompass a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential for advancing the field. However, when knowledge is dispersed across multiple sources, current state-of-the-art retrieval methods often struggle to extract the necessary information. A knowledge graph (KG) can integrate and link knowledge from multiple sources, but existing methods for constructing KGs in neuroscience often rely on labeled data and require domain expertise. Acquiring large-scale, labeled data for a specialized area like neuroscience presents significant challenges. This work proposes novel methods for constructing KG from unlabeled large-scale neuroscience research corpus utilizing large language models (LLM), neuroscience ontology, and text embeddings. We analyze the semantic relevance of neuroscience text segments identified by LLM for building the knowledge graph. We also introduce an entity-augmented information retrieval algorithm to extract knowledge from the KG. Several experiments were conducted to evaluate the proposed approaches, and the results demonstrate that our methods significantly enhance knowledge discovery from the unlabeled neuroscience research corpus. It achieves an F1 score of 0.84 for entity extraction, and the knowledge obtained from the KG improves answers to over 54% of the questions.

Title: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Authors: Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03147
Pdf URL: https://arxiv.org/pdf/2506.03147
Copy Paste: [[2506.03147]] UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation(https://arxiv.org/abs/2506.03147)
Keywords: generative
Abstract: Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, their models are limited in exploring image perception and manipulation tasks, which are urgently desired by users for wide applications. Recently, OpenAI released their powerful GPT-4o-Image model for comprehensive image perception and manipulation, achieving expressive capability and attracting community interests. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders instead of VAE, while VAEs are considered essential components in many image manipulation models. Motivated by such inspiring observations, we present a unified generative framework named UniWorld based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% amount of BAGEL's data, which consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.

Title: IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

Authors: Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Ronald Clark, Ming-Hsuan Yang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.03150
Pdf URL: https://arxiv.org/pdf/2506.03150
Copy Paste: [[2506.03150]] IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation(https://arxiv.org/abs/2506.03150)
Keywords: diffusion
Abstract: Although diffusion-based models can generate high-quality and high-resolution video sequences from textual or image inputs, they lack explicit integration of geometric cues when controlling scene lighting and visual appearance across frames. To address this limitation, we propose IllumiCraft, an end-to-end diffusion framework accepting three complementary inputs: (1) high-dynamic-range (HDR) video maps for detailed lighting control; (2) synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) to provide appearance cues; and (3) 3D point tracks that capture precise 3D geometry information. By integrating the lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts. It supports background-conditioned and text-conditioned video relighting and provides better fidelity than existing controllable video generation methods. Project Page: this https URL