2025-06-27

Title: Progressive Size-Adaptive Federated Learning: A Comprehensive Framework for Heterogeneous Multi-Modal Data Systems

Authors: Sajid Hussain, Muhammad Sohail, Nauman Ali Khan, Naima Iltaf, Ihtesham ul Islam
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20685
Pdf URL: https://arxiv.org/pdf/2506.20685
Copy Paste: [[2506.20685]] Progressive Size-Adaptive Federated Learning: A Comprehensive Framework for Heterogeneous Multi-Modal Data Systems(https://arxiv.org/abs/2506.20685)
Keywords: privacy, federate
Abstract: Federated Learning (FL) has emerged as a transformative paradigm for distributed machine learning while preserving data privacy. However, existing approaches predominantly focus on model heterogeneity and aggregation techniques, largely overlooking the fundamental impact of dataset size characteristics on federated training dynamics. This paper introduces Size-Based Adaptive Federated Learning (SAFL), a novel progressive training framework that systematically organizes federated learning based on dataset size characteristics across heterogeneous multi-modal data. Our comprehensive experimental evaluation across 13 diverse datasets spanning 7 modalities (vision, text, time series, audio, sensor, medical vision, and multimodal) reveals critical insights: 1) an optimal dataset size range of 1000-1500 samples for federated learning effectiveness; 2) a clear modality performance hierarchy with structured data (time series, sensor) significantly outperforming unstructured data (text, multimodal); and 3) systematic performance degradation for large datasets exceeding 2000 samples. SAFL achieves an average accuracy of 87.68% across all datasets, with structured data modalities reaching 99%+ accuracy. The framework demonstrates superior communication efficiency, reducing total data transfer to 7.38 GB across 558 communications while maintaining high performance. Our real-time monitoring framework provides unprecedented insights into system resource utilization, network efficiency, and training dynamics. This work fills critical gaps in understanding how data characteristics should drive federated learning strategies, providing both theoretical insights and practical guidance for real-world FL deployments in neural network and learning systems.

Title: E-ABIN: an Explainable module for Anomaly detection in BIological Networks

Authors: Ugo Lomoio, Tommaso Mazza, Pierangelo Veltri, Pietro Hiram Guzzi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.20693
Pdf URL: https://arxiv.org/pdf/2506.20693
Copy Paste: [[2506.20693]] E-ABIN: an Explainable module for Anomaly detection in BIological Networks(https://arxiv.org/abs/2506.20693)
Keywords: robust, interpretability
Abstract: The increasing availability of large-scale omics data calls for robust analytical frameworks capable of handling complex gene expression datasets while offering interpretable results. Recent advances in artificial intelligence have enabled the identification of aberrant molecular patterns distinguishing disease states from healthy controls. Coupled with improvements in model interpretability, these tools now support the identification of genes potentially driving disease phenotypes. However, current approaches to gene anomaly detection often remain limited to single datasets and lack accessible graphical interfaces. Here, we introduce E-ABIN, a general-purpose, explainable framework for Anomaly detection in Biological Networks. E-ABIN combines classical machine learning and graph-based deep learning techniques within a unified, user-friendly platform, enabling the detection and interpretation of anomalies from gene expression or methylation-derived networks. By integrating algorithms such as Support Vector Machines, Random Forests, Graph Autoencoders (GAEs), and Graph Adversarial Attributed Networks (GAANs), E-ABIN ensures a high predictive accuracy while maintaining interpretability. We demonstrate the utility of E-ABIN through case studies of bladder cancer and coeliac disease, where it effectively uncovers biologically relevant anomalies and offers insights into disease mechanisms.

Title: Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models

Authors: Vineet Jain, Kusha Sareen, Mohammad Pedramfar, Siamak Ravanbakhsh
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.20701
Pdf URL: https://arxiv.org/pdf/2506.20701
Copy Paste: [[2506.20701]] Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models(https://arxiv.org/abs/2506.20701)
Keywords: diffusion, generative
Abstract: Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. Existing steering methods suffer from inaccurate value estimation, especially at high noise levels, which biases guidance. Moreover, information from past runs is not reused to improve sample quality, resulting in inefficient use of compute. Inspired by the success of Monte Carlo Tree Search, we address these limitations by casting inference-time alignment as a search problem that reuses past computations. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain and iteratively refining value estimates with each additional generation. Our proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS$^\star$), performs a global search for high reward samples. On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to $10\times$ less compute. In text-to-image generation and language completion tasks, DTS$^\star$ effectively searches for high reward samples that match best-of-N with up to $5\times$ less compute. By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples, providing a scalable approach for inference-time alignment of diffusion models.

Title: On Convolutions, Intrinsic Dimension, and Diffusion Models

Authors: Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.20705
Pdf URL: https://arxiv.org/pdf/2506.20705
Copy Paste: [[2506.20705]] On Convolutions, Intrinsic Dimension, and Diffusion Models(https://arxiv.org/abs/2506.20705)
Keywords: diffusion, generative
Abstract: The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs) -- which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process -- have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses, among others they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.

Title: Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset

Authors: Zhiqi Gao, Tianyi Li, Yurii Kvasiuk, Sai Chaitanya Tadepalli, Maja Rudolph, Daniel J.H. Chung, Frederic Sala, Moritz Münchmeyer
Subjects: cs.LG, astro-ph.CO, cs.AI, hep-ph, hep-th
Abstract URL: https://arxiv.org/abs/2506.20729
Pdf URL: https://arxiv.org/pdf/2506.20729
Copy Paste: [[2506.20729]] Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset(https://arxiv.org/abs/2506.20729)
Keywords: large language model
Abstract: Large language models (LLMs) have shown strong capabilities in complex reasoning, and test-time scaling techniques can enhance their performance with comparably low cost. Many of these methods have been developed and evaluated on mathematical reasoning benchmarks such as AIME. This paper investigates whether the lessons learned from these benchmarks generalize to the domain of advanced theoretical physics. We evaluate a range of common test-time scaling methods on the TPBench physics dataset and compare their effectiveness with results on AIME. To better leverage the structure of physics problems, we develop a novel, symbolic weak-verifier framework to improve parallel scaling results. Our empirical results demonstrate that this method significantly outperforms existing test-time scaling approaches on TPBench. We also evaluate our method on AIME, confirming its effectiveness in solving advanced mathematical problems. Our findings highlight the power of step-wise symbolic verification for tackling complex scientific problems.

Title: OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

Authors: Qin Ren, Yifan Wang, Ruogu Fang, Haibin Ling, Chenyu You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20741
Pdf URL: https://arxiv.org/pdf/2506.20741
Copy Paste: [[2506.20741]] OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport(https://arxiv.org/abs/2506.20741)
Keywords: interpretability
Abstract: Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally -- through long-tailed morphological distributions, and locally through -- tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our codes are available at this https URL.

Title: A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools

Authors: Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2506.20743
Pdf URL: https://arxiv.org/pdf/2506.20743
Copy Paste: [[2506.20743]] A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools(https://arxiv.org/abs/2506.20743)
Keywords: extraction, interpretability, large language model
Abstract: Foundation models (FMs) are catalyzing a transformative shift in materials science (MatSci) by enabling scalable, general-purpose, and multimodal AI systems for scientific discovery. Unlike traditional machine learning models, which are typically narrow in scope and require task-specific engineering, FMs offer cross-domain generalization and exhibit emergent capabilities. Their versatility is especially well-suited to materials science, where research challenges span diverse data types and scales. This survey provides a comprehensive overview of foundation models, agentic systems, datasets, and computational tools supporting this growing field. We introduce a task-driven taxonomy encompassing six broad application areas: data extraction, interpretation and Q\&A; atomistic simulation; property prediction; materials structure, design and discovery; process planning, discovery, and optimization; and multiscale modeling. We discuss recent advances in both unimodal and multimodal FMs, as well as emerging large language model (LLM) agents. Furthermore, we review standardized datasets, open-source tools, and autonomous experimental platforms that collectively fuel the development and integration of FMs into research workflows. We assess the early successes of foundation models and identify persistent limitations, including challenges in generalizability, interpretability, data imbalance, safety concerns, and limited multimodal fusion. Finally, we articulate future research directions centered on scalable pretraining, continual learning, data governance, and trustworthiness.

Title: Multiple Streams of Relation Extraction: Enriching and Recalling in Transformers

Authors: Todd Nief, David Reber, Sean Richardson, Ari Holtzman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.20746
Pdf URL: https://arxiv.org/pdf/2506.20746
Copy Paste: [[2506.20746]] Multiple Streams of Relation Extraction: Enriching and Recalling in Transformers(https://arxiv.org/abs/2506.20746)
Keywords: extraction, transformer
Abstract: When an LLM learns a relation during finetuning (e.g., new movie releases, corporate mergers, etc.), where does this information go? Is it extracted when the model processes an entity, recalled just-in-time before a prediction, or are there multiple separate heuristics? Existing localization approaches (e.g. activation patching) are ill-suited for this analysis because they tend to replace parts of the residual stream, potentially deleting information. To fill this gap, we propose dynamic weight-grafting between fine-tuned and pre-trained language models to show that fine-tuned language models both (1) extract relation information learned during finetuning while processing entities and (2) ``recall" this information in later layers while generating predictions. In some cases, models need both of these pathways to correctly generate finetuned information while, in other cases, a single ``enrichment" or ``recall" pathway alone is sufficient. We examine the necessity and sufficiency of these information pathways, examining what layers they occur at, how much redundancy they exhibit, and which model components are involved -- finding that the ``recall" pathway occurs via both task-specific attention mechanisms and a relation extraction step in the output of the attention and the feedforward networks at the final layers before next token prediction.

Title: Towards Probabilistic Question Answering Over Tabular Data

Authors: Chen Shen, Sajjadur Rahman, Estevam Hruschka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.20747
Pdf URL: https://arxiv.org/pdf/2506.20747
Copy Paste: [[2506.20747]] Towards Probabilistic Question Answering Over Tabular Data(https://arxiv.org/abs/2506.20747)
Keywords: large language model
Abstract: Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.

Title: Characterization and Mitigation of Training Instabilities in Microscaling Formats

Authors: Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2506.20752
Pdf URL: https://arxiv.org/pdf/2506.20752
Copy Paste: [[2506.20752]] Characterization and Mitigation of Training Instabilities in Microscaling Formats(https://arxiv.org/abs/2506.20752)
Keywords: large language model
Abstract: Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior as the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through \emph{in situ} intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at this https URL.

Title: StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

Authors: Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20756
Pdf URL: https://arxiv.org/pdf/2506.20756
Copy Paste: [[2506.20756]] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation(https://arxiv.org/abs/2506.20756)
Keywords: diffusion
Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. While the consistency for dynamic regions still should be learned from large-scale video depth data to ensure smooth transitions, due to the violation of triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.

Title: Perry: A High-level Framework for Accelerating Cyber Deception Experimentation

Authors: Brian Singer, Yusuf Saquib, Lujo Bauer, Vyas Sekar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.20770
Pdf URL: https://arxiv.org/pdf/2506.20770
Copy Paste: [[2506.20770]] Perry: A High-level Framework for Accelerating Cyber Deception Experimentation(https://arxiv.org/abs/2506.20770)
Keywords: security, defense, attack
Abstract: Cyber deception aims to distract, delay, and detect network attackers with fake assets such as honeypots, decoy credentials, or decoy files. However, today, it is difficult for operators to experiment, explore, and evaluate deception approaches. Existing tools and platforms have non-portable and complex implementations that are difficult to modify and extend. We address this pain point by introducing Perry, a high-level framework that accelerates the design and exploration of deception what-if scenarios. Perry has two components: a high-level abstraction layer for security operators to specify attackers and deception strategies, and an experimentation module to run these attackers and defenders in realistic emulated networks. To translate these high-level specifications we design four key modules for Perry: 1) an action planner that translates high-level actions into low-level implementations, 2) an observability module to translate low-level telemetry into high-level observations, 3) an environment state service that enables environment agnostic strategies, and 4) an attack graph service to reason about how attackers could explore an environment. We illustrate that Perry's abstractions reduce the implementation effort of exploring a wide variety of deception defenses, attackers, and environments. We demonstrate the value of Perry by emulating 55 unique deception what-if scenarios and illustrate how these experiments enable operators to shed light on subtle tradeoffs.

Title: Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models

Authors: Xinghao Dong, Huchen Yang, Jin-Long Wu
Subjects: cs.LG, math.DS, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.20771
Pdf URL: https://arxiv.org/pdf/2506.20771
Copy Paste: [[2506.20771]] Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models(https://arxiv.org/abs/2506.20771)
Keywords: diffusion, generative
Abstract: We propose a latent score-based generative AI framework for learning stochastic, non-local closure models and constitutive laws in nonlinear dynamical systems of computational mechanics. This work addresses a key challenge of modeling complex multiscale dynamical systems without a clear scale separation, for which numerically resolving all scales is prohibitively expensive, e.g., for engineering turbulent flows. While classical closure modeling methods leverage domain knowledge to approximate subgrid-scale phenomena, their deterministic and local assumptions can be too restrictive in regimes lacking a clear scale separation. Recent developments of diffusion-based stochastic models have shown promise in the context of closure modeling, but their prohibitive computational inference cost limits practical applications for many real-world applications. This work addresses this limitation by jointly training convolutional autoencoders with conditional diffusion models in the latent spaces, significantly reducing the dimensionality of the sampling process while preserving essential physical characteristics. Numerical results demonstrate that the joint training approach helps discover a proper latent space that not only guarantees small reconstruction errors but also ensures good performance of the diffusion model in the latent space. When integrated into numerical simulations, the proposed stochastic modeling framework via latent conditional diffusion models achieves significant computational acceleration while maintaining comparable predictive accuracy to standard diffusion models in physical spaces.

Title: AI-Driven MRI-based Brain Tumour Segmentation Benchmarking

Authors: Connor Ludwig, Khashayar Namdar, Farzad Khalvati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20786
Pdf URL: https://arxiv.org/pdf/2506.20786
Copy Paste: [[2506.20786]] AI-Driven MRI-based Brain Tumour Segmentation Benchmarking(https://arxiv.org/abs/2506.20786)
Keywords: segmentation
Abstract: Medical image segmentation has greatly aided medical diagnosis, with U-Net based architectures and nnU-Net providing state-of-the-art performance. There have been numerous general promptable models and medical variations introduced in recent years, but there is currently a lack of evaluation and comparison of these models across a variety of prompt qualities on a common medical dataset. This research uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net to obtain zero-shot inference on the BraTS 2023 adult glioma and pediatrics dataset across multiple prompt qualities for both points and bounding boxes. Several of these models exhibit promising Dice scores, particularly SAM and SAM 2 achieving scores of up to 0.894 and 0.893, respectively when given extremely accurate bounding box prompts which exceeds nnU-Net's segmentation performance. However, nnU-Net remains the dominant medical image segmentation network due to the impracticality of providing highly accurate prompts to the models. The model and prompt evaluation, as well as the comparison, are extended through fine-tuning SAM, SAM 2, MedSAM, and SAM-Med-3D on the pediatrics dataset. The improvements in point prompt performance after fine-tuning are substantial and show promise for future investigation, but are unable to achieve better segmentation than bounding boxes or nnU-Net.

Title: Stochastic Parameter Decomposition

Authors: Lucius Bushnaq, Dan Braun, Lee Sharkey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20790
Pdf URL: https://arxiv.org/pdf/2506.20790
Copy Paste: [[2506.20790]] Stochastic Parameter Decomposition(https://arxiv.org/abs/2506.20790)
Keywords: robust, interpretability
Abstract: A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition -- a framework that has been proposed to resolve several issues with current decomposition methods -- decomposes neural network parameters into a sum of sparsely used vectors in parameter space. However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters. In this work, we introduce \textit{Stochastic Parameter Decomposition} (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose with APD. We also show that SPD avoids other issues, such as shrinkage of the learned parameters, and better identifies ground truth mechanisms in toy models. By bridging causal mediation analysis and network decomposition methods, this demonstration opens up new research possibilities in mechanistic interpretability by removing barriers to scaling linear parameter decomposition methods to larger models. We release a library for running SPD and reproducing our experiments at this https URL.

Title: Multi-lingual Functional Evaluation for Large Language Models

Authors: Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.20793
Pdf URL: https://arxiv.org/pdf/2506.20793
Copy Paste: [[2506.20793]] Multi-lingual Functional Evaluation for Large Language Models(https://arxiv.org/abs/2506.20793)
Keywords: robust, large language model
Abstract: Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval)-- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e. across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly there's a 15 - 24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (eg. Arabic, English) being the most consistently well performing across evaluation iterations.

Title: SIMulator: SIM Tracing on a (Pico-)Budget

Authors: Gabriel K. Gegenhuber, Philipp É. Frenzel, Adrian Dabrowski
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.20800
Pdf URL: https://arxiv.org/pdf/2506.20800
Copy Paste: [[2506.20800]] SIMulator: SIM Tracing on a (Pico-)Budget(https://arxiv.org/abs/2506.20800)
Keywords: security
Abstract: SIM tracing -- the ability to inspect, modify, and relay communication between a SIM card and modem -- has become a significant technique in cellular network research. It enables essential security- and development-related applications such as fuzzing communication interfaces, extracting session keys, monitoring hidden SIM activity (e.g., proactive SIM commands or over-the-air updates), and facilitating scalable, distributed measurement platforms through SIM reuse. Traditionally, achieving these capabilities has relied on specialized hardware, which can pose financial and logistical burdens for researchers, particularly those new to the field. In this work, we show that full SIM tracing functionality can be achieved using only simple, widely available components, such as UART interfaces and GPIO ports. We port these capabilities to low-cost microcontrollers, exemplified by the Raspberry Pi Pico (4~USD). Unlike other approaches, it dramatically reduces hardware complexity by electrically decoupling the SIM and the modem and only transferring on APDU level. By significantly reducing hardware requirements and associated costs, we aim to make SIM tracing techniques accessible to a broader community of researchers and hobbyists, fostering wider exploration and experimentation in cellular network research.

Title: The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Authors: Chenglei Si, Tatsunori Hashimoto, Diyi Yang
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.20803
Pdf URL: https://arxiv.org/pdf/2506.20803
Copy Paste: [[2506.20803]] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas(https://arxiv.org/abs/2506.20803)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

Title: Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis

Authors: Zhonghao Zhan, Huichi Zhou, Hamed Haddadi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20806
Pdf URL: https://arxiv.org/pdf/2506.20806
Copy Paste: [[2506.20806]] Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis(https://arxiv.org/abs/2506.20806)
Keywords: security, attack, robust, large language model
Abstract: Graph Neural Networks (GNNs) show great promise for Network Intrusion Detection Systems (NIDS), particularly in IoT environments, but suffer performance degradation due to distribution drift and lack robustness against realistic adversarial attacks. Current robustness evaluations often rely on unrealistic synthetic perturbations and lack demonstrations on systematic analysis of different kinds of adversarial attack, which encompass both black-box and white-box scenarios. This work proposes a novel approach to enhance GNN robustness and generalization by employing Large Language Models (LLMs) in an agentic pipeline as simulated cybersecurity expert agents. These agents scrutinize graph structures derived from network flow data, identifying and potentially mitigating suspicious or adversarially perturbed elements before GNN processing. Our experiments, using a framework designed for realistic evaluation and testing with a variety of adversarial attacks including a dataset collected from physical testbed experiments, demonstrate that integrating LLM analysis can significantly improve the resilience of GNN-based NIDS against challenges, showcasing the potential of LLM agent as a complementary layer in intrusion detection architectures.

Title: Divide, Specialize, and Route: A New Approach to Efficient Ensemble Learning

Authors: Jakub Piwko, Jędrzej Ruciński, Dawid Płudowski, Antoni Zajko, Patryzja Żak, Mateusz Zacharecki, Anna Kozak, Katarzyna Woźnica
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.20814
Pdf URL: https://arxiv.org/pdf/2506.20814
Copy Paste: [[2506.20814]] Divide, Specialize, and Route: A New Approach to Efficient Ensemble Learning(https://arxiv.org/abs/2506.20814)
Keywords: robust, interpretability
Abstract: Ensemble learning has proven effective in boosting predictive performance, but traditional methods such as bagging, boosting, and dynamic ensemble selection (DES) suffer from high computational cost and limited adaptability to heterogeneous data distributions. To address these limitations, we propose Hellsemble, a novel and interpretable ensemble framework for binary classification that leverages dataset complexity during both training and inference. Hellsemble incrementally partitions the dataset into circles of difficulty by iteratively passing misclassified instances from simpler models to subsequent ones, forming a committee of specialised base learners. Each model is trained on increasingly challenging subsets, while a separate router model learns to assign new instances to the most suitable base model based on inferred difficulty. Hellsemble achieves strong classification accuracy while maintaining computational efficiency and interpretability. Experimental results on OpenML-CC18 and Tabzilla benchmarks demonstrate that Hellsemble often outperforms classical ensemble methods. Our findings suggest that embracing instance-level difficulty offers a promising direction for constructing efficient and robust ensemble systems.

Title: Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers

Authors: Furkan Mumcu, Yasin Yilmaz
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.20816
Pdf URL: https://arxiv.org/pdf/2506.20816
Copy Paste: [[2506.20816]] Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers(https://arxiv.org/abs/2506.20816)
Keywords: defense, attack, robust
Abstract: Deep Neural Networks (DNNs) are notoriously vulnerable to adversarial input designs with limited noise budgets. While numerous successful attacks with subtle modifications to original input have been proposed, defense techniques against these attacks are relatively understudied. Existing defense approaches either focus on improving DNN robustness by negating the effects of perturbations or use a secondary model to detect adversarial data. Although equally important, the attack detection approach, which is studied in this work, provides a more practical defense compared to the robustness approach. We show that the existing detection methods are either ineffective against the state-of-the-art attack techniques or computationally inefficient for real-time processing. We propose a novel universal and efficient method to detect adversarial examples by analyzing the varying degrees of impact of attacks on different DNN layers. {Our method trains a lightweight regression model that predicts deeper-layer features from early-layer features, and uses the prediction error to detect adversarial samples.} Through theoretical arguments and extensive experiments, we demonstrate that our detection method is highly effective, computationally efficient for real-time processing, compatible with any DNN architecture, and applicable across different domains, such as image, video, and audio.

Title: MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering

Authors: Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2506.20821
Pdf URL: https://arxiv.org/pdf/2506.20821
Copy Paste: [[2506.20821]] MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering(https://arxiv.org/abs/2506.20821)
Keywords: extraction, large language model
Abstract: Financial documents--such as 10-Ks, 10-Qs, and investor presentations--span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.

Title: Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes

Authors: Quintin Myers, Yanjun Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20822
Pdf URL: https://arxiv.org/pdf/2506.20822
Copy Paste: [[2506.20822]] Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes(https://arxiv.org/abs/2506.20822)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.

Title: Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models

Authors: Cansu Korkmaz, Ahmet Murat Tekalp, Zafer Dogan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20832
Pdf URL: https://arxiv.org/pdf/2506.20832
Copy Paste: [[2506.20832]] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models(https://arxiv.org/abs/2506.20832)
Keywords: robust, diffusion, generative
Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS) a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR, LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.

Title: Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA

Authors: Fei Wang, Baochun Li
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2506.20856
Pdf URL: https://arxiv.org/pdf/2506.20856
Copy Paste: [[2506.20856]] Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA(https://arxiv.org/abs/2506.20856)
Keywords: attack, extraction, large language model
Abstract: Memorization in large language models (LLMs) makes them vulnerable to data extraction attacks. While pre-training memorization has been extensively studied, fewer works have explored its impact in fine-tuning, particularly for LoRA fine-tuning, a widely adopted parameter-efficient method. In this work, we re-examine memorization in fine-tuning and uncover a surprising divergence from prior findings across different fine-tuning strategies. Factors such as model scale and data duplication, which strongly influence memorization in pre-training and full fine-tuning, do not follow the same trend in LoRA fine-tuning. Using a more relaxed similarity-based memorization metric, we demonstrate that LoRA significantly reduces memorization risks compared to full fine-tuning, while still maintaining strong task performance.

Title: Empowering Digital Agriculture: A Privacy-Preserving Framework for Data Sharing and Collaborative Research

Authors: Osama Zafar, Rosemarie Santa González, Mina Namazi, Alfonso Morales, Erman Ayday
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.20872
Pdf URL: https://arxiv.org/pdf/2506.20872
Copy Paste: [[2506.20872]] Empowering Digital Agriculture: A Privacy-Preserving Framework for Data Sharing and Collaborative Research(https://arxiv.org/abs/2506.20872)
Keywords: secure, privacy, protect, attack, robust, federate
Abstract: Data-driven agriculture, which integrates technology and data into agricultural practices, has the potential to improve crop yield, disease resilience, and long-term soil health. However, privacy concerns, such as adverse pricing, discrimination, and resource manipulation, deter farmers from sharing data, as it can be used against them. To address this barrier, we propose a privacy-preserving framework that enables secure data sharing and collaboration for research and development while mitigating privacy risks. The framework combines dimensionality reduction techniques (like Principal Component Analysis (PCA)) and differential privacy by introducing Laplacian noise to protect sensitive information. The proposed framework allows researchers to identify potential collaborators for a target farmer and train personalized machine learning models either on the data of identified collaborators via federated learning or directly on the aggregated privacy-protected data. It also allows farmers to identify potential collaborators based on similarities. We have validated this on real-life datasets, demonstrating robust privacy protection against adversarial attacks and utility performance comparable to a centralized system. We demonstrate how this framework can facilitate collaboration among farmers and help researchers pursue broader research objectives. The adoption of the framework can empower researchers and policymakers to leverage agricultural data responsibly, paving the way for transformative advances in data-driven agriculture. By addressing critical privacy challenges, this work supports secure data integration, fostering innovation and sustainability in agricultural systems.

Title: THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion

Authors: Calin Teodor Ioan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20877
Pdf URL: https://arxiv.org/pdf/2506.20877
Copy Paste: [[2506.20877]] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion(https://arxiv.org/abs/2506.20877)
Keywords: transformer
Abstract: Monocular depth estimation methods traditionally train deep models to infer depth directly from RGB pixels. This implicit learning often overlooks explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. Rather than expecting a network to discover these cues unaided, we present ThirdEye, a cue-aware pipeline that deliberately supplies each cue through specialised, pre-trained, and frozen networks. These cues are fused in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module that weights them by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. This extended version provides additional architectural detail, neuroscientific motivation, and an expanded experimental protocol; quantitative results will appear in a future revision.

Title: MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Authors: Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20879
Pdf URL: https://arxiv.org/pdf/2506.20879
Copy Paste: [[2506.20879]] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans(https://arxiv.org/abs/2506.20879)
Keywords: generative, segmentation
Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.

Title: Omniwise: Predicting GPU Kernels Performance with LLMs

Authors: Zixian Wang, Cole Ramos, Muhammad A. Awad, Keith Lowery
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20886
Pdf URL: https://arxiv.org/pdf/2506.20886
Copy Paste: [[2506.20886]] Omniwise: Predicting GPU Kernels Performance with LLMs(https://arxiv.org/abs/2506.20886)
Keywords: large language model
Abstract: In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks beyond human reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline that applies large language models (LLMs) to GPU kernel performance prediction--a novel use case in performance profiling. Omniwise is model-agnostic and lightweight, achieving strong results even with a small 3B-parameter model. It can predict key performance metrics, including memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, directly from kernel code without the need for code execution or profiling tools. Our approach achieves over 90% of predictions within 10% relative error on GPU kernels executed on AMD MI250 and MI300X architectures. In addition to the pipeline, we develop an online inference server and a Visual Studio Code plugin that seamlessly integrate LLM-based performance prediction into developers' workflows.

Title: On the Necessity of Output Distribution Reweighting for Effective Class Unlearning

Authors: Yian Wang, Ali Ebrahimpour-Boroojeny, Hari Sundaram
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.20893
Pdf URL: https://arxiv.org/pdf/2506.20893
Copy Paste: [[2506.20893]] On the Necessity of Output Distribution Reweighting for Effective Class Unlearning(https://arxiv.org/abs/2506.20893)
Keywords: attack, robust, membership infer
Abstract: In this work, we introduce an output-reweighting unlearning method, RWFT, a lightweight technique that erases an entire class from a trained classifier without full retraining. Forgetting specific classes from trained models is essential for enforcing user deletion rights and mitigating harmful or biased predictions. The full retraining is costly and existing unlearning methods fail to replicate the behavior of the retrained models when predicting samples from the unlearned class. We prove this failure by designing a variant of membership inference attacks, MIA-NN that successfully reveals the unlearned class for any of these methods. We propose a simple redistribution of the probability mass for the prediction on the samples in the forgotten class which is robust to MIA-NN. We also introduce a new metric based on the total variation (TV) distance of the prediction probabilities to quantify residual leakage to prevent future methods from susceptibility to the new attack. Through extensive experiments with state of the art baselines in machine unlearning, we show that our approach matches the results of full retraining in both metrics used for evaluation by prior work and the new metric we propose in this work. Compare to state-of-the-art methods, we gain 2.79% in previously used metrics and 111.45% in our new TV-based metric over the best existing method.

Title: FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

Authors: Advait Gupta, Rishie Raj, Dang Nguyen, Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20911
Pdf URL: https://arxiv.org/pdf/2506.20911
Copy Paste: [[2506.20911]] FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing(https://arxiv.org/abs/2506.20911)
Keywords: large language model
Abstract: We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow.'' It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A$^*$ search per subtask to find a cost-efficient toolpath -- a sequence of calls to AI tools. To save the cost of A$^*$ on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A$^*$ search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA$^*$'': fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A$^*$ search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.

Title: ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models

Authors: Mina Namazi, Alexander Nemecek, Erman Ayday
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.20915
Pdf URL: https://arxiv.org/pdf/2506.20915
Copy Paste: [[2506.20915]] ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models(https://arxiv.org/abs/2506.20915)
Keywords: security, privacy, large language model
Abstract: As the deployment of large language models (LLMs) grows in sensitive domains, ensuring the integrity of their computational provenance becomes a critical challenge, particularly in regulated sectors such as healthcare, where strict requirements are applied in dataset usage. We introduce ZKPROV, a novel cryptographic framework that enables zero-knowledge proofs of LLM provenance. It allows users to verify that a model is trained on a reliable dataset without revealing sensitive information about it or its parameters. Unlike prior approaches that focus on complete verification of the training process (incurring significant computational cost) or depend on trusted execution environments, ZKPROV offers a distinct balance. Our method cryptographically binds a trained model to its authorized training dataset(s) through zero-knowledge proofs while avoiding proof of every training step. By leveraging dataset-signed metadata and compact model parameter commitments, ZKPROV provides sound and privacy-preserving assurances that the result of the LLM is derived from a model trained on the claimed authorized and relevant dataset. Experimental results demonstrate the efficiency and scalability of the ZKPROV in generating this proof and verifying it, achieving a practical solution for real-world deployments. We also provide formal security guarantees, proving that our approach preserves dataset confidentiality while ensuring trustworthy dataset provenance.

Title: Optimising Language Models for Downstream Tasks: A Post-Training Perspective

Authors: Zhengyan Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20917
Pdf URL: https://arxiv.org/pdf/2506.20917
Copy Paste: [[2506.20917]] Optimising Language Models for Downstream Tasks: A Post-Training Perspective(https://arxiv.org/abs/2506.20917)
Keywords: robust
Abstract: Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications. These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.

Title: FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Authors: Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.20920
Pdf URL: https://arxiv.org/pdf/2506.20920
Copy Paste: [[2506.20920]] FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language(https://arxiv.org/abs/2506.20920)
Keywords: large language model
Abstract: Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

Title: LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Authors: Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, Amir Barati Farimani
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2506.20921
Pdf URL: https://arxiv.org/pdf/2506.20921
Copy Paste: [[2506.20921]] LLM-guided Chemical Process Optimization with a Multi-Agent Approach(https://arxiv.org/abs/2506.20921)
Keywords: large language model
Abstract: Chemical process optimization is crucial to maximize production efficiency and economic performance. Traditional methods, including gradient-based solvers, evolutionary algorithms, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable, requiring engineers to rely on subjective heuristics to estimate feasible parameter ranges. To address this constraint definition bottleneck, we present a multi-agent framework of large language model (LLM) agents that autonomously infer operating constraints from minimal process descriptions, then collaboratively guide optimization using the inferred constraints. Our AutoGen-based agentic framework employs OpenAI's o3 model, with specialized agents for constraint generation, parameter validation, simulation execution, and optimization guidance. Through two phases - autonomous constraint generation using embedded domain knowledge, followed by iterative multi-agent optimization - the framework eliminates the need for predefined operational bounds. Validated on the hydrodealkylation process across cost, yield, and yield-to-cost ratio metrics, the framework demonstrated competitive performance with conventional optimization methods while achieving better computational efficiency, requiring fewer iterations to converge. Our approach converged in under 20 minutes, achieving a 31-fold speedup over grid search. Beyond computational efficiency, the framework's reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs, and applying domain-informed heuristics. This approach shows significant potential for optimization scenarios where operational constraints are poorly characterized or unavailable, particularly for emerging processes and retrofit applications.

Title: M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization

Authors: Ju-Hyeon Nam, Dong-Hyun Moon, Sang-Chul Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20922
Pdf URL: https://arxiv.org/pdf/2506.20922
Copy Paste: [[2506.20922]] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization(https://arxiv.org/abs/2506.20922)
Keywords: transformer
Abstract: Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.

Title: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Authors: Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.20923
Pdf URL: https://arxiv.org/pdf/2506.20923
Copy Paste: [[2506.20923]] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model(https://arxiv.org/abs/2506.20923)
Keywords: robust, transformer
Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.

Title: CodeGuard: A Generalized and Stealthy Backdoor Watermarking for Generative Code Models

Authors: Haoxuan Li, Jiale Zhang, Xiaobing Sun, Xiapu Luo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.20926
Pdf URL: https://arxiv.org/pdf/2506.20926
Copy Paste: [[2506.20926]] CodeGuard: A Generalized and Stealthy Backdoor Watermarking for Generative Code Models(https://arxiv.org/abs/2506.20926)
Keywords: protect, steal, watermark, generative
Abstract: Generative code models (GCMs) significantly enhance development efficiency through automated code generation and code summarization. However, building and training these models require computational resources and time, necessitating effective digital copyright protection to prevent unauthorized leaks and misuse. Backdoor watermarking, by embedding hidden identifiers, simplifies copyright verification by breaking the model's black-box nature. Current backdoor watermarking techniques face two main challenges: first, limited generalization across different tasks and datasets, causing fluctuating verification rates; second, insufficient stealthiness, as watermarks are easily detected and removed by automated methods. To address these issues, we propose CodeGuard, a novel watermarking method combining attention mechanisms with distributed trigger embedding strategies. Specifically, CodeGuard employs attention mechanisms to identify watermark embedding positions, ensuring verifiability. Moreover, by using homomorphic character replacement, it avoids manual detection, while distributed trigger embedding reduces the likelihood of automated detection. Experimental results demonstrate that CodeGuard achieves up to 100% watermark verification rates in both code summarization and code generation tasks, with no impact on the primary task performance. In terms of stealthiness, CodeGuard performs exceptionally, with a maximum detection rate of only 0.078 against ONION detection methods, significantly lower than baseline methods.

Title: Interpretable Representation Learning for Additive Rule Ensembles

Authors: Shahrzad Behzadimanesh, Pierre Le Bodic, Geoffrey I. Webb, Mario Boley
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20927
Pdf URL: https://arxiv.org/pdf/2506.20927
Copy Paste: [[2506.20927]] Interpretable Representation Learning for Additive Rule Ensembles(https://arxiv.org/abs/2506.20927)
Keywords: interpretability
Abstract: Small additive ensembles of symbolic rules offer interpretable prediction models. Traditionally, these ensembles use rule conditions based on conjunctions of simple threshold propositions $x \geq t$ on a single input variable $x$ and threshold $t$, resulting geometrically in axis-parallel polytopes as decision regions. While this form ensures a high degree of interpretability for individual rules and can be learned efficiently using the gradient boosting approach, it relies on having access to a curated set of expressive and ideally independent input features so that a small ensemble of axis-parallel regions can describe the target variable well. Absent such features, reaching sufficient accuracy requires increasing the number and complexity of individual rules, which diminishes the interpretability of the model. Here, we extend classical rule ensembles by introducing logical propositions with learnable sparse linear transformations of input variables, i.e., propositions of the form $\mathbf{x}^\mathrm{T}\mathbf{w} \geq t$, where $\mathbf{w}$ is a learnable sparse weight vector, enabling decision regions as general polytopes with oblique faces. We propose a learning method using sequential greedy optimization based on an iteratively reweighted formulation of logistic regression. Experimental results demonstrate that the proposed method efficiently constructs rule ensembles with the same test risk as state-of-the-art methods while significantly reducing model complexity across ten benchmark datasets.

Title: SPA: Towards More Stealth and Persistent Backdoor Attacks in Federated Learning

Authors: Chengcheng Zhu, Ye Li, Bosen Rao, Jiale Zhang, Yunlong Mao, Sheng Zhong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.20931
Pdf URL: https://arxiv.org/pdf/2506.20931
Copy Paste: [[2506.20931]] SPA: Towards More Stealth and Persistent Backdoor Attacks in Federated Learning(https://arxiv.org/abs/2506.20931)
Keywords: security, privacy, defense, attack, robust, steal, federate
Abstract: Federated Learning (FL) has emerged as a leading paradigm for privacy-preserving distributed machine learning, yet the distributed nature of FL introduces unique security challenges, notably the threat of backdoor attacks. Existing backdoor strategies predominantly rely on end-to-end label supervision, which, despite their efficacy, often results in detectable feature disentanglement and limited persistence. In this work, we propose a novel and stealthy backdoor attack framework, named SPA, which fundamentally departs from traditional approaches by leveraging feature-space alignment rather than direct trigger-label association. Specifically, SPA reduces representational distances between backdoor trigger features and target class features, enabling the global model to misclassify trigger-embedded inputs with high stealth and persistence. We further introduce an adaptive, adversarial trigger optimization mechanism, utilizing boundary-search in the feature space to enhance attack longevity and effectiveness, even against defensive FL scenarios and non-IID data distributions. Extensive experiments on various FL benchmarks demonstrate that SPA consistently achieves high attack success rates with minimal impact on model utility, maintains robustness under challenging participation and data heterogeneity conditions, and exhibits persistent backdoor effects far exceeding those of conventional techniques. Our results call urgent attention to the evolving sophistication of backdoor threats in FL and emphasize the pressing need for advanced, feature-level defense techniques.

Title: Model State Arithmetic for Machine Unlearning

Authors: Keivan Rezaei, Mehrdad Saberi, Abhilasha Ravichander, Soheil Feizi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.20941
Pdf URL: https://arxiv.org/pdf/2506.20941
Copy Paste: [[2506.20941]] Model State Arithmetic for Machine Unlearning(https://arxiv.org/abs/2506.20941)
Keywords: large language model
Abstract: Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints through complete retraining -- by repeatedly pretraining the model on datasets that exclude these specific instances -- is computationally prohibitive. For this reason, unlearning algorithms have emerged that aim to eliminate the influence of particular datapoints, while otherwise preserving the model -- at a low computational cost. However, precisely estimating and undoing the influence of individual datapoints has proved to be challenging. In this work, we propose a new algorithm, MSA, for estimating and undoing the influence of datapoints -- by leveraging model checkpoints i.e. artifacts capturing model states at different stages of pretraining. Our experimental results demonstrate that MSA consistently outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.

Title: Hierarchical Sub-action Tree for Continuous Sign Language Recognition

Authors: Dejie Yang, Zhu Xu, Xinjie Gao, Yang Liu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.20947
Pdf URL: https://arxiv.org/pdf/2506.20947
Copy Paste: [[2506.20947]] Hierarchical Sub-action Tree for Continuous Sign Language Recognition(https://arxiv.org/abs/2506.20947)
Keywords: large language model
Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.

Title: Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding

Authors: Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20957
Pdf URL: https://arxiv.org/pdf/2506.20957
Copy Paste: [[2506.20957]] Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding(https://arxiv.org/abs/2506.20957)
Keywords: robust, diffusion
Abstract: Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose \textbf{AbMEGD}, an end-to-end framework integrating \textbf{M}ulti-scale \textbf{E}quivariant \textbf{G}raph \textbf{D}iffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13\% increase in amino acid recovery, 3.32\% rise in improvement percentage, and a 0.062~Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD's ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: this https URL.

Title: Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

Authors: Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, Faisal Mahmood
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20964
Pdf URL: https://arxiv.org/pdf/2506.20964
Copy Paste: [[2506.20964]] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology(https://arxiv.org/abs/2506.20964)
Keywords: large language model
Abstract: Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question answer turns. Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also capable of generating visually grounded, humanly-interpretable summary reports.

Title: DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing

Authors: Lingling Cai, Kang Zhao, Hangjie Yuan, Xiang Wang, Yingya Zhang, Kejie Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20967
Pdf URL: https://arxiv.org/pdf/2506.20967
Copy Paste: [[2506.20967]] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing(https://arxiv.org/abs/2506.20967)
Keywords: diffusion, transformer
Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85\% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.

Title: From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

Authors: Tao Liu, Dafeng Zhang, Gengchen Li, Shizhuo Liu, Yongqi Song, Senmao Li, Shiqi Yang, Boqian Li, Kai Wang, Yaxing Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20977
Pdf URL: https://arxiv.org/pdf/2506.20977
Copy Paste: [[2506.20977]] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging(https://arxiv.org/abs/2506.20977)
Keywords: diffusion
Abstract: Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation--what we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way. Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency.

Title: PrivacyGo: Privacy-Preserving Ad Measurement with Multidimensional Intersection

Authors: Jian Du, Haohao Qian, Shikun Zhang, Wen-jie Lu, Donghang Lu, Yongchuan Niu, Bo Jiang, Yongjun Zhao, Qiang Yan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.20981
Pdf URL: https://arxiv.org/pdf/2506.20981
Copy Paste: [[2506.20981]] PrivacyGo: Privacy-Preserving Ad Measurement with Multidimensional Intersection(https://arxiv.org/abs/2506.20981)
Keywords: secure, privacy, attack, membership infer
Abstract: This paper tackles the challenging and practical problem of multi-identifier private user profile matching for privacy-preserving ad measurement, a cornerstone of modern advertising analytics. We introduce a comprehensive cryptographic framework leveraging reversed Oblivious Pseudorandom Functions (OPRF) and novel blind key rotation techniques to support secure matching across multiple identifiers. Our design prevents cross-identifier linkages and includes a differentially private mechanism to obfuscate intersection sizes, mitigating risks such as membership inference attacks. We present a concrete construction of our protocol that achieves both strong privacy guarantees and high efficiency. It scales to large datasets, offering a practical and scalable solution for privacy-centric applications like secure ad conversion tracking. By combining rigorous cryptographic principles with differential privacy, our work addresses a critical need in the advertising industry, setting a new standard for privacy-preserving ad measurement frameworks.

Title: Rethink Sparse Signals for Pose-guided Text-to-image Generation

Authors: Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20983
Pdf URL: https://arxiv.org/pdf/2506.20983
Copy Paste: [[2506.20983]] Rethink Sparse Signals for Pose-guided Text-to-image Generation(https://arxiv.org/abs/2506.20983)
Keywords: robust
Abstract: Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet(SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at this https URL.

Title: Segment Anything in Pathology Images with Natural Language

Authors: Zhixuan Chen, Junlin Hou, Liqi Lin, Yihui Wang, Yequan Bie, Xi Wang, Yanning Zhou, Ronald Cheong Kin Chan, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20988
Pdf URL: https://arxiv.org/pdf/2506.20988
Copy Paste: [[2506.20988]] Segment Anything in Pathology Images with Natural Language(https://arxiv.org/abs/2506.20988)
Keywords: robust, interpretability, segmentation
Abstract: Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg , the largest and most comprehensive dataset for pathology segmentation, built from 17 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor's outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.

Title: Can Gradient Descent Simulate Prompting?

Authors: Eric Zhang, Leshem Choshen, Jacob Andreas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.20989
Pdf URL: https://arxiv.org/pdf/2506.20989
Copy Paste: [[2506.20989]] Can Gradient Descent Simulate Prompting?(https://arxiv.org/abs/2506.20989)
Keywords: robust
Abstract: There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM's own prompted predictions as targets, eliminating the need for ground-truth labels. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance -- showing improvement on the ``reversal curse'' tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.

Title: TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation

Authors: Chade Li, Pengju Zhang, Yihong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20991
Pdf URL: https://arxiv.org/pdf/2506.20991
Copy Paste: [[2506.20991]] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation(https://arxiv.org/abs/2506.20991)
Keywords: segmentation
Abstract: The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem, we propose TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. We introduce the direct cross-modal alignment module to establish explicit alignment between 3D point clouds and textual/2D image data. Within the memory module, we employ multiple dedicated memory banks to separately store text features, visual features, and their cross-modal correspondence mappings. These memory banks are dynamically leveraged through self-attention and cross-attention mechanisms to update scene-specific features based on prior stored data, effectively addressing inconsistencies in interactive segmentation results across diverse scenarios. Experiments conducted on multiple 3D instruction, reference, and semantic segmentation datasets demonstrate that the proposed method achieves state-of-the-art performance.

Title: SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control

Authors: Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.20993
Pdf URL: https://arxiv.org/pdf/2506.20993
Copy Paste: [[2506.20993]] SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control(https://arxiv.org/abs/2506.20993)
Keywords: large language model
Abstract: Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: \textit{Frequency}, \textit{Depth}, \textit{Threshold}, \textit{Effort}, and \textit{Willingness}. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.

Title: DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting

Authors: Yeon-Ji Song, Jaein Kim, Byung-Ju Kim, Byoung-Tak Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20998
Pdf URL: https://arxiv.org/pdf/2506.20998
Copy Paste: [[2506.20998]] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting(https://arxiv.org/abs/2506.20998)
Keywords: robust
Abstract: Novel view synthesis is a task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge that has yet to be effectively addressed. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, their approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.

Title: Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology

Authors: Qiuyi Qi, Xin Li, Ming Kong, Zikang Xu, Bingdi Chen, Qiang Zhu, S Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21001
Pdf URL: https://arxiv.org/pdf/2506.21001
Copy Paste: [[2506.21001]] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology(https://arxiv.org/abs/2506.21001)
Keywords: robust
Abstract: Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at this https URL.

Title: Inverse Scene Text Removal

Authors: Takumi Yoshimatsu, Shumpei Takezaki, Seiichi Uchida
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21002
Pdf URL: https://arxiv.org/pdf/2506.21002
Copy Paste: [[2506.21002]] Inverse Scene Text Removal(https://arxiv.org/abs/2506.21002)
Keywords: privacy
Abstract: Scene text removal (STR) aims to erase textual elements from images. It was originally intended for removing privacy-sensitiveor undesired texts from natural scene images, but is now also appliedto typographic images. STR typically detects text regions and theninpaints them. Although STR has advanced through neural networksand synthetic data, misuse risks have increased. This paper investi-gates Inverse STR (ISTR), which analyzes STR-processed images andfocuses on binary classification (detecting whether an image has un-dergone STR) and localizing removed text regions. We demonstrate inexperiments that these tasks are achievable with high accuracies, en-abling detection of potential misuse and improving STR. We also at-tempt to recover the removed text content by training a text recognizerto understand its difficulty.

Title: Distilling Normalizing Flows

Authors: Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21003
Pdf URL: https://arxiv.org/pdf/2506.21003
Copy Paste: [[2506.21003]] Distilling Normalizing Flows(https://arxiv.org/abs/2506.21003)
Keywords: generative
Abstract: Explicit density learners are becoming an increasingly popular technique for generative models because of their ability to better model probability distributions. They have advantages over Generative Adversarial Networks due to their ability to perform density estimation and having exact latent-variable inference. This has many advantages, including: being able to simply interpolate, calculate sample likelihood, and analyze the probability distribution. The downside of these models is that they are often more difficult to train and have lower sampling quality. Normalizing flows are explicit density models, that use composable bijective functions to turn an intractable probability function into a tractable one. In this work, we present novel knowledge distillation techniques to increase sampling quality and density estimation of smaller student normalizing flows. We seek to study the capacity of knowledge distillation in Compositional Normalizing Flows to understand the benefits and weaknesses provided by these architectures. Normalizing flows have unique properties that allow for a non-traditional forms of knowledge transfer, where we can transfer that knowledge within intermediate layers. We find that through this distillation, we can make students significantly smaller while making substantial performance gains over a non-distilled student. With smaller models there is a proportionally increased throughput as this is dependent upon the number of bijectors, and thus parameters, in the network.

Title: Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning

Authors: Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad, Talal Arshad, Hafsa Nebbache, Jin Chen, Abdullah Imran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21006
Pdf URL: https://arxiv.org/pdf/2506.21006
Copy Paste: [[2506.21006]] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning(https://arxiv.org/abs/2506.21006)
Keywords: segmentation
Abstract: Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known maligancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at this https URL.

Title: The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion

Authors: Bang Gong, Luchao Qi, Jiaye Wu, Zhicheng Fu, Chunbo Song, David W. Jacobs, John Nicholson, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21008
Pdf URL: https://arxiv.org/pdf/2506.21008
Copy Paste: [[2506.21008]] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion(https://arxiv.org/abs/2506.21008)
Keywords: diffusion
Abstract: We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

Title: FedSC: Federated Learning with Semantic-Aware Collaboration

Authors: Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi, Jun Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21012
Pdf URL: https://arxiv.org/pdf/2506.21012
Copy Paste: [[2506.21012]] FedSC: Federated Learning with Semantic-Aware Collaboration(https://arxiv.org/abs/2506.21012)
Keywords: privacy, federate
Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at semantic-level, aiming to provide fruitful class underlying knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via a discrepancy aggregation manner, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.

Title: HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation

Authors: Qingyue Jiao, Kangyu Zheng, Yiyu Shi, Zhiding Liang
Subjects: cs.CV, cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2506.21015
Pdf URL: https://arxiv.org/pdf/2506.21015
Copy Paste: [[2506.21015]] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation(https://arxiv.org/abs/2506.21015)
Keywords: privacy, robust, generative
Abstract: Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield grayscale low-quality images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on real IBM quantum machine with hardware noise.

Title: Multimodal Prompt Alignment for Facial Expression Recognition

Authors: Fuyan Ma, Yiran He, Bin Sun, Shutao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21017
Pdf URL: https://arxiv.org/pdf/2506.21017
Copy Paste: [[2506.21017]] Multimodal Prompt Alignment for Facial Expression Recognition(https://arxiv.org/abs/2506.21017)
Keywords: large language model
Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.

Title: LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

Authors: Lei Hao, Lina Xu, Chang Liu, Yanni Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21018
Pdf URL: https://arxiv.org/pdf/2506.21018
Copy Paste: [[2506.21018]] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection(https://arxiv.org/abs/2506.21018)
Keywords: extraction
Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at this https URL.

Title: Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

Authors: Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21022
Pdf URL: https://arxiv.org/pdf/2506.21022
Copy Paste: [[2506.21022]] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation(https://arxiv.org/abs/2506.21022)
Keywords: diffusion
Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in speed training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.

Title: Large Language Models Acing Chartered Accountancy

Authors: Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21031
Pdf URL: https://arxiv.org/pdf/2506.21031
Copy Paste: [[2506.21031]] Large Language Models Acing Chartered Accountancy(https://arxiv.org/abs/2506.21031)
Keywords: large language model
Abstract: Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs i.e. GPT 4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4 were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.

Title: DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation

Authors: Wenzhou Lyu, Jialing Lin, Wenqi Ren, Ruihao Xia, Feng Qian, Yang Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21034
Pdf URL: https://arxiv.org/pdf/2506.21034
Copy Paste: [[2506.21034]] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation(https://arxiv.org/abs/2506.21034)
Keywords: robust, diffusion, segmentation
Abstract: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose \textbf{DidSee}, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic this http URL page: this https URL

Title: Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning

Authors: Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21035
Pdf URL: https://arxiv.org/pdf/2506.21035
Copy Paste: [[2506.21035]] Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning(https://arxiv.org/abs/2506.21035)
Keywords: large language model
Abstract: Continual learning (CL) with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters, but suffer from interference, redundancy, and ambiguous routing due to coarse adapter-level selection. However, this design introduces three key challenges: 1) Interference: Activating full LoRA experts per input leads to subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: Newly added experts often duplicate or contradict existing knowledge due to unnecessary activation of unrelated ranks and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features across tasks confuse the router, resulting in unstable expert assignments. As more experts accumulate, earlier task routing degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated and sparse rank activation for CL. Unlike mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-1 components, each treated as an independent expert, enabling fine-grained mixture of rank-1 expert utilization while mitigating interference and redundancy. To avoid ambiguous routing, we propose that each rank-1 expert can infer its own relevance via intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning tasks with CLIP and large language models (LLMs), analyzing both in-domain learning and out-of-domain forgetting/generalization during fine-tuning. MoRA shows significant effectiveness on enhancing CL with PTMs, and improving generalization while mitigating forgetting.

Title: An Information-Theoretic Analysis for Federated Learning under Concept Drift

Authors: Fu Peng, Meng Zhang, Ming Tang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2506.21036
Pdf URL: https://arxiv.org/pdf/2506.21036
Copy Paste: [[2506.21036]] An Information-Theoretic Analysis for Federated Learning under Concept Drift(https://arxiv.org/abs/2506.21036)
Keywords: federate
Abstract: Recent studies in federated learning (FL) commonly train models on static datasets. However, real-world data often arrives as streams with shifting distributions, causing performance degradation known as concept drift. This paper analyzes FL performance under concept drift using information theory and proposes an algorithm to mitigate the performance degradation. We model concept drift as a Markov chain and introduce the \emph{Stationary Generalization Error} to assess a model's capability to capture characteristics of future unseen data. Its upper bound is derived using KL divergence and mutual information. We study three drift patterns (periodic, gradual, and random) and their impact on FL performance. Inspired by this, we propose an algorithm that regularizes the empirical risk minimization approach with KL divergence and mutual information, thereby enhancing long-term performance. We also explore the performance-cost tradeoff by identifying a Pareto front. To validate our approach, we build an FL testbed using Raspberry Pi4 devices. Experimental results corroborate with theoretical findings, confirming that drift patterns significantly affect performance. Our method consistently outperforms existing approaches for these three patterns, demonstrating its effectiveness in adapting concept drift in FL.

Title: Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability

Authors: Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21042
Pdf URL: https://arxiv.org/pdf/2506.21042
Copy Paste: [[2506.21042]] Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability(https://arxiv.org/abs/2506.21042)
Keywords: robust, diffusion
Abstract: Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on object. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on COCO generalization benchmark demonstrate that our method maintains significant advantages and show remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at \href{this https URL}{Fitness-Generalization-Transferability}.

Title: Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling

Authors: Hansam Cho, Seoung Bum Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21045
Pdf URL: https://arxiv.org/pdf/2506.21045
Copy Paste: [[2506.21045]] Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling(https://arxiv.org/abs/2506.21045)
Keywords: diffusion
Abstract: Text-guided diffusion models have become essential for high-quality image synthesis, enabling dynamic image editing. In image editing, two crucial aspects are editability, which determines the extent of modification, and faithfulness, which reflects how well unaltered elements are preserved. However, achieving optimal results is challenging because of the inherent trade-off between editability and faithfulness. To address this, we propose Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with minimal impact on editability. FGS incorporates faithfulness guidance to strengthen the preservation of input image information and introduces a scheduling strategy to resolve misalignment between editability and faithfulness. Experimental results demonstrate that FGS achieves superior faithfulness while maintaining editability. Moreover, its compatibility with various editing methods enables precise, high-quality image edits across diverse tasks.

Title: Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Authors: Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, Yuanzhang Li
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2506.21046
Pdf URL: https://arxiv.org/pdf/2506.21046
Copy Paste: [[2506.21046]] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features(https://arxiv.org/abs/2506.21046)
Keywords: attack, transformer, generative
Abstract: The ability of deep neural networks (DNNs) come from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbation that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperform state-of-the-arts. Code available at this https URL.

Title: MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

Authors: Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li, Hu Huang, Bowen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21053
Pdf URL: https://arxiv.org/pdf/2506.21053
Copy Paste: [[2506.21053]] MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection(https://arxiv.org/abs/2506.21053)
Keywords: large language model
Abstract: In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.

Title: FedDAA: Dynamic Client Clustering for Concept Drift Adaptation in Federated Learning

Authors: Fu Peng, Ming Tang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21054
Pdf URL: https://arxiv.org/pdf/2506.21054
Copy Paste: [[2506.21054]] FedDAA: Dynamic Client Clustering for Concept Drift Adaptation in Federated Learning(https://arxiv.org/abs/2506.21054)
Keywords: federate
Abstract: In federated learning (FL), the data distribution of each client may change over time, introducing both temporal and spatial data heterogeneity, known as concept drift. Data heterogeneity arises from three drift sources: real drift (a shift in the conditional distribution P(y|x)), virtual drift (a shift in the input distribution P(x)), and label drift (a shift in the label distribution P(y)). However, most existing FL methods addressing concept drift primarily focus on real drift. When clients experience virtual or label drift, these methods often fail to selectively retain useful historical knowledge, leading to catastrophic forgetting. A key challenge lies in distinguishing different sources of drift, as they require distinct adaptation strategies: real drift calls for discarding outdated data, while virtual or label drift benefits from retaining historical data. Without explicitly identifying the drift sources, a general adaptation strategy is suboptimal and may harm generalization. To address this challenge, we propose FedDAA, a dynamic clustered FL framework designed to adapt to multi-source concept drift while preserving valuable historical knowledge. Specifically, FedDAA integrates three modules: a cluster number determination module to find the optimal number of clusters; a real drift detection module to distinguish real drift from virtual/label drift; and a concept drift adaptation module to adapt to new data while retaining useful historical information. We provide theoretical convergence guarantees, and experiments show that FedDAA achieves 7.84% to 8.52% accuracy improvements over state-of-the-art methods on Fashion-MNIST, CIFAR-10, and CIFAR-100.

Title: Class-Agnostic Region-of-Interest Matching in Document Images

Authors: Demin Zhang, Jiahao Lyu, Zhijie Shen, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21055
Pdf URL: https://arxiv.org/pdf/2506.21055
Copy Paste: [[2506.21055]] Class-Agnostic Region-of-Interest Matching in Document Images(https://arxiv.org/abs/2506.21055)
Keywords: extraction
Abstract: Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named ``Class-Agnostic Region-of-Interest Matching'' (``RoI-Matching'' for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at this https URL.

Title: SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification

Authors: Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21056
Pdf URL: https://arxiv.org/pdf/2506.21056
Copy Paste: [[2506.21056]] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification(https://arxiv.org/abs/2506.21056)
Keywords: robust, segmentation
Abstract: Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.

Title: TEMPEST-LoRa: Cross-Technology Covert Communication

Authors: Xieyang Sun, Yuanqing Zheng, Wei Xi, Zuhao Chen, Zhizhen Chen, Han Hao, Zhiping Jiang, Sheng Zhong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.21069
Pdf URL: https://arxiv.org/pdf/2506.21069
Copy Paste: [[2506.21069]] TEMPEST-LoRa: Cross-Technology Covert Communication(https://arxiv.org/abs/2506.21069)
Keywords: security, attack
Abstract: Electromagnetic (EM) covert channels pose significant threats to computer and communications security in air-gapped networks. Previous works exploit EM radiation from various components (e.g., video cables, memory buses, CPUs) to secretly send sensitive information. These approaches typically require the attacker to deploy highly specialized receivers near the victim, which limits their real-world impact. This paper reports a new EM covert channel, TEMPEST-LoRa, that builds on Cross-Technology Covert Communication (CTCC), which could allow attackers to covertly transmit EM-modulated secret data from air-gapped networks to widely deployed operational LoRa receivers from afar. We reveal the potential risk and demonstrate the feasibility of CTCC by tackling practical challenges involved in manipulating video cables to precisely generate the EM leakage that could readily be received by third-party commercial LoRa nodes/gateways. Experiment results show that attackers can reliably decode secret data modulated by the EM leakage from a video cable at a maximum distance of 87.5m or a rate of 21.6 kbps. We note that the secret data transmission can be performed with monitors turned off (therefore covertly).

Title: Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph

Authors: Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21071
Pdf URL: https://arxiv.org/pdf/2506.21071
Copy Paste: [[2506.21071]] Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph(https://arxiv.org/abs/2506.21071)
Keywords: large language model
Abstract: Teaching large language models (LLMs) to use tools is crucial for improving their problem-solving abilities and expanding their applications. However, effectively using tools is challenging because it requires a deep understanding of tool functionalities and user intentions. Previous methods relied mainly on LLMs to generate instruction data, but the quality of these data was often insufficient. In this paper, we propose a new method that uses knowledge graphs to generate high-quality instruction data for LLMs. Knowledge graphs are manually curated datasets rich in semantic information. We begin by extracting various query pathways from a given knowledge graph, which are transformed into a broad spectrum of user queries. We then translate the relationships between entities into actionable tools and parse the pathways of each query into detailed solution steps, thereby creating high-quality instruction data. Our experiments show that fine-tuning on just a small sample of this synthetic data can significantly improve the tool utilization and overall capabilities of LLMs.

Title: Chain-of-Thought Enhanced Shallow Transformers for Wireless Symbol Detection

Authors: Li Fan, Peng Wang, Jing Yang, Cong Shen
Subjects: cs.LG, cs.IT, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2506.21093
Pdf URL: https://arxiv.org/pdf/2506.21093
Copy Paste: [[2506.21093]] Chain-of-Thought Enhanced Shallow Transformers for Wireless Symbol Detection(https://arxiv.org/abs/2506.21093)
Keywords: transformer
Abstract: Transformers have shown potential in solving wireless communication problems, particularly via in-context learning (ICL), where models adapt to new tasks through prompts without requiring model updates. However, prior ICL-based Transformer models rely on deep architectures with many layers to achieve satisfactory performance, resulting in substantial storage and computational costs. In this work, we propose CHain Of thOught Symbol dEtection (CHOOSE), a CoT-enhanced shallow Transformer framework for wireless symbol detection. By introducing autoregressive latent reasoning steps within the hidden space, CHOOSE significantly improves the reasoning capacity of shallow models (1-2 layers) without increasing model depth. This design enables lightweight Transformers to achieve detection performance comparable to much deeper models, making them well-suited for deployment on resource-constrained mobile devices. Experimental results demonstrate that our approach outperforms conventional shallow Transformers and achieves performance comparable to that of deep Transformers, while maintaining storage and computational efficiency. This represents a promising direction for implementing Transformer-based algorithms in wireless receivers with limited computational resources.

Title: FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

Authors: Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21095
Pdf URL: https://arxiv.org/pdf/2506.21095
Copy Paste: [[2506.21095]] FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation(https://arxiv.org/abs/2506.21095)
Keywords: robust, federate, fair
Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing clients' private data. However, fairness remains a key concern, as biases in local clients' datasets can impact the entire federated system. Heterogeneous data distributions across clients may lead to models that are fairer for some clients than others. Although several fairness-enhancing solutions are present in the literature, most focus on mitigating bias for a single sensitive attribute, typically binary, overlooking the diverse and sometimes conflicting fairness needs of different clients. This limited perspective can limit the effectiveness of fairness interventions for the different clients. To support more robust and reproducible fairness research in FL, we aim to enable a consistent benchmarking of fairness-aware FL methods at both the global and client levels. In this paper, we contribute in three ways: (1) We introduce FeDa4Fair, a library to generate tabular datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release four bias-heterogeneous datasets and corresponding benchmarks to compare fairness mitigation methods in a controlled environment; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.

Title: OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

Authors: Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, SevenShu, Yunsheng Wu, Yongge Liu, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21101
Pdf URL: https://arxiv.org/pdf/2506.21101
Copy Paste: [[2506.21101]] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography(https://arxiv.org/abs/2506.21101)
Keywords: large language model
Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.

Title: Interpretable Hierarchical Concept Reasoning through Attention-Guided Graph Learning

Authors: David Debot, Pietro Barbiero, Gabriele Dominici, Giuseppe Marra
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21102
Pdf URL: https://arxiv.org/pdf/2506.21102
Copy Paste: [[2506.21102]] Interpretable Hierarchical Concept Reasoning through Attention-Guided Graph Learning(https://arxiv.org/abs/2506.21102)
Keywords: interpretability
Abstract: Concept-Based Models (CBMs) are a class of deep learning models that provide interpretability by explaining predictions through high-level concepts. These models first predict concepts and then use them to perform a downstream task. However, current CBMs offer interpretability only for the final task prediction, while the concept predictions themselves are typically made via black-box neural networks. To address this limitation, we propose Hierarchical Concept Memory Reasoner (H-CMR), a new CBM that provides interpretability for both concept and task predictions. H-CMR models relationships between concepts using a learned directed acyclic graph, where edges represent logic rules that define concepts in terms of other concepts. During inference, H-CMR employs a neural attention mechanism to select a subset of these rules, which are then applied hierarchically to predict all concepts and the final task. Experimental results demonstrate that H-CMR matches state-of-the-art performance while enabling strong human interaction through concept and model interventions. The former can significantly improve accuracy at inference time, while the latter can enhance data efficiency during training when background knowledge is available.

Title: Learning to Skip the Middle Layers of Transformers

Authors: Tim Lawson, Laurence Aitchison
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21103
Pdf URL: https://arxiv.org/pdf/2506.21103
Copy Paste: [[2506.21103]] Learning to Skip the Middle Layers of Transformers(https://arxiv.org/abs/2506.21103)
Keywords: interpretability, transformer
Abstract: Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at this https URL.

Title: PhishKey: A Novel Centroid-Based Approach for Enhanced Phishing Detection Using Adaptive HTML Component Extraction

Authors: Felipe Castaño, Eduardo Fidalgo, Enrique Alegre, Rocio Alaiz-Rodríguez, Raul Orduna, Francesco Zola
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21106
Pdf URL: https://arxiv.org/pdf/2506.21106
Copy Paste: [[2506.21106]] PhishKey: A Novel Centroid-Based Approach for Enhanced Phishing Detection Using Adaptive HTML Component Extraction(https://arxiv.org/abs/2506.21106)
Keywords: security, attack, robust, extraction
Abstract: Phishing attacks pose a significant cybersecurity threat, evolving rapidly to bypass detection mechanisms and exploit human vulnerabilities. This paper introduces PhishKey to address the challenges of adaptability, robustness, and efficiency. PhishKey is a novel phishing detection method using automatic feature extraction from hybrid sources. PhishKey combines character-level processing with Convolutional Neural Networks (CNN) for URL classification, and a Centroid-Based Key Component Phishing Extractor (CAPE) for HTML content at the word level. CAPE reduces noise and ensures complete sample processing avoiding crop operations on the input data. The predictions from both modules are integrated using a soft-voting ensemble to achieve more accurate and reliable classifications. Experimental evaluations on four state-of-the-art datasets demonstrate the effectiveness of PhishKey. It achieves up to 98.70% F1 Score and shows strong resistance to adversarial manipulations such as injection attacks with minimal performance degradation.

Title: Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges

Authors: Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
Subjects: cs.LG, q-bio.MN
Abstract URL: https://arxiv.org/abs/2506.21107
Pdf URL: https://arxiv.org/pdf/2506.21107
Copy Paste: [[2506.21107]] Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges(https://arxiv.org/abs/2506.21107)
Keywords: diffusion
Abstract: Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell's phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired. Existing methods either attempt to forcibly pair unpaired data using random sampling, or neglect the inherent relationship between unperturbed and perturbed cells during the modeling. In this work, we propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions, effectively addressing the challenge of unpaired data. We further interpret this framework as a form of data augmentation. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way, and further incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles. Moreover, gene expression under the same perturbation often varies significantly across cells, frequently exhibiting a bimodal distribution that reflects intrinsic heterogeneity. To capture this, we introduce a more suitable evaluation metric. We propose Unlasting, dual conditional diffusion models that overcome the problem of unpaired single-cell perturbation data and strengthen the model's insight into perturbations under the guidance of the GRN, with a dedicated mask model designed to improve generation quality by predicting silent genes. In addition, we introduce a biologically grounded evaluation metric that better reflects the inherent heterogeneity in single-cell responses.

Title: IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Authors: Yujia Liang, Jile Jiao, Zhicheng Wang, Xuetao Feng, Zixuan Ye, Yuan Wang, Hao Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21116
Pdf URL: https://arxiv.org/pdf/2506.21116
Copy Paste: [[2506.21116]] IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes(https://arxiv.org/abs/2506.21116)
Keywords: large language model
Abstract: Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but are found struggling to tackle multi-shot scenarios,e.g., video clips with varying camera angles or scene changes. This challenge can render failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets and therefore we introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only enhance the multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.

Title: CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

Authors: Jan Ackermann, Jonas Kulhanek, Shengqu Cai, Haofei Xu, Marc Pollefeys, Gordon Wetzstein, Leonidas Guibas, Songyou Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21117
Pdf URL: https://arxiv.org/pdf/2506.21117
Copy Paste: [[2506.21117]] CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization(https://arxiv.org/abs/2506.21117)
Keywords: robust, segmentation
Abstract: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.

Title: Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models

Authors: Xiaoshuang Ji, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Zeyao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21119
Pdf URL: https://arxiv.org/pdf/2506.21119
Copy Paste: [[2506.21119]] Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models(https://arxiv.org/abs/2506.21119)
Keywords: transformer
Abstract: Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, the novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated transformer blocks based on the contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25\%, while still maintaining competitive performance. And it also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.

Title: Robust Policy Switching for Antifragile Reinforcement Learning for UAV Deconfliction in Adversarial Environments

Authors: Deepak Kumar Panda, Weisi Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21127
Pdf URL: https://arxiv.org/pdf/2506.21127
Copy Paste: [[2506.21127]] Robust Policy Switching for Antifragile Reinforcement Learning for UAV Deconfliction in Adversarial Environments(https://arxiv.org/abs/2506.21127)
Keywords: attack, robust
Abstract: The increasing automation of navigation for unmanned aerial vehicles (UAVs) has exposed them to adversarial attacks that exploit vulnerabilities in reinforcement learning (RL) through sensor manipulation. Although existing robust RL methods aim to mitigate such threats, their effectiveness has limited generalization to out-of-distribution shifts from the optimal value distribution, as they are primarily designed to handle fixed perturbation. To address this limitation, this paper introduces an antifragile RL framework that enhances adaptability to broader distributional shifts by incorporating a switching mechanism based on discounted Thompson sampling (DTS). This mechanism dynamically selects among multiple robust policies to minimize adversarially induced state-action-value distribution shifts. The proposed approach first derives a diverse ensemble of action robust policies by accounting for a range of perturbations in the policy space. These policies are then modeled as a multiarmed bandit (MAB) problem, where DTS optimally selects policies in response to nonstationary Bernoulli rewards, effectively adapting to evolving adversarial strategies. Theoretical framework has also been provided where by optimizing the DTS to minimize the overall regrets due to distributional shift, results in effective adaptation against unseen adversarial attacks thus inducing antifragility. Extensive numerical simulations validate the effectiveness of the proposed framework in complex navigation environments with multiple dynamic three-dimensional obstacles and with stronger projected gradient descent (PGD) and spoofing attacks. Compared to conventional robust, non-adaptive RL methods, the antifragile approach achieves superior performance, demonstrating shorter navigation path lengths and a higher rate of conflict-free navigation trajectories compared to existing robust RL techniques

Title: Curriculum-Guided Antifragile Reinforcement Learning for Secure UAV Deconfliction under Observation-Space Attacks

Authors: Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21129
Pdf URL: https://arxiv.org/pdf/2506.21129
Copy Paste: [[2506.21129]] Curriculum-Guided Antifragile Reinforcement Learning for Secure UAV Deconfliction under Observation-Space Attacks(https://arxiv.org/abs/2506.21129)
Keywords: secure, attack, robust
Abstract: Reinforcement learning (RL) policies deployed in safety-critical systems, such as unmanned aerial vehicle (UAV) navigation in dynamic airspace, are vulnerable to out-ofdistribution (OOD) adversarial attacks in the observation space. These attacks induce distributional shifts that significantly degrade value estimation, leading to unsafe or suboptimal decision making rendering the existing policy fragile. To address this vulnerability, we propose an antifragile RL framework designed to adapt against curriculum of incremental adversarial perturbations. The framework introduces a simulated attacker which incrementally increases the strength of observation-space perturbations which enables the RL agent to adapt and generalize across a wider range of OOD observations and anticipate previously unseen attacks. We begin with a theoretical characterization of fragility, formally defining catastrophic forgetting as a monotonic divergence in value function distributions with increasing perturbation strength. Building on this, we define antifragility as the boundedness of such value shifts and derive adaptation conditions under which forgetting is stabilized. Our method enforces these bounds through iterative expert-guided critic alignment using Wasserstein distance minimization across incrementally perturbed observations. We empirically evaluate the approach in a UAV deconfliction scenario involving dynamic 3D obstacles. Results show that the antifragile policy consistently outperforms standard and robust RL baselines when subjected to both projected gradient descent (PGD) and GPS spoofing attacks, achieving up to 15% higher cumulative reward and over 30% fewer conflict events. These findings demonstrate the practical and theoretical viability of antifragile reinforcement learning for secure and resilient decision-making in environments with evolving threat scenarios.

Title: Learning to See in the Extremely Dark

Authors: Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21132
Pdf URL: https://arxiv.org/pdf/2506.21132
Copy Paste: [[2506.21132]] Learning to See in the Extremely Dark(https://arxiv.org/abs/2506.21132)
Keywords: diffusion, generative
Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability to extremely dark scenes where the environmental illuminance drops as low as 0.0001 lux remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at this https URL.

Title: Inside Job: Defending Kubernetes Clusters Against Network Misconfigurations

Authors: Jacopo Bufalino, Jose Luis Martin-Navarro, Mario Di Francesco, Tuomas Aura
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2506.21134
Pdf URL: https://arxiv.org/pdf/2506.21134
Copy Paste: [[2506.21134]] Inside Job: Defending Kubernetes Clusters Against Network Misconfigurations(https://arxiv.org/abs/2506.21134)
Keywords: security
Abstract: Kubernetes has emerged as the de facto standard for container orchestration. Unfortunately, its increasing popularity has also made it an attractive target for malicious actors. Despite extensive research on securing Kubernetes, little attention has been paid to the impact of network configuration on the security of application deployments. This paper addresses this gap by conducting a comprehensive analysis of network misconfigurations in a Kubernetes cluster with specific reference to lateral movement. Accordingly, we carried out an extensive evaluation of 287 open-source applications belonging to six different organizations, ranging from IT companies and public entities to non-profits. As a result, we identified 634 misconfigurations, well beyond what could be found by solutions in the state of the art. We responsibly disclosed our findings to the concerned organizations and engaged in a discussion to assess their severity. As of now, misconfigurations affecting more than thirty applications have been fixed with the mitigations we proposed.

Title: YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection

Authors: Jiawei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21135
Pdf URL: https://arxiv.org/pdf/2506.21135
Copy Paste: [[2506.21135]] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection(https://arxiv.org/abs/2506.21135)
Keywords: robust
Abstract: Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF) to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.

Title: NaLaFormer: Norm-Aware Linear Attention for Transformer Models

Authors: Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21137
Pdf URL: https://arxiv.org/pdf/2506.21137
Copy Paste: [[2506.21137]] NaLaFormer: Norm-Aware Linear Attention for Transformer Models(https://arxiv.org/abs/2506.21137)
Keywords: transformer
Abstract: Linear attention has emerged as a viable alternative to softmax attention by reducing complexity from quadratic to linear in sequence length. To preserve two fundamental properties of softmax, non-negativity and entropy reduction, current works employ various linearly separatable kernel functions with $L1$ normalization instead of softmax operator. However, query norms are neglected by the normalization operation in linear attention, such degradation heavily leads to an entropy gap. Meanwhile, existing works inhibit negative values of query and key vectors resulting in a missing inner-product interactions after being mapped. To address these dual challenges, we propose a novel Norm-Aware Linear Attention mechanism serving to restore norm-guided dynamic spikiness and recover kernel-perturbed norm distributions. Specifically, we first decouple query and key matrices into two components: norm and direction, to achieve norm-aware spikiness control and norm consistency, respectively. We mathematically reveal that the extent of entropy reduction varies with the query norm in softmax normalization, motivating a query-norm aware kernel function for dynamic control over entropy reduction. Furthermore, to ensure norm consistency and enforce non-negativity constraints, we employ a norm-preserving mapping to project all elements of the angular matrix into positive values, leveraging cosine similarity to inhibit dimensions with opposite directions. We conduct extensive experiments demonstrating that the NaLaFormer improves performance on vision and language tasks, enhancing both expressiveness and efficiency by up to 4.2\%.

Title: DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding

Authors: Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21140
Pdf URL: https://arxiv.org/pdf/2506.21140
Copy Paste: [[2506.21140]] DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding(https://arxiv.org/abs/2506.21140)
Keywords: robust, interpretability, transformer
Abstract: Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Recent CNN-Transformer (Conformers) hybrids partially address this issue, but most adopt a serial design, resulting in suboptimal integration of local and global features, and often overlook explicit channel-wise modeling. To address these limitations, we propose DBConformer, a dual-branch convolutional Transformer network tailored for EEG decoding. It integrates a temporal Conformer to model long-range temporal dependencies and a spatial Conformer to extract inter-channel interactions, capturing both temporal dynamics and spatial patterns in EEG signals. A lightweight channel attention module further refines spatial representations by assigning data-driven importance to EEG channels. Extensive experiments on five motor imagery (MI) datasets and two seizure detection datasets under three evaluation settings demonstrate that DBConformer consistently outperforms 10 competitive baseline models, with over eight times fewer parameters than the high-capacity EEG Conformer baseline. Further, the visualization results confirm that the features extracted by DBConformer are physiologically interpretable and aligned with sensorimotor priors in MI. The superior performance and interpretability of DBConformer make it reliable for robust and explainable EEG decoding. Code is publicized at this https URL.

Title: Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks

Authors: Deepak Kumar Panda, Weisi Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21142
Pdf URL: https://arxiv.org/pdf/2506.21142
Copy Paste: [[2506.21142]] Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks(https://arxiv.org/abs/2506.21142)
Keywords: attack, robust, steal, generative
Abstract: The growing integration of UAVs into civilian airspace underscores the need for resilient and intelligent intrusion detection systems (IDS), as traditional anomaly detection methods often fail to identify novel threats. A common approach treats unfamiliar attacks as out-of-distribution (OOD) samples; however, this leaves systems vulnerable when mitigation is inadequate. Moreover, conventional OOD detectors struggle to distinguish stealthy adversarial attacks from genuine OOD events. This paper introduces a conditional generative adversarial network (cGAN)-based framework for crafting stealthy adversarial attacks that evade IDS mechanisms. We first design a robust multi-class IDS classifier trained on benign UAV telemetry and known cyber-attacks, including Denial of Service (DoS), false data injection (FDI), man-in-the-middle (MiTM), and replay attacks. Using this classifier, our cGAN perturbs known attacks to generate adversarial samples that misclassify as benign while retaining statistical resemblance to OOD distributions. These adversarial samples are iteratively refined to achieve high stealth and success rates. To detect such perturbations, we implement a conditional variational autoencoder (CVAE), leveraging negative log-likelihood to separate adversarial inputs from authentic OOD samples. Comparative evaluation shows that CVAE-based regret scores significantly outperform traditional Mahalanobis distance-based detectors in identifying stealthy adversarial threats. Our findings emphasize the importance of advanced probabilistic modeling to strengthen IDS capabilities against adaptive, generative-model-based cyber intrusions.

Title: Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion

Authors: Yuguang Zhang, Kuangpu Guo, Zhihe Lu, Yunbo Wang, Jian Liang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21144
Pdf URL: https://arxiv.org/pdf/2506.21144
Copy Paste: [[2506.21144]] Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion(https://arxiv.org/abs/2506.21144)
Keywords: federate
Abstract: Federated learning (FL) enables collaborative model training across decentralized clients without sharing local data, but is challenged by heterogeneity in data, computation, and communication. Pretrained vision-language models (VLMs), with their strong generalization and lightweight tuning via prompts, offer a promising solution. However, existing federated prompt-learning methods rely only on text prompts and overlook joint label-domain distribution shifts. In this paper, we propose a personalized FL framework based on dual-prompt learning and cross fusion, termed pFedDC. Specifically, each client maintains both global and local prompts across vision and language modalities: global prompts capture common knowledge shared across the federation, while local prompts encode client-specific semantics and domain characteristics. Meanwhile, a cross-fusion module is designed to adaptively integrate prompts from different levels, enabling the model to generate personalized representations aligned with each client's unique data distribution. Extensive experiments across nine datasets with various types of heterogeneity show that pFedDC consistently outperforms state-of-the-art methods.

Title: Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation

Authors: Junwen Wang, Oscar Maccormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21150
Pdf URL: https://arxiv.org/pdf/2506.21150
Copy Paste: [[2506.21150]] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation(https://arxiv.org/abs/2506.21150)
Keywords: segmentation
Abstract: Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising $107$ classes organised in a clinically-defined semantic tree structure. Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.

Title: Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels

Authors: Aida Moafi, Danial Moafi, Evgeny M. Mirkes, Gerry P. McCann, Abbas S. Alatrany, Jayanth R. Arnold, Mostafa Mehdipour Ghazi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21151
Pdf URL: https://arxiv.org/pdf/2506.21151
Copy Paste: [[2506.21151]] Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels(https://arxiv.org/abs/2506.21151)
Keywords: robust, segmentation
Abstract: The accurate segmentation of myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning. In this study, we propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation by fine-tuning state-of-the-art models. The method explicitly addresses challenges of label noise from semi-automatic annotations, data heterogeneity, and class imbalance through the use of Kullback-Leibler loss and extensive data augmentation. We evaluate the model's performance on both acute and chronic cases and demonstrate its ability to produce accurate and smooth segmentations despite noisy labels. In particular, our approach outperforms state-of-the-art models like nnU-Net and shows strong generalizability in an out-of-distribution test set, highlighting its robustness across various imaging conditions and clinical tasks. These results establish a reliable foundation for automated myocardial scar quantification and support the broader clinical adoption of deep learning in cardiac imaging.

Title: Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image

Authors: Pufan Li, Bi'an Du, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21152
Pdf URL: https://arxiv.org/pdf/2506.21152
Copy Paste: [[2506.21152]] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image(https://arxiv.org/abs/2506.21152)
Keywords: robust, diffusion
Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To takle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments demonstrate the higher-fidelity reconstruction results of our method, outperforming existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.

Title: Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design

Authors: Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21158
Pdf URL: https://arxiv.org/pdf/2506.21158
Copy Paste: [[2506.21158]] Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design(https://arxiv.org/abs/2506.21158)
Keywords: generative
Abstract: In many real-world applications, evaluating the goodness of instances is often costly and time-consuming, e.g., human feedback and physics simulations, in contrast to proposing new instances. In particular, this is even more critical in reinforcement learning, as new interactions with the environment (i.e., new instances) need to be evaluated to provide a reward signal to learn from. As sufficient exploration is crucial, learning from a diverse mini-batch can have a large impact and help mitigate mode collapse. In this paper, we introduce diverse mini-batch selection for reinforcement learning and propose to use determinantal point processes for this task. We study this framework in the context of a real-world problem, namely drug discovery. We experimentally study how our proposed framework can improve the effectiveness of chemical exploration in de novo drug design, where finding diverse and high-quality solutions is essential. We conduct a comprehensive evaluation with three well-established molecular generation oracles over numerous generative steps. Our experiments conclude that our diverse mini-batch selection framework can substantially improve the diversity of the solutions, while still obtaining solutions of high quality. In drug discovery, such outcome can potentially lead to fulfilling unmet medication needs faster.

Title: Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition

Authors: Longkun Zou, Kangjun Liu, Ke Chen, Kailing Guo, Kui Jia, Yaowei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21165
Pdf URL: https://arxiv.org/pdf/2506.21165
Copy Paste: [[2506.21165]] Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition(https://arxiv.org/abs/2506.21165)
Keywords: robust
Abstract: Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at this https URL.

Title: Compressed and Smooth Latent Space for Text Diffusion Modeling

Authors: Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21170
Pdf URL: https://arxiv.org/pdf/2506.21170
Copy Paste: [[2506.21170]] Compressed and Smooth Latent Space for Text Diffusion Modeling(https://arxiv.org/abs/2506.21170)
Keywords: robust, diffusion, generative
Abstract: Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.

Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Authors: Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2506.21182
Pdf URL: https://arxiv.org/pdf/2506.21182
Copy Paste: [[2506.21182]] Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks(https://arxiv.org/abs/2506.21182)
Keywords: robust
Abstract: The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: this https URL

Title: Task-Aware KV Compression For Cost-Effective Long Video Understanding

Authors: Minghao Qin, Yan Shu, Peitian Zhang, Kun Lun, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21184
Pdf URL: https://arxiv.org/pdf/2506.21184
Copy Paste: [[2506.21184]] Task-Aware KV Compression For Cost-Effective Long Video Understanding(https://arxiv.org/abs/2506.21184)
Keywords: large language model
Abstract: Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experiment result shows that Video-X^2L outperforms existing KV-compression methods by a huge advantage while substantially saving the computation cost.

Title: Artificial Delegates Resolve Fairness Issues in Perpetual Voting with Partial Turnout

Authors: Apurva Shah, Axel Abels, Ann Nowé, Tom Lenaerts
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21186
Pdf URL: https://arxiv.org/pdf/2506.21186
Copy Paste: [[2506.21186]] Artificial Delegates Resolve Fairness Issues in Perpetual Voting with Partial Turnout(https://arxiv.org/abs/2506.21186)
Keywords: robust, fair
Abstract: Perpetual voting addresses fairness in sequential collective decision-making by evaluating representational equity over time. However, existing perpetual voting rules rely on full participation and complete approval information, assumptions that rarely hold in practice, where partial turnout is the norm. In this work, we study the integration of Artificial Delegates, preference-learning agents trained to represent absent voters, into perpetual voting systems. We examine how absenteeism affects fairness and representativeness under various voting methods and evaluate the extent to which Artificial Delegates can compensate for missing participation. Our findings indicate that while absenteeism significantly affects fairness, Artificial Delegates reliably mitigate these effects and enhance robustness across diverse scenarios.

Title: GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Authors: Zijun Lin, Shuting He, Cheston Tan, Bihan Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21188
Pdf URL: https://arxiv.org/pdf/2506.21188
Copy Paste: [[2506.21188]] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding(https://arxiv.org/abs/2506.21188)
Keywords: large language model
Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5\% and +10.2\%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

Title: Prompt-Guided Turn-Taking Prediction

Authors: Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21191
Pdf URL: https://arxiv.org/pdf/2506.21191
Copy Paste: [[2506.21191]] Prompt-Guided Turn-Taking Prediction(https://arxiv.org/abs/2506.21191)
Keywords: transformer, large language model
Abstract: Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as "faster" or "calmer" adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

Title: Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

Authors: Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng, Hang Liu, Kailun Yang, Hui Zhang
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2506.21198
Pdf URL: https://arxiv.org/pdf/2506.21198
Copy Paste: [[2506.21198]] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation(https://arxiv.org/abs/2506.21198)
Keywords: segmentation
Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at this https URL.

Title: MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification

Authors: Shadman Sobhan, Kazi Abrar Mahmud, Abduz Zami
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2506.21199
Pdf URL: https://arxiv.org/pdf/2506.21199
Copy Paste: [[2506.21199]] MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification(https://arxiv.org/abs/2506.21199)
Keywords: interpretability, large language model, segmentation
Abstract: Current medical image analysis systems are typically task-specific, requiring separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. To address these challenges, we introduce MedPrompt, a unified framework that combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. The LLM interprets user instructions and generates structured output to dynamically route task-specific pretrained weights. This weight routing approach avoids retraining the entire framework when adding new tasks-only task-specific weights are required, enhancing scalability and deployment. We evaluated MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging modalities. The system achieves a 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds, making it suitable for near real-time applications. DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). Overall, MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs.

Title: BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models

Authors: Louis Kerner, Michel Meintz, Bihe Zhao, Franziska Boenisch, Adam Dziedzic
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21209
Pdf URL: https://arxiv.org/pdf/2506.21209
Copy Paste: [[2506.21209]] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models(https://arxiv.org/abs/2506.21209)
Keywords: robust, watermark, diffusion, generative
Abstract: State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data-potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images-enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity's image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.

Title: Complexity-aware fine-tuning

Authors: Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21220
Pdf URL: https://arxiv.org/pdf/2506.21220
Copy Paste: [[2506.21220]] Complexity-aware fine-tuning(https://arxiv.org/abs/2506.21220)
Keywords: large language model
Abstract: General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\approx 3B$) we split the training data into complexity categories by a single token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.55$ vs $0.43$ average accuracy) and provides comparable with distillation performance while using $62\%$ less data ($0.55$ average accuracy for both). We publish our code and data to facilitate further research in this direction.

Title: Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval

Authors: Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21222
Pdf URL: https://arxiv.org/pdf/2506.21222
Copy Paste: [[2506.21222]] Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval(https://arxiv.org/abs/2506.21222)
Keywords: extraction, large language model
Abstract: Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to \emph{syntactic} rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.

Title: ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

Authors: Xiwei Xuan, Ziquan Deng, Kwan-Liu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21233
Pdf URL: https://arxiv.org/pdf/2506.21233
Copy Paste: [[2506.21233]] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation(https://arxiv.org/abs/2506.21233)
Keywords: segmentation
Abstract: Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at this https URL .

Title: Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping

Authors: Qifei Cui, Yuang Zhou, Ruichen Deng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21234
Pdf URL: https://arxiv.org/pdf/2506.21234
Copy Paste: [[2506.21234]] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping(https://arxiv.org/abs/2506.21234)
Keywords: transformer
Abstract: This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM-a sequence-to-sequence Transformer with self-attention-combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace, preserving wrist orientation.

Title: DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation

Authors: Umaima Rahman, Mohammad Yaqub, Dwarikanath Mahapatra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21237
Pdf URL: https://arxiv.org/pdf/2506.21237
Copy Paste: [[2506.21237]] DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation(https://arxiv.org/abs/2506.21237)
Keywords: robust
Abstract: We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments demonstrate DiMPLe demonstrates superior performance compared to CoOp-OOD, when averaged across 11 diverse datasets, and achieves absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.

Title: Zero-Shot Learning for Obsolescence Risk Forecasting

Authors: Elie Saad, Aya Mrabah, Mariem Besbes, Marc Zolghadri, Victor Czmil, Claude Baron, Vincent Bourgeois
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21240
Pdf URL: https://arxiv.org/pdf/2506.21240
Copy Paste: [[2506.21240]] Zero-Shot Learning for Obsolescence Risk Forecasting(https://arxiv.org/abs/2506.21240)
Keywords: security, large language model
Abstract: Component obsolescence poses significant challenges in industries reliant on electronic components, causing increased costs and disruptions in the security and availability of systems. Accurate obsolescence risk prediction is essential but hindered by a lack of reliable data. This paper proposes a novel approach to forecasting obsolescence risk using zero-shot learning (ZSL) with large language models (LLMs) to address data limitations by leveraging domain-specific knowledge from tabular datasets. Applied to two real-world datasets, the method demonstrates effective risk prediction. A comparative evaluation of four LLMs underscores the importance of selecting the right model for specific forecasting tasks.

Title: Temporal Rate Reduction Clustering for Human Motion Segmentation

Authors: Xianghan Meng, Zhengyu Tong, Zhiyuan Huang, Chun-Guang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21249
Pdf URL: https://arxiv.org/pdf/2506.21249
Copy Paste: [[2506.21249]] Temporal Rate Reduction Clustering for Human Motion Segmentation(https://arxiv.org/abs/2506.21249)
Keywords: segmentation
Abstract: Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ maintain temporally consistent and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performances with different feature extractors.

Title: Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents

Authors: Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21252
Pdf URL: https://arxiv.org/pdf/2506.21252
Copy Paste: [[2506.21252]] Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents(https://arxiv.org/abs/2506.21252)
Keywords: large language model
Abstract: As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to limitations in a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriately difficulty and high-quality. We carefully sample from 10 diverse models, difficulty control to maintain task challenges, and manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at github.

Title: DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

Authors: Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21263
Pdf URL: https://arxiv.org/pdf/2506.21263
Copy Paste: [[2506.21263]] DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster(https://arxiv.org/abs/2506.21263)
Keywords: large language model
Abstract: The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.

Title: Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Authors: Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, Ming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21270
Pdf URL: https://arxiv.org/pdf/2506.21270
Copy Paste: [[2506.21270]] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter(https://arxiv.org/abs/2506.21270)
Keywords: diffusion, transformer
Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task, on the one hand, the output video should be in good spatial-temporal consistency, on the other hand, the details of the given garment need to be preserved well in all the frames. Naively using image-based try-on methods frame by frame can get poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, happen to coincide with a similar solution: inserting temporal attention into image-based try-on model to adapt it for video try-on task, which have shown improvements but there still exist inconsistency problems. In this paper, we propose ViTI (Video Try-on Inpainter), formulate and implement video virtual try-on as a conditional video inpainting task, which is different from previous methods. In this way, we start with a video generation problem instead of an image-based try-on problem, which from the beginning has a better spatial-temporal consistency. Specifically, at first we build a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention, and then we progressively adapt it for video garment inpainting, with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt with good spatial-temporal consistency. Finally, as other try-on methods, garment condition is added to the model to make sure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

Title: Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?

Authors: Andrea McGlinchey, Peter J Barclay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21274
Pdf URL: https://arxiv.org/pdf/2506.21274
Copy Paste: [[2506.21274]] Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?(https://arxiv.org/abs/2506.21274)
Keywords: large language model
Abstract: Large language models can produce convincing "fake text" in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless "arms race", we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models' ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify "fake text" in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness

Title: HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Authors: Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21277
Pdf URL: https://arxiv.org/pdf/2506.21277
Copy Paste: [[2506.21277]] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context(https://arxiv.org/abs/2506.21277)
Keywords: large language model
Abstract: With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

Title: Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Authors: Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21285
Pdf URL: https://arxiv.org/pdf/2506.21285
Copy Paste: [[2506.21285]] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning(https://arxiv.org/abs/2506.21285)
Keywords: large language model
Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment:, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.

Title: HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation

Authors: Diego Biagini, Nassir Navab, Azade Farshad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21287
Pdf URL: https://arxiv.org/pdf/2506.21287
Copy Paste: [[2506.21287]] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation(https://arxiv.org/abs/2506.21287)
Keywords: diffusion, segmentation
Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

Title: Small Encoders Can Rival Large Decoders in Detecting Groundedness

Authors: Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21288
Pdf URL: https://arxiv.org/pdf/2506.21288
Copy Paste: [[2506.21288]] Small Encoders Can Rival Large Decoders in Detecting Groundedness(https://arxiv.org/abs/2506.21288)
Keywords: large language model
Abstract: Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness - generating responses strictly supported by the context - is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at : this https URL

Title: Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

Authors: Bram Willemsen, Gabriel Skantze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21294
Pdf URL: https://arxiv.org/pdf/2506.21294
Copy Paste: [[2506.21294]] Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models(https://arxiv.org/abs/2506.21294)
Keywords: extraction, large language model
Abstract: In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively course-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

Title: Balancing Privacy and Utility in Correlated Data: A Study of Bayesian Differential Privacy

Authors: Martin Lange, Patricia Guerra-Balboa, Javier Parra-Arnau, Thorsten Strufe
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2506.21308
Pdf URL: https://arxiv.org/pdf/2506.21308
Copy Paste: [[2506.21308]] Balancing Privacy and Utility in Correlated Data: A Study of Bayesian Differential Privacy(https://arxiv.org/abs/2506.21308)
Keywords: privacy, protect
Abstract: Privacy risks in differentially private (DP) systems increase significantly when data is correlated, as standard DP metrics often underestimate the resulting privacy leakage, leaving sensitive information vulnerable. Given the ubiquity of dependencies in real-world databases, this oversight poses a critical challenge for privacy protections. Bayesian differential privacy (BDP) extends DP to account for these correlations, yet current BDP mechanisms indicate notable utility loss, limiting its adoption. In this work, we address whether BDP can be realistically implemented in common data structures without sacrificing utility -- a key factor for its applicability. By analyzing arbitrary and structured correlation models, including Gaussian multivariate distributions and Markov chains, we derive practical utility guarantees for BDP. Our contributions include theoretical links between DP and BDP and a novel methodology for adapting DP mechanisms to meet the BDP requirements. Through evaluations on real-world databases, we demonstrate that our novel theorems enable the design of BDP mechanisms that maintain competitive utility, paving the way for practical privacy-preserving data practices in correlated settings.

Title: Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing

Authors: Lars Möllenbrok, Behnood Rasti, Begüm Demir
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21312
Pdf URL: https://arxiv.org/pdf/2506.21312
Copy Paste: [[2506.21312]] Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing(https://arxiv.org/abs/2506.21312)
Keywords: robust
Abstract: The development of continual learning (CL) methods, which aim to learn new tasks in a sequential manner from the training data acquired continuously, has gained great attention in remote sensing (RS). The existing CL methods in RS, while learning new tasks, enhance robustness towards catastrophic forgetting. This is achieved by using a large number of labeled training samples, which is costly and not always feasible to gather in RS. To address this problem, we propose a novel continual self-supervised learning method in the context of masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two components: i) data mixup; and ii) model mixup knowledge distillation. Data mixup is associated with retaining information on previous data distributions by interpolating images from the current task with those from the previous tasks. Model mixup knowledge distillation is associated with distilling knowledge from past models and the current model simultaneously by interpolating their model weights to form a teacher for the knowledge distillation. The two components complement each other to regularize the MAE at the data and model levels to facilitate better generalization across tasks and reduce the risk of catastrophic forgetting. Experimental results show that CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art CL methods applied to MAE. Our code is publicly available at: this https URL.

Title: DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

Authors: Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21316
Pdf URL: https://arxiv.org/pdf/2506.21316
Copy Paste: [[2506.21316]] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images(https://arxiv.org/abs/2506.21316)
Keywords: robust, interpretability, large language model
Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present \drishtikon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset has been made available at this https URL.

Title: Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts

Authors: Jiajie Yang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21328
Pdf URL: https://arxiv.org/pdf/2506.21328
Copy Paste: [[2506.21328]] Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts(https://arxiv.org/abs/2506.21328)
Keywords: large language model
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a key strategy for scaling large language models (LLMs) efficiently. However, current MoE systems suffer from severe load imbalance, where only a small subset of experts is consistently activated during training and inference, leading to significant underutilization of model capacity and computational resources. In this work, we revisit expert routing through a clustering perspective and propose Latent Prototype Routing (LPR), a novel routing framework that generalizes existing approaches while promoting balanced expert utilization without compromising downstream performance. Extensive experiments across multiple open-source MoE models -- including DeepSeek-V3, Qwen3-MoE, and Mixtral -- demonstrate that LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average, improves the min-max expert load ratio from 1e-6 to 0.70, achieving near-perfect load balancing.

Title: Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

Authors: Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff, Paul Pak, Ozanan R. Meireles, Guy Rosman, Yutong Ban, Daniela Rus
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21330
Pdf URL: https://arxiv.org/pdf/2506.21330
Copy Paste: [[2506.21330]] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models(https://arxiv.org/abs/2506.21330)
Keywords: transformer
Abstract: Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.

Title: AGTCNet: A Graph-Temporal Approach for Principled Motor Imagery EEG Classification

Authors: Galvin Brice S. Lim, Brian Godwin S. Lim, Argel A. Bandala, John Anthony C. Jose, Timothy Scott C. Chu, Edwin Sybingco
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2506.21338
Pdf URL: https://arxiv.org/pdf/2506.21338
Copy Paste: [[2506.21338]] AGTCNet: A Graph-Temporal Approach for Principled Motor Imagery EEG Classification(https://arxiv.org/abs/2506.21338)
Keywords: robust
Abstract: Brain-computer interface (BCI) technology utilizing electroencephalography (EEG) marks a transformative innovation, empowering motor-impaired individuals to engage with their environment on equal footing. Despite its promising potential, developing subject-invariant and session-invariant BCI systems remains a significant challenge due to the inherent complexity and variability of neural activity across individuals and over time, compounded by EEG hardware constraints. While prior studies have sought to develop robust BCI systems, existing approaches remain ineffective in capturing the intricate spatiotemporal dependencies within multichannel EEG signals. This study addresses this gap by introducing the attentive graph-temporal convolutional network (AGTCNet), a novel graph-temporal model for motor imagery EEG (MI-EEG) classification. Specifically, AGTCNet leverages the topographic configuration of EEG electrodes as an inductive bias and integrates graph convolutional attention network (GCAT) to jointly learn expressive spatiotemporal EEG representations. The proposed model significantly outperformed existing MI-EEG classifiers, achieving state-of-the-art performance while utilizing a compact architecture, underscoring its effectiveness and practicality for BCI deployment. With a 49.87% reduction in model size, 64.65% faster inference time, and shorter input EEG signal, AGTCNet achieved a moving average accuracy of 66.82% for subject-independent classification on the BCI Competition IV Dataset 2a, which further improved to 82.88% when fine-tuned for subject-specific classification. On the EEG Motor Movement/Imagery Dataset, AGTCNet achieved moving average accuracies of 64.14% and 85.22% for 4-class and 2-class subject-independent classifications, respectively, with further improvements to 72.13% and 90.54% for subject-specific classifications.

Title: DynamicBench: Evaluating Real-Time Report Generation in Large Language Models

Authors: Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21343
Pdf URL: https://arxiv.org/pdf/2506.21343
Copy Paste: [[2506.21343]] DynamicBench: Evaluating Real-Time Report Generation in Large Language Models(https://arxiv.org/abs/2506.21343)
Keywords: large language model
Abstract: Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases. It necessitates domain-specific knowledge, ensuring accurate responses report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report generation system adept at managing dynamic information synthesis. Our experimental results confirm the efficacy of our approach, with our method achieving state-of-the-art performance, surpassing GPT4o in document-free and document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data will be made publicly available.

Title: PanSt3R: Multi-view Consistent Panoptic Segmentation

Authors: Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, Gabriela Csurka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21348
Pdf URL: https://arxiv.org/pdf/2506.21348
Copy Paste: [[2506.21348]] PanSt3R: Multi-view Consistent Panoptic Segmentation(https://arxiv.org/abs/2506.21348)
Keywords: segmentation
Abstract: Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.

Title: Generalizable Neural Electromagnetic Inverse Scattering

Authors: Yizhe Cheng, Chunxun Tian, Haoru Wang, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.21349
Pdf URL: https://arxiv.org/pdf/2506.21349
Copy Paste: [[2506.21349]] Generalizable Neural Electromagnetic Inverse Scattering(https://arxiv.org/abs/2506.21349)
Keywords: robust
Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.

Title: Lipschitz Bounds for Persistent Laplacian Eigenvalues under One-Simplex Insertions

Authors: Le Vu Anh, Mehmet Dik, Nguyen Viet Anh
Subjects: cs.LG, cs.IT, math.MG
Abstract URL: https://arxiv.org/abs/2506.21352
Pdf URL: https://arxiv.org/pdf/2506.21352
Copy Paste: [[2506.21352]] Lipschitz Bounds for Persistent Laplacian Eigenvalues under One-Simplex Insertions(https://arxiv.org/abs/2506.21352)
Keywords: robust
Abstract: Persistent Laplacians are matrix operators that track how the shape and structure of data transform across scales and are popularly adopted in biology, physics, and machine learning. Their eigenvalues are concise descriptors of geometric and topological features in a filtration. Although earlier work established global algebraic stability for these operators, the precise change in a single eigenvalue when one simplex, such as a vertex, edge, or triangle, is added has remained unknown. This is important because downstream tools, including heat-kernel signatures and spectral neural networks, depend directly on these eigenvalues. We close this gap by proving a uniform Lipschitz bound: after inserting one simplex, every up-persistent Laplacian eigenvalue can vary by at most twice the Euclidean norm of that simplex's boundary, independent of filtration scale and complex size. This result delivers the first eigenvalue-level robustness guarantee for spectral topological data analysis. It guarantees that spectral features remain stable under local updates and enables reliable error control in dynamic data settings.

Title: SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Authors: Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21355
Pdf URL: https://arxiv.org/pdf/2506.21355
Copy Paste: [[2506.21355]] SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning(https://arxiv.org/abs/2506.21355)
Keywords: large language model
Abstract: Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, example ordering exhibits a recency bias, i.e., placing the most relevant example last can lead to substantial performance improvements by up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context.

Title: ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21356
Pdf URL: https://arxiv.org/pdf/2506.21356
Copy Paste: [[2506.21356]] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models(https://arxiv.org/abs/2506.21356)
Keywords: robust
Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce \textbf{ShotBench}, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60\% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct \textbf{ShotQA}, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop \textbf{ShotVL} through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new \textbf{state-of-the-art} performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.

Title: Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models

Authors: Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21360
Pdf URL: https://arxiv.org/pdf/2506.21360
Copy Paste: [[2506.21360]] Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models(https://arxiv.org/abs/2506.21360)
Keywords: large language model
Abstract: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework's results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.

Title: GenFlow: Interactive Modular System for Image Generation

Authors: Duc-Hung Nguyen, Huu-Phuc Huynh, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21369
Pdf URL: https://arxiv.org/pdf/2506.21369
Copy Paste: [[2506.21369]] GenFlow: Interactive Modular System for Image Generation(https://arxiv.org/abs/2506.21369)
Keywords: generative
Abstract: Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.

Title: Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation

Authors: Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21384
Pdf URL: https://arxiv.org/pdf/2506.21384
Copy Paste: [[2506.21384]] Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation(https://arxiv.org/abs/2506.21384)
Keywords: robust, large language model
Abstract: Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.

Title: Early Stopping Tabular In-Context Learning

Authors: Jaris Küken, Lennart Purucker, Frank Hutter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21387
Pdf URL: https://arxiv.org/pdf/2506.21387
Copy Paste: [[2506.21387]] Early Stopping Tabular In-Context Learning(https://arxiv.org/abs/2506.21387)
Keywords: robust, transformer
Abstract: Tabular foundation models have shown strong performance across various tabular learning tasks via in-context learning, offering robust generalization without any downstream finetuning. However, their inference-time costs remain high, particularly for larger datasets. To address this, we propose early-stopping the in-context learning process. We achieve this by dynamically evaluating whether to stop in-context learning after each Transformer encoder layer. Once stopped, we decode the embedding using a pre-trained layer-wise decoder. Experiments across 34 small classification tasks size show that early stopping in-context learning accelerates inference by up to x1.3 with negligible degradation in predictive performance. To assess scalability, we further evaluate our method on five larger classification tasks, achieving speedups of up to x2.2. Our results demonstrate the potential of early exiting as an effective and practical strategy for improving the efficiency of tabular in-context learning.

Title: Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

Authors: Zhirui Gao. Renjiao Yi, Yaqiao Dai, Xuening Zhu, Wei Chen, Chenyang Zhu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21401
Pdf URL: https://arxiv.org/pdf/2506.21401
Copy Paste: [[2506.21401]] Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction(https://arxiv.org/abs/2506.21401)
Keywords: robust
Abstract: This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential ``edge point cloud reconstruction and parametric curve fitting'' pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, \textbf{CurveGaussian}, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.

Title: Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference

Authors: Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21408
Pdf URL: https://arxiv.org/pdf/2506.21408
Copy Paste: [[2506.21408]] Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference(https://arxiv.org/abs/2506.21408)
Keywords: large language model
Abstract: Despite their widespread use, large language models (LLMs) are known to hallucinate incorrect information and be poorly calibrated. This makes the uncertainty quantification of these models of critical importance, especially in high-stakes domains, such as autonomy and healthcare. Prior work has made Bayesian deep learning-based approaches to this problem more tractable by performing inference over the low-rank adaptation (LoRA) parameters of a fine-tuned model. While effective, these approaches struggle to scale to larger LLMs due to requiring further additional parameters compared to LoRA. In this work we present $\textbf{Scala}$ble $\textbf{B}$ayesian $\textbf{L}$ow-Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform Bayesian inference in an $r$-dimensional subspace, for LoRA rank $r$. By repurposing the LoRA parameters as projection matrices, we are able to map samples from this subspace into the full weight space of the LLM. This allows us to learn all the parameters of our approach using stochastic variational inference. Despite the low dimensionality of our subspace, we are able to achieve competitive performance with state-of-the-art approaches while only requiring ${\sim}1000$ additional parameters. Furthermore, it allows us to scale up to the largest Bayesian LLM to date, with four times as a many base parameters as prior work.

Title: Distributed Cross-Channel Hierarchical Aggregation for Foundation Models

Authors: Aristeidis Tsaris, Isaac Lyngaas, John Lagregren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, Xiao Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21411
Pdf URL: https://arxiv.org/pdf/2506.21411
Copy Paste: [[2506.21411]] Distributed Cross-Channel Hierarchical Aggregation for Foundation Models(https://arxiv.org/abs/2506.21411)
Keywords: transformer
Abstract: Vision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources such as varying physical groundings or data acquisition systems and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-intensive, a challenge not fully addressed by current distributed methods. In this work, we introduce the Distributed Cross-Channel Hierarchical Aggregation (D-CHAG) approach designed for datasets with a large number of channels across image modalities. Our method is compatible with any model-parallel strategy and any type of vision transformer architecture, significantly improving computational efficiency. We evaluated D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated with tensor parallelism and model sharding, our approach achieved up to a 75% reduction in memory usage and more than doubled sustained throughput on up to 1,024 AMD GPUs on the Frontier Supercomputer.

Title: XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Authors: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21416
Pdf URL: https://arxiv.org/pdf/2506.21416
Copy Paste: [[2506.21416]] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation(https://arxiv.org/abs/2506.21416)
Keywords: robust, diffusion, transformer
Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.

Title: Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Authors: Prajwal Koirala, Cody Fleming
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21427
Pdf URL: https://arxiv.org/pdf/2506.21427
Copy Paste: [[2506.21427]] Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning(https://arxiv.org/abs/2506.21427)
Keywords: diffusion, generative
Abstract: Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the \textit{Single-Step Completion Policy} (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

Title: HyperSORT: Self-Organising Robust Training with hyper-networks

Authors: Samuel Joutard, Marijn Stollenga, Marc Balle Sanchez, Mohammad Farid Azampour, Raphael Prevost
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21430
Pdf URL: https://arxiv.org/pdf/2506.21430
Copy Paste: [[2506.21430]] HyperSORT: Self-Organising Robust Training with hyper-networks(https://arxiv.org/abs/2506.21430)
Keywords: robust, segmentation
Abstract: Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact deep segmentation networks performance. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: this https URL

Title: Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

Authors: Ali Şenol, Garima Agrawal, Huan Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21443
Pdf URL: https://arxiv.org/pdf/2506.21443
Copy Paste: [[2506.21443]] Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection(https://arxiv.org/abs/2506.21443)
Keywords: attack, robust, interpretability, large language model
Abstract: Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD)\-i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk\-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)\-Enhanced LLM framework that integrates pretrained LLMs with structured, task\-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK\-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK\-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA\-based implementation achieves 98\% classification accuracy. Comparative studies against zero\-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high\-stakes NLP applications.

Title: Text2Cypher Across Languages: Evaluating Foundational Models Beyond English

Authors: Makbule Gulcin Ozsoy, William Tai
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.21445
Pdf URL: https://arxiv.org/pdf/2506.21445
Copy Paste: [[2506.21445]] Text2Cypher Across Languages: Evaluating Foundational Models Beyond English(https://arxiv.org/abs/2506.21445)
Keywords: fair, large language model
Abstract: Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.

Title: Controllable 3D Placement of Objects with Scene-Aware Diffusion Models

Authors: Mohamed Omran, Dimitris Kalatzis, Jens Petersen, Amirhossein Habibian, Auke Wiggers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21446
Pdf URL: https://arxiv.org/pdf/2506.21446
Copy Paste: [[2506.21446]] Controllable 3D Placement of Objects with Scene-Aware Diffusion Models(https://arxiv.org/abs/2506.21446)
Keywords: diffusion, generative
Abstract: Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.

Title: A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario

Authors: Cyrus Addy, Ajay Kumar Gurumadaiah, Yixiang Gao, Kwame Awuah-Offei
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21451
Pdf URL: https://arxiv.org/pdf/2506.21451
Copy Paste: [[2506.21451]] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario(https://arxiv.org/abs/2506.21451)
Keywords: robust
Abstract: Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.

Title: Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency

Authors: Kaiyu Song, Hanjiang Lai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21452
Pdf URL: https://arxiv.org/pdf/2506.21452
Copy Paste: [[2506.21452]] Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency(https://arxiv.org/abs/2506.21452)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) succeeds in condition diffusion models that use a guidance scale to balance the influence of conditional and unconditional terms. A high guidance scale is used to enhance the performance of the conditional term. However, the high guidance scale often results in oversaturation and unrealistic artifacts. In this paper, we introduce a new perspective based on low-frequency signals, identifying the accumulation of redundant information in these signals as the key factor behind oversaturation and unrealistic artifacts. Building on this insight, we propose low-frequency improved classifier-free guidance (LF-CFG) to mitigate these issues. Specifically, we introduce an adaptive threshold-based measurement to pinpoint the locations of redundant information. We determine a reasonable threshold by analyzing the change rate of low-frequency information between prior and current steps. We then apply a down-weight strategy to reduce the impact of redundant information in the low-frequency signals. Experimental results demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.

Title: Aligning Spoken Dialogue Models from User Interactions

Authors: Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21463
Pdf URL: https://arxiv.org/pdf/2506.21463
Copy Paste: [[2506.21463]] Aligning Spoken Dialogue Models from User Interactions(https://arxiv.org/abs/2506.21463)
Keywords: segmentation
Abstract: We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker this http URL create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.

Title: TopK Language Models

Authors: Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21468
Pdf URL: https://arxiv.org/pdf/2506.21468
Copy Paste: [[2506.21468]] TopK Language Models(https://arxiv.org/abs/2506.21468)
Keywords: robust, interpretability, transformer
Abstract: Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAEs features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.

Title: Logios : An open source Greek Polytonic Optical Character Recognition system

Authors: Perifanos Konstantinos, Goutsos Dionisis
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21474
Pdf URL: https://arxiv.org/pdf/2506.21474
Copy Paste: [[2506.21474]] Logios : An open source Greek Polytonic Optical Character Recognition system(https://arxiv.org/abs/2506.21474)
Keywords: extraction
Abstract: In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.

Title: Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection

Authors: Tobias J. Riedlinger, Kira Maag, Hanno Gottschalk
Subjects: cs.CV, cs.LG, math.PR
Abstract URL: https://arxiv.org/abs/2506.21486
Pdf URL: https://arxiv.org/pdf/2506.21486
Copy Paste: [[2506.21486]] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection(https://arxiv.org/abs/2506.21486)
Keywords: segmentation
Abstract: Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.

Title: Bridging Offline and Online Reinforcement Learning for LLMs

Authors: Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21495
Pdf URL: https://arxiv.org/pdf/2506.21495
Copy Paste: [[2506.21495]] Bridging Offline and Online Reinforcement Learning for LLMs(https://arxiv.org/abs/2506.21495)
Keywords: large language model
Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.

Title: Potemkin Understanding in Large Language Models

Authors: Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21521
Pdf URL: https://arxiv.org/pdf/2506.21521
Copy Paste: [[2506.21521]] Potemkin Understanding in Large Language Models(https://arxiv.org/abs/2506.21521)
Keywords: large language model
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.

Title: "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

Authors: Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.21532
Pdf URL: https://arxiv.org/pdf/2506.21532
Copy Paste: [[2506.21532]] "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets(https://arxiv.org/abs/2506.21532)
Keywords: large language model
Abstract: People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: this https URL

Title: Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval

Authors: Hani Alomari, Anushka Sivakumar, Andrew Zhang, Chris Thomas
Subjects: cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21538
Pdf URL: https://arxiv.org/pdf/2506.21538
Copy Paste: [[2506.21538]] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval(https://arxiv.org/abs/2506.21538)
Keywords: robust
Abstract: Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limits their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets which preserve semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.

Title: DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion

Authors: Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21544
Pdf URL: https://arxiv.org/pdf/2506.21544
Copy Paste: [[2506.21544]] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion(https://arxiv.org/abs/2506.21544)
Keywords: diffusion
Abstract: Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at this https URL.

Title: HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

Authors: Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21546
Pdf URL: https://arxiv.org/pdf/2506.21546
Copy Paste: [[2506.21546]] HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation(https://arxiv.org/abs/2506.21546)
Keywords: segmentation
Abstract: Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

Title: SAM4D: Segment Anything in Camera and LiDAR Streams

Authors: Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21547
Pdf URL: https://arxiv.org/pdf/2506.21547
Copy Paste: [[2506.21547]] SAM4D: Segment Anything in Camera and LiDAR Streams(https://arxiv.org/abs/2506.21547)
Keywords: robust, segmentation
Abstract: We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.

Title: SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

Authors: Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21549
Pdf URL: https://arxiv.org/pdf/2506.21549
Copy Paste: [[2506.21549]] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark(https://arxiv.org/abs/2506.21549)
Keywords: segmentation
Abstract: We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.

Title: mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Authors: Xiaona Zhou, Constantin Brif, Ismini Lourentzou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21550
Pdf URL: https://arxiv.org/pdf/2506.21550
Copy Paste: [[2506.21550]] mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale(https://arxiv.org/abs/2506.21550)
Keywords: security, robust, large language model
Abstract: Multivariate time series anomaly detection (MTS-AD) is critical in domains like healthcare, cybersecurity, and industrial monitoring, yet remains challenging due to complex inter-variable dependencies, temporal dynamics, and sparse anomaly labels. We introduce mTSBench, the largest benchmark to date for MTS-AD and unsupervised model selection, spanning 344 labeled time series across 19 datasets and 12 diverse application domains. mTSBench evaluates 24 anomaly detection methods, including large language model (LLM)-based detectors for multivariate time series, and systematically benchmarks unsupervised model selection techniques under standardized conditions. Consistent with prior findings, our results confirm that no single detector excels across datasets, underscoring the importance of model selection. However, even state-of-the-art selection methods remain far from optimal, revealing critical gaps. mTSBench provides a unified evaluation suite to enable rigorous, reproducible comparisons and catalyze future advances in adaptive anomaly detection and robust model selection.

Title: Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Authors: Ziyue Li, Chenrui Fan, Tianyi Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21551
Pdf URL: https://arxiv.org/pdf/2506.21551
Copy Paste: [[2506.21551]] Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test(https://arxiv.org/abs/2506.21551)
Keywords: large language model
Abstract: Grokking, i.e., test performance keeps improving long after training loss converged, has been recently witnessed in neural network training, making the mechanism of generalization and other emerging capabilities such as reasoning mysterious. While prior studies usually train small models on a few toy or highly-specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval tasks. Our study, for the first time, verifies that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating LLM internal dynamics. Specifically, we find that training samples' pathways (i.e., expert choices across layers) evolve from random, instance-specific to more structured and shareable between samples during grokking. Also, the complexity of a sample's pathway reduces despite the converged loss. These indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization. In the study, we develop two novel metrics to quantify pathway distance and the complexity of a single pathway. We show their ability to predict the generalization improvement on diverse downstream tasks. They are efficient, simple to compute and solely dependent on training data. Hence, they have practical value for pretraining, enabling us to monitor the generalization performance without finetuning and test. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.

Title: Whole-Body Conditioned Egocentric Video Prediction

Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21552
Pdf URL: https://arxiv.org/pdf/2506.21552
Copy Paste: [[2506.21552]] Whole-Body Conditioned Egocentric Video Prediction(https://arxiv.org/abs/2506.21552)
Keywords: diffusion, transformer
Abstract: We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.